Homework #1

This homework is to be completed on your own, without input, code, or assistance from other students. See me or the TA if you have questions.

1. Show the decision tree that would be learned by C4.5 assuming that it is given the five training examples for the EnjoySport target concept shown in the table below. Show the value of the information gain for each candidate attribute at each step in growing the tree. You may break ties randomly.
    Sky     AirTemp   Humidity   Wind     Water   Forecast   EnjoySport
    sunny   warm      normal     strong   warm    same       yes
    sunny   warm      high       strong   warm    same       yes
    rainy   cold      high       strong   warm    change     no
    sunny   warm      high       strong   cool    change     yes
    sunny   warm      normal     weak     warm    same       no
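
As a way to check hand calculations, the entropy and information gain computations can be sketched as follows. This is a minimal sketch, not part of the ml2.0 framework: the names ATTRIBUTES, EXAMPLES, entropy, and information_gain are illustrative, and the code computes plain information gain as the problem requests (the actual C4.5 program selects splits by gain ratio).

```python
from collections import Counter
from math import log2

# Attribute names and training examples copied from the table above.
# The last field of each row is the EnjoySport label.
ATTRIBUTES = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]
EXAMPLES = [
    ("sunny", "warm", "normal", "strong", "warm", "same",   "yes"),
    ("sunny", "warm", "high",   "strong", "warm", "same",   "yes"),
    ("rainy", "cold", "high",   "strong", "warm", "change", "no"),
    ("sunny", "warm", "high",   "strong", "cool", "change", "yes"),
    ("sunny", "warm", "normal", "weak",   "warm", "same",   "no"),
]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(examples, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [row[-1] for row in examples]
    remainder = 0.0
    for value in set(row[attr_index] for row in examples):
        subset = [row[-1] for row in examples if row[attr_index] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Gains for the root split only; repeat on each subset to grow the tree.
    for index, name in enumerate(ATTRIBUTES):
        print(f"{name:10s} gain = {information_gain(EXAMPLES, index):.3f}")
```

To evaluate candidate attributes at a lower node, call information_gain on only the rows that reach that node.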

2. Given the data used for problem 1, construct a neural net to classify examples in terms of the EnjoySport concept. Assume that there are two hidden nodes. Show the structure of the neural network with the number of input nodes and output nodes. Assign each edge a value of 0.5 and show what the output of the network would be for the first example in the dataset. For simplicity, use the step (perceptron) output function instead of the sigmoid rule and assume the threshold value is 0.
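
The forward pass described in problem 2 can be sketched as below. This is only an illustration under stated assumptions: it uses six input nodes (one per attribute), two hidden nodes, and one output node, and the 0/1 input encoding is hypothetical since the problem does not fix one. It also assumes a unit fires only when its net input is strictly greater than the threshold. The names step and forward are illustrative.

```python
WEIGHT = 0.5       # every edge weight, as specified in the problem
THRESHOLD = 0.0    # threshold of every unit, as specified in the problem
NUM_HIDDEN = 2     # two hidden nodes, as specified in the problem


def step(net_input, threshold=THRESHOLD):
    """Step (perceptron) activation: 1 if net input exceeds the threshold.

    Assumption: a net input exactly equal to the threshold yields 0.
    """
    return 1 if net_input > threshold else 0


def forward(inputs):
    """Propagate one encoded example through the 6-2-1 network."""
    # Each hidden node sees every input through an edge of weight 0.5.
    hidden = [step(sum(WEIGHT * x for x in inputs)) for _ in range(NUM_HIDDEN)]
    # The output node sees every hidden node through an edge of weight 0.5.
    return step(sum(WEIGHT * h for h in hidden))


if __name__ == "__main__":
    # Hypothetical encoding: 1 for sunny, warm, normal, strong, warm, same;
    # 0 for the other value of each attribute.
    first_example = [1, 1, 1, 1, 1, 1]
    print("network output:", forward(first_example))
```

A different encoding (for example, -1/+1 or one input node per attribute value) changes the number of input nodes and possibly the output, so state the encoding you choose.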

3. A framework for comparing machine learning algorithms is available on cse in the directory /export2/home2/faculty1/cs6362/ml2.0 (view the README file for directions on the use of this system). The code is configured to compare four different learning algorithms: a majority class learner (whichever class appears most often in the training set is the default classification), the C4.5 decision tree learner, a backpropagation neural network, and a naive Bayesian classifier. Due to copyright restrictions, only the binary for C4.5 is provided. If you want to manipulate source code, try using the DTL decision tree learner provided in the ml2.0 directory instead.

For each of the six data sets found in the ml2.0/data directory, use the ml program to run a 3-fold cross validation on the four learning algorithms. Display the results of your experiments in the format shown below.

  Domain          Majority (avg/stddev)  DTL (a/s)  BP (a/s)   NB (a/s)
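
For reference, the partitioning behind a 3-fold cross validation and the avg/stddev summary can be sketched as follows. This is a generic sketch, not the ml program's actual implementation: the names k_fold_indices and summarize are illustrative, the folds here are contiguous and unshuffled, and the use of the sample standard deviation is an assumption.

```python
import statistics


def k_fold_indices(num_examples, k=3):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [num_examples // k + (1 if i < num_examples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_examples))
        yield train, test
        start += size


def summarize(accuracies):
    """Return (average, standard deviation) over per-fold accuracies.

    Assumption: sample standard deviation; the ml program may differ.
    """
    return statistics.mean(accuracies), statistics.stdev(accuracies)


if __name__ == "__main__":
    for train, test in k_fold_indices(10, k=3):
        print("train:", train, "test:", test)
```

Each example appears in exactly one test fold, so every example is used for testing once and for training twice.
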
Next, generate a learning curve for each algorithm by training on the first 1/10 of the examples (rounded to the nearest integer) and testing on the rest, then training on the first 2/10, 3/10, ..., 9/10, always testing on the remaining examples. Generate this learning curve using only one dataset (I recommend the credit database, as it has the largest number of examples). Generating the curve for more than one dataset is optional.
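
The train/test split sizes for the learning curve can be sketched as below. This is a minimal sketch with illustrative names (learning_curve_splits is not part of the ml2.0 framework). It assumes that halves round up, and uses integer arithmetic because Python's built-in round() rounds halves to the nearest even integer.

```python
def learning_curve_splits(num_examples):
    """Return (train_size, test_size) pairs for training fractions 1/10..9/10.

    The training set is the first train_size examples; the test set is the
    remaining examples.
    """
    splits = []
    for tenths in range(1, 10):
        # Integer form of round-half-up(num_examples * tenths / 10).
        train_size = (2 * num_examples * tenths + 10) // 20
        splits.append((train_size, num_examples - train_size))
    return splits


if __name__ == "__main__":
    # 25 is an arbitrary example size, not the size of any ml2.0 dataset.
    for train_size, test_size in learning_curve_splits(25):
        print(f"train on first {train_size:3d}, test on remaining {test_size:3d}")
```

Plotting test accuracy against train_size for each algorithm gives the learning curve.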

Compare the different algorithms based on your tabulated results for both experiments. Discuss what changes might be made to the algorithms to improve performance.

Have fun!