Homework #1
This homework is to be completed on your own, without input, code, or
assistance from other students. See me or the TA if you have questions.
1. Show the decision tree that would be learned by C4.5 assuming that it is
given the five training examples for the EnjoySport target concept shown in
the table below. Show the value of the information gain for each candidate
attribute at each step in growing the tree. You may break ties randomly.
Sky     AirTemp   Humidity   Wind     Water   Forecast   EnjoySport
-------------------------------------------------------------------
sunny   warm      normal     strong   warm    same       yes
sunny   warm      high       strong   warm    same       yes
rainy   cold      high       strong   warm    change     no
sunny   warm      high       strong   cool    change     yes
sunny   warm      normal     weak     warm    same       no
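As a way to check your hand computations for problem 1, here is a minimal sketch of the information-gain calculation over the five examples above. (The problem asks for plain information gain, as in ID3; C4.5 proper normalizes this to gain ratio, but the arithmetic below is the quantity the problem requests.)

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(examples, attr, target="EnjoySport"):
    """Information gain of splitting `examples` on `attr`."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for v in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# The five training examples from the table above.
data = [
    {"Sky": "sunny", "AirTemp": "warm", "Humidity": "normal", "Wind": "strong",
     "Water": "warm", "Forecast": "same", "EnjoySport": "yes"},
    {"Sky": "sunny", "AirTemp": "warm", "Humidity": "high", "Wind": "strong",
     "Water": "warm", "Forecast": "same", "EnjoySport": "yes"},
    {"Sky": "rainy", "AirTemp": "cold", "Humidity": "high", "Wind": "strong",
     "Water": "warm", "Forecast": "change", "EnjoySport": "no"},
    {"Sky": "sunny", "AirTemp": "warm", "Humidity": "high", "Wind": "strong",
     "Water": "cool", "Forecast": "change", "EnjoySport": "yes"},
    {"Sky": "sunny", "AirTemp": "warm", "Humidity": "normal", "Wind": "weak",
     "Water": "warm", "Forecast": "same", "EnjoySport": "no"},
]

for a in ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]:
    print(f"{a}: {info_gain(data, a):.3f}")
```

With these five examples, Sky, AirTemp, and Wind tie at the top (about 0.322 bits), so the first split is one of the ties the problem lets you break randomly.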
2. Given the data used for problem 1, construct a neural net to classify
examples in terms of the EnjoySport concept. Assume that there are two hidden
nodes. Show the structure of the neural network with the number of input nodes
and output nodes. Assign each edge a weight of 0.5 and show what the output
of the network would be for the first example in the dataset. For simplicity,
use the step (perceptron) output function instead of the sigmoid rule and assume
the threshold value is 0.
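The forward pass for problem 2 can be sketched as follows. The binary encoding of the first example (sunny=1, warm=1, normal=1, strong=1, same=1, so all six inputs are 1) is an assumption made for illustration; any consistent encoding you state clearly is fine.

```python
def step(x, threshold=0.0):
    """Step (perceptron) activation: 1 if net input >= threshold, else 0."""
    return 1 if x >= threshold else 0

def forward(inputs, n_hidden=2, w=0.5, threshold=0.0):
    """Forward pass through a fully connected net with one hidden layer
    of n_hidden nodes, every edge weight equal to w, and no bias terms."""
    hidden = [step(sum(w * x for x in inputs), threshold)
              for _ in range(n_hidden)]
    return step(sum(w * h for h in hidden), threshold)

# First training example (sunny, warm, normal, strong, warm, same) under
# the assumed all-ones encoding.
x = [1, 1, 1, 1, 1, 1]
print(forward(x))  # hidden net input = 0.5*6 = 3.0 -> 1; output = 0.5*2 = 1.0 -> 1
```

Note that with identical nonnegative weights and a threshold of 0, every step unit fires for any 0/1 input, so this untrained network outputs 1 regardless of the example; the problem only asks for this forward pass, not for training.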
3. A framework for comparing machine learning algorithms is available on
cse in the directory /export2/home2/faculty1/cs6362/ml2.0 (view the README file
for directions on the use of this system). The code is configured to compare
four different learning algorithms: a majority class learner (whichever class
appears most often in the training set is the default classification), the C4.5
decision tree learner, a backpropagation neural network, and a naive Bayesian
classifier. Due to copyright restrictions, only the binary for C4.5 is
provided. If you want to manipulate source code, try using the DTL decision
tree learner provided in the ml2.0 directory instead.
For each of the six data sets found in the ml2.0/data directory, use the
ml program to run a 3-fold cross validation on the four learning algorithms.
Display the results of your experiments in the format shown below.
Domain         Majority (avg/stddev)   C4.5 (avg/stddev)   BP (avg/stddev)   NB (avg/stddev)
--------------------------------------------------------------------------------------------
credit
diabetes
golf
lymphography
soybean
vote
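The ml program performs the cross validation for you; the sketch below shows what a 3-fold run computes, so you know what the reported average and standard deviation mean. The `train_and_test` interface and the tuple representation of examples are assumptions for illustration, not the ml program's actual API.

```python
import random

def cross_validate(examples, train_and_test, k=3, seed=0):
    """k-fold cross validation: shuffle the examples, split them into k
    folds, train on k-1 folds and test on the held-out fold, and return
    the (mean, standard deviation) of the k fold accuracies."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        accuracies.append(train_and_test(train, test))
    mean = sum(accuracies) / k
    std = (sum((a - mean) ** 2 for a in accuracies) / k) ** 0.5
    return mean, std

# Majority-class learner, the simplest of the four algorithms: predict
# whichever label is most common in the training set.
def majority_learner(train, test):
    labels = [e[-1] for e in train]
    default = max(set(labels), key=labels.count)
    return sum(e[-1] == default for e in test) / len(test)
```

For example, `cross_validate(examples, majority_learner)` gives the numbers that go in the Majority column of the table above.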
Next, generate a learning curve for each algorithm by training on the first
1/10 of the examples and testing on the rest (round to the nearest integer),
then training on 2/10, 3/10, ..., 9/10, always testing on the remaining
examples. Generate this learning curve for only one dataset (I recommend the
credit dataset, as it has the largest number of examples). Generating the
curve for more than one dataset is optional.
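The training/testing splits described above can be sketched as follows; the `train_and_test` interface and the majority-class baseline are illustrative assumptions, and any of the four algorithms can be plugged in.

```python
def learning_curve(examples, train_and_test):
    """For k = 1..9, train on the first k/10 of the examples (rounding the
    split point to the nearest integer) and test on the remainder.
    Returns a list of (training set size, accuracy) points."""
    points = []
    for k in range(1, 10):
        n_train = round(len(examples) * k / 10)
        train, test = examples[:n_train], examples[n_train:]
        points.append((n_train, train_and_test(train, test)))
    return points

# Majority-class baseline; examples are assumed to be (features, label)
# tuples with the label last.
def majority_learner(train, test):
    labels = [e[-1] for e in train]
    default = max(set(labels), key=labels.count)
    return sum(e[-1] == default for e in test) / len(test)
```

Plotting accuracy against training set size for each algorithm gives the learning curves to hand in.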
Compare the different algorithms based on your tabulated results for both
experiments. Discuss what changes might be made to the algorithms to improve
performance.
Have fun!