Homework Assignment #6

1. (40 points) In this program you will implement a naive Bayes classifier (NBC) to use as a spam filter. Many modern mail programs implement Bayesian spam filtering. Examples include SpamAssassin, SpamBayes, and Bogofilter. The NBC maps a data point, described by a collection of features, onto a class value. For your spam filter, the class values are spam (undesired email) and ham (mail the user would like to receive). Each data point is a mail message, described by n Boolean (true, false) values a1 through an. Here, ai is true if the ith word in our dictionary is present in the document and false otherwise. If we have a dictionary consisting of the words {spam, ham, modern, zoo}, then the first paragraph of this problem would be described by the feature vector {a1=true, a2=true, a3=true, a4=false}.

To get you started, I created a dictionary of words, stored in the file "dictionary", that you can use for your feature vectors. Your NBC should take as input a collection of files that are labeled as ham or spam. Use the first 2/3 (approximately) of the files from each category to train the algorithm, and test your classifier on the remaining 1/3 of the files from each category.

The corpus we will use for this assignment is available at http://spamassassin.apache.org/publiccorpus/. Use the files in spam.tar.bz2 (for the spam files) and easy_ham.tar.bz2 (for the ham files) to evaluate your algorithm.

You may implement the program in C, C++, or Java, on any of the department Linux machines. Please turn in the following items in one zip file: your source code, an executable that runs on Linux, a sample output file, a report of how well your algorithm classified the test files, and a short description of how to compile and run the program.

2. (12 points) Create a decision tree to learn the target class "use manual or automatic control" of shuttle landings.
The dataset and the description of the attribute and target class values can be found at http://archive.ics.uci.edu/ml/datasets/Shuttle+Landing+Control. You may handle missing values with any of the techniques we discussed in class.
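
For problem 1, the Boolean feature vector and the classification step can be sketched roughly as follows. This is only a minimal illustration under several assumptions that the assignment does not specify: the names extractFeatures and NaiveBayes are hypothetical, words are found by lowercasing the text and splitting on non-letters, probabilities are estimated with add-one smoothing, and scores are summed in log space. Reading the "dictionary" file, unpacking the corpus, and the 2/3 to 1/3 split are left out.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Builds the feature vector: features[i] is true when dictionary[i]
// occurs in the text. Tokenization (lowercase, split on non-letters)
// is an assumption of this sketch.
std::vector<bool> extractFeatures(const std::string& text,
                                  const std::vector<std::string>& dictionary) {
    std::set<std::string> words;
    std::string current;
    for (char ch : text) {
        unsigned char uc = static_cast<unsigned char>(ch);
        if (std::isalpha(uc)) {
            current += static_cast<char>(std::tolower(uc));
        } else if (!current.empty()) {
            words.insert(current);
            current.clear();
        }
    }
    if (!current.empty()) words.insert(current);

    std::vector<bool> features(dictionary.size());
    for (std::size_t i = 0; i < dictionary.size(); ++i)
        features[i] = words.count(dictionary[i]) > 0;
    return features;
}

// Hypothetical classifier type. Class index 0 = ham, 1 = spam.
struct NaiveBayes {
    std::size_t docCount[2] = {0, 0};       // training messages per class
    std::vector<std::size_t> trueCount[2];  // per class: messages with a_i true

    explicit NaiveBayes(std::size_t n) {
        trueCount[0].assign(n, 0);
        trueCount[1].assign(n, 0);
    }

    // Records one labeled training message.
    void train(const std::vector<bool>& f, bool spam) {
        assert(f.size() == trueCount[0].size());
        int c = spam ? 1 : 0;
        ++docCount[c];
        for (std::size_t i = 0; i < f.size(); ++i)
            if (f[i]) ++trueCount[c][i];
    }

    // log P(c) + sum over i of log P(a_i | c), with add-one smoothing
    // (an assumption; other estimates are possible).
    double logScore(const std::vector<bool>& f, int c) const {
        double total = static_cast<double>(docCount[0] + docCount[1]);
        double score = std::log((docCount[c] + 1.0) / (total + 2.0));
        for (std::size_t i = 0; i < f.size(); ++i) {
            double pTrue = (trueCount[c][i] + 1.0) / (docCount[c] + 2.0);
            score += std::log(f[i] ? pTrue : 1.0 - pTrue);
        }
        return score;
    }

    // Picks the class with the larger log score.
    bool isSpam(const std::vector<bool>& f) const {
        assert(f.size() == trueCount[0].size());
        return logScore(f, 1) > logScore(f, 0);
    }
};
```

Summing logarithms instead of multiplying probabilities is a common way to avoid floating-point underflow when the dictionary is large. The smoothing keeps a word that never appeared in one class during training from forcing that class's probability to zero.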