Homework #2

This homework is to be completed on your own, without input, code, or assistance from other students. See me or the TA if you have questions.

1. Implement Agrawal's Apriori algorithm for finding association rules. Use a minimum support value of 6 (an itemset is frequent only if it appears in at least 6 transactions in the database) and a minimum confidence value of 0.6. Test your algorithm on the Automobile database. Each transaction in this database contains eight fields, represented as integer values: origin, cylinders, mpg, acceleration, model year, displacement, horsepower, and weight.

Print each learned rule in the following format (you may not necessarily learn this particular rule):
RULE: origin=13 and mpg=73 => acceleration=1 and displacement=8 (confidence = 1.0, support = 7)
Note that the attributes appear in the following order in the database: origin, cylinders, mpg, acceleration, model year, displacement, horsepower, weight.
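
To make the support and confidence definitions and the output format concrete, here is a minimal Python sketch. It is not an Apriori implementation: it only counts support for a given itemset, computes the confidence of a given rule, and prints it in the RULE format above. The helper names (`to_items`, `support_count`, `confidence`, `format_rule`) and the toy rows are hypothetical and are not taken from the Automobile database.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Column order as stated in the assignment note.
ATTRIBUTES = ["origin", "cylinders", "mpg", "acceleration",
              "model year", "displacement", "horsepower", "weight"]

Item = Tuple[str, int]          # (attribute name, integer value)
Itemset = FrozenSet[Item]


def to_items(row: Sequence[int]) -> Itemset:
    """Turn one database row into a set of (attribute, value) items."""
    return frozenset(zip(ATTRIBUTES, row))


def support_count(itemset: Itemset, transactions: List[Itemset]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs: Itemset, rhs: Itemset, transactions: List[Itemset]) -> float:
    """support(lhs and rhs) / support(lhs); 0.0 if lhs never occurs."""
    lhs_count = support_count(lhs, transactions)
    if lhs_count == 0:
        return 0.0
    return support_count(lhs | rhs, transactions) / lhs_count


def format_rule(lhs: Itemset, rhs: Itemset, conf: float, supp: int) -> str:
    """Render a rule in the required format, items in database column order."""
    def side(items: Itemset) -> str:
        ordered = sorted(items, key=lambda item: ATTRIBUTES.index(item[0]))
        return " and ".join(f"{name}={value}" for name, value in ordered)
    return f"RULE: {side(lhs)} => {side(rhs)} (confidence = {conf}, support = {supp})"


# Invented toy data: seven matching rows plus one that does not match.
rows = [(13, 4, 73, 1, 70 + i, 8, 5, 9) for i in range(7)]
rows.append((2, 6, 50, 3, 80, 4, 7, 9))
transactions = [to_items(r) for r in rows]

lhs = frozenset({("origin", 13), ("mpg", 73)})
rhs = frozenset({("acceleration", 1), ("displacement", 8)})

supp = support_count(lhs | rhs, transactions)
conf = confidence(lhs, rhs, transactions)
rule_text = format_rule(lhs, rhs, conf, supp)
print(rule_text)
```

In this sketch a rule passes the thresholds when `supp >= 6` and `conf >= 0.6`. Representing each item as an (attribute, value) pair keeps equal integers in different columns distinct.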

Turn in your well-documented code along with the output of a sample run.

Here are some implementations of the Apriori algorithm.

2. For this problem, you may work alone or in a group of two. You will need to download and install an evaluation copy of DBMiner on a Windows 95 or Windows NT machine. Create a cube using the "US Population" table, and answer the questions below using the cube. Because this table is too big for the evaluation version of DBMiner, you will need to remove enough entries from the END of the file to fit within the allowed limit (a maximum of 1,000 entries). The cube should contain three dimensions: 1) pop (created from "pop90" and "pop80"), 2) area (created from "area"), and 3) pop_per_sqm92 (created from "pop_per_sqm92").

What L<->R association rules are created with a support of 15% or greater and a confidence of 70% or greater?
What classification rules are generated using the default classification and noise parameters, with 90% of the database used for training and 10% for testing? Analyze pop_per_sqm92 using pop and area.
Explain the curves that are generated when prediction of population is performed using area as the sole analysis parameter.