Homework #2
This homework is to be completed on your own, without input, code, or
assistance from other students. See me or the TA if you have questions.
1. Implement Agrawal's Apriori algorithm for finding association rules.
Use a minimum support value of 6 (at least 6 instances of each candidate
itemset must appear in the database) and a minimum confidence value of 0.6.
Test your algorithm on the Automobile database.
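As a minimal sketch of how the two thresholds above might be checked for a single candidate rule (this is not a full Apriori implementation), the Python below counts support and computes confidence. It assumes each transaction is held as a frozenset of (attribute, value) pairs. The names support_count, confidence, and rule_passes are hypothetical, and the toy data is invented for illustration, not taken from the Automobile database.

```python
from typing import FrozenSet, List, Tuple

Item = Tuple[str, int]            # (attribute name, integer value)
Transaction = FrozenSet[Item]

MIN_SUPPORT = 6        # absolute count, as stated in the assignment
MIN_CONFIDENCE = 0.6   # fraction between 0 and 1


def support_count(itemset: FrozenSet[Item], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs: FrozenSet[Item], rhs: FrozenSet[Item],
               transactions: List[Transaction]) -> float:
    """support(lhs and rhs together) / support(lhs); 0.0 if lhs never occurs."""
    lhs_count = support_count(lhs, transactions)
    if lhs_count == 0:
        return 0.0
    return support_count(lhs | rhs, transactions) / lhs_count


def rule_passes(lhs: FrozenSet[Item], rhs: FrozenSet[Item],
                transactions: List[Transaction]) -> bool:
    """True if the rule lhs => rhs meets both thresholds."""
    both = support_count(lhs | rhs, transactions)
    return both >= MIN_SUPPORT and confidence(lhs, rhs, transactions) >= MIN_CONFIDENCE


# Toy data, invented for illustration only.
toy = (
    [frozenset({("origin", 1), ("cylinders", 4)})] * 6
    + [frozenset({("origin", 1), ("cylinders", 8)})] * 2
    + [frozenset({("origin", 2), ("cylinders", 4)})] * 3
)
lhs = frozenset({("origin", 1)})
rhs = frozenset({("cylinders", 4)})
print(support_count(lhs | rhs, toy), confidence(lhs, rhs, toy))  # 6 0.75
```

Here origin=1 => cylinders=4 passes (support 6, confidence 6/8), while origin=1 => cylinders=8 fails on support alone.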
Each transaction in this database contains eight fields, represented as
integer values: model year, cylinders, weight, mpg, origin, horsepower,
displacement, and acceleration.
Print each learned rule in the following format (your program will not
necessarily learn this particular rule):
RULE: origin=13 and mpg=73 => acceleration=1 and displacement=8
(confidence = 1.0, support = 7)
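One way the output format above could be produced is sketched below in Python. The function name format_rule is hypothetical, and the sketch assumes that each side of a rule is an ordered sequence of (attribute, value) pairs and that the two-line layout of the example is part of the required format.

```python
from typing import Sequence, Tuple

Item = Tuple[str, int]  # (attribute name, integer value)


def format_rule(lhs: Sequence[Item], rhs: Sequence[Item],
                confidence: float, support: int) -> str:
    """Render one rule in the two-line layout shown in the assignment."""
    def side(items: Sequence[Item]) -> str:
        # Join "attribute=value" terms with " and ".
        return " and ".join(f"{name}={value}" for name, value in items)

    return (f"RULE: {side(lhs)} => {side(rhs)}\n"
            f"(confidence = {confidence}, support = {support})")


example = format_rule(
    [("origin", 13), ("mpg", 73)],
    [("acceleration", 1), ("displacement", 8)],
    1.0,
    7,
)
print(example)
```

Confidence is printed as Python renders the float, so a value such as 2/3 would appear with many digits unless it is rounded first.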
Note that the attributes appear in the database in the following order:
origin, cylinders, mpg, acceleration, model year, displacement, horsepower,
weight.
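The column order matters when turning a row into attribute=value items. The Python sketch below shows one possible mapping. It assumes each row is a single line of eight whitespace-separated integers, which the assignment does not specify, and the names COLUMNS and parse_transaction are hypothetical.

```python
from typing import FrozenSet, Tuple

Item = Tuple[str, int]  # (attribute name, integer value)

# Column order as given in the note above.
COLUMNS = ("origin", "cylinders", "mpg", "acceleration",
           "model year", "displacement", "horsepower", "weight")


def parse_transaction(line: str) -> FrozenSet[Item]:
    """Map one row to a set of (attribute, value) items.

    Assumes whitespace-separated integers; adjust the split if the
    actual file uses a different delimiter.
    """
    values = [int(token) for token in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return frozenset(zip(COLUMNS, values))


# Invented row, for illustration only.
row = parse_transaction("13 4 73 1 70 8 5 2")
print(sorted(row))
```

Pairing each value with its attribute name keeps, for example, origin=4 and cylinders=4 distinct when itemsets are counted.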
Turn in your well-documented code with the contents of a sample run.
Here are some implementations of the Apriori
algorithm.
2. For this problem, you may work in groups of 1 or 2. You will need to
download and install an evaluation copy of DBMiner onto a Windows 95 or Windows NT
platform. Create a cube using the "US Population" table, and answer the
questions below using the cube. Because this table is too big for the
evaluation version of DBMiner, you will need to remove enough entries from the
END of the file to fit in the allowed space (a maximum of 1,000 entries). The
cube should contain three dimensions:
1) pop (created from "pop90" and "pop80"),
2) area (created from "area"), and
3) pop_per_sqm92 (created from "pop_per_sqm92").
- What L<->R association rules are created with a support of 15% or greater
and a confidence of 70% or greater?
- What classification rules are generated using the default classification and
noise parameters, with 90% of the database used for training and 10% for
testing? Analyze pop_per_sqm92 using pop and area.
- Explain the curves that are generated when prediction of population is
performed using area as the sole analysis parameter.