Data Mining Semester Project
On or before the project proposal due date of October 26th, you should turn in
(by email) a proposal for your data mining class project. The proposal should
describe the problem you are addressing, the relevance of the project to
data mining, the approach you will take, possible experiments or theoretical
analysis you will perform, and the expected results. A list of project ideas
is provided below.
The project is a significant part of the course, and I expect you to spend
a significant amount of time designing, implementing, and testing your project
and preparing your final report. Therefore, I encourage you to turn in your
project proposal
as soon as possible and get started early.
The project will consist of a writeup detailing the problem, relevance,
approaches, analyses, results, and conclusions about the advantages and
disadvantages of your approach to the problem. Turn in electronic versions
of all code written for the project. If your code is a modification of
an existing program, be sure to clearly indicate your modifications. Include
data used and sample output for all runs.
In some cases I will allow you to work on a project with one other
student. This must be approved as part of the project proposal.
If you are turning in a group project, describe the work done by each
participant.
Project Ideas:
- Apply a data mining method to a specific database. Avoid duplicate
work -- pick a unique database (one that has not been reported in the
literature) or a unique mining method for this type of data.
Describe the steps you go through to prepare the data, mine the data, and
interpret the results.
- Compare approaches to a particular problem. Alternative approaches have
been suggested for various problems. Implement several of the alternative
approaches and rigorously compare them on databases with distinct properties.
You can also create artificial databases to test the bounds of each approach.
Some comparisons include comparing discretization methods, feature selection
methods, clustering methods, and parallel data mining approaches.
- Apply the Subdue knowledge discovery algorithm to a database of your
choice. To do this you will need to design a graph representation for the
data, run the system with a variety of parameter choices on the data, and
interpret the results.
- Create a parallel or distributed version of a data mining algorithm.
For example, create a distributed version of Jorge's temporal discovery
algorithm or generate a parallel approach to learning using Bayesian networks.
- Improve data mining algorithms. A number of data mining algorithms have
been introduced this semester, and each area has room for improvement.
Create an improvement to one of these methods, implement it, and test it
on a data mining task. Some improvements include:
- Add an inexact match to Srikant and Agrawal's association rule discovery
algorithm. The inexact match is used to find support in cases where the
pattern does not appear precisely in the same form throughout the database.
Determine how the amount of difference affects the level of support and
confidence, and show how to calculate a threshold value. Experiment with
different amounts of allowed variation and their effects on the results.
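
To make the inexact-match idea above concrete, here is a minimal sketch in
Python. It is not Srikant and Agrawal's algorithm; it only shows one possible
way to count support when a pattern is allowed to differ from a transaction.
The definition of "inexact" used here (a transaction supports an itemset if it
is missing at most max_missing of the itemset's items), the function names, and
the toy transactions are all assumptions made for illustration. A project
would need to choose and justify its own difference measure.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[str]


def inexact_support(pattern: Itemset,
                    transactions: Sequence[Itemset],
                    max_missing: int = 0) -> float:
    """Fraction of transactions that contain the pattern, allowing up to
    max_missing pattern items to be absent (assumed difference measure)."""
    if not transactions:
        return 0.0
    # pattern - t is the set of pattern items the transaction lacks.
    hits = sum(1 for t in transactions if len(pattern - t) <= max_missing)
    return hits / len(transactions)


def inexact_confidence(antecedent: Itemset,
                       consequent: Itemset,
                       transactions: Sequence[Itemset],
                       max_missing: int = 0) -> float:
    """Confidence of antecedent -> consequent under the same inexact match."""
    base = inexact_support(antecedent, transactions, max_missing)
    if base == 0:
        return 0.0
    joint = inexact_support(antecedent | consequent, transactions, max_missing)
    return joint / base


# Toy data, invented for this sketch.
TRANSACTIONS = [
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "eggs"}),
    frozenset({"milk"}),
]

if __name__ == "__main__":
    pattern = frozenset({"bread", "milk", "eggs"})
    # Sweep the allowed variation to see how support changes.
    for k in range(3):
        print(k, inexact_support(pattern, TRANSACTIONS, max_missing=k))
```

On this toy data the support of {bread, milk, eggs} rises from 0.25 with an
exact match to 0.75 when one item may be missing and to 1.0 when two may be
missing. A sweep of this kind is one way to study how the amount of allowed
difference affects support and where a threshold might be set.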
Here are some sample projects.