Data Mining Semester Project

On or before the project proposal due date of October 26th, you should turn in (by email) a proposal for your data mining class project. The proposal should describe the problem you are addressing, the relevance of the project to data mining, the approach you will take, possible experiments or theoretical analysis you will perform, and the expected results. A list of project ideas is provided below.

The project is a significant part of the course, and I expect you to spend a significant amount of time designing, implementing, and testing your project and preparing your final report. Therefore, I encourage you to turn in your project proposal as soon as possible and get started early.

The project will consist of a writeup detailing the problem, relevance, approaches, analyses, results, and conclusions about the advantages and disadvantages of your approach to the problem. Turn in electronic versions of all code written for the project. If your code is a modification of an existing program, be sure to clearly indicate your modifications. Include data used and sample output for all runs.

In some cases I will allow you to work on a project with one other student. This must be approved as part of the project proposal. If you are turning in a group project, describe the work done by each participant.

Project Ideas:

Apply a data mining method to a specific database. Avoid duplicate work -- pick a unique database (one that has not been reported in the literature) or a unique mining method for this type of data. Describe the steps you go through to prepare the data, mine the data, and interpret the results.
Compare approaches to a particular problem. Alternative approaches have been suggested for various problems. Implement several of the alternative approaches and rigorously compare them on databases with distinct properties. You can also create artificial databases to test the bounds of each approach. Possible comparisons include discretization methods, feature selection methods, clustering methods, and parallel data mining approaches.
Apply the Subdue knowledge discovery algorithm to a database of your choice. To do this you will need to design a graph representation for the data, run the system with a variety of parameter choices on the data, and interpret the results.
Create a parallel or distributed version of a data mining algorithm. For example, create a distributed version of Jorge's temporal discovery algorithm or generate a parallel approach to learning using Bayesian networks.
Improve data mining algorithms. A number of data mining algorithms have been introduced this semester, and each area has room for improvement. Create an improvement to one of these methods, implement it, and test it on a data mining task. Possible improvements include:
- Add an inexact match to Srikant and Agrawal's association rule discovery algorithm. The inexact match is used to find support in cases where the pattern does not appear precisely in the same form throughout the database. Determine how the amount of difference affects the level of support and confidence, and show how to calculate a threshold value. Experiment with different amounts of allowed variation and their effects on the results.
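
To make the inexact-match idea concrete, here is a minimal sketch in Python of how support might be counted when a pattern is allowed to differ from a transaction. It assumes one simple notion of "inexact": a transaction matches if it is missing at most a fixed number of the pattern's items. The function name inexact_support, the max_missing parameter, and the toy transactions are all hypothetical and chosen for illustration; they are not part of Srikant and Agrawal's algorithm, and other difference measures would be equally reasonable choices for a project.

```python
from typing import Iterable, List, Set


def inexact_support(pattern: Iterable[str],
                    transactions: List[Set[str]],
                    max_missing: int = 0) -> float:
    """Fraction of transactions that contain the pattern, allowing up to
    max_missing of the pattern's items to be absent (an assumed, simplified
    definition of an inexact match)."""
    items = frozenset(pattern)
    if not transactions:
        return 0.0
    # A transaction counts as a match when the number of pattern items
    # it lacks does not exceed the allowed amount of difference.
    hits = sum(1 for t in transactions if len(items - t) <= max_missing)
    return hits / len(transactions)


# Toy market-basket data, invented for this illustration only.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk"},
]
pattern = {"bread", "milk", "eggs"}

# Support as a function of the allowed variation.
for k in range(3):
    print(k, inexact_support(pattern, transactions, max_missing=k))
```

With max_missing set to 0 this reduces to ordinary exact support. Support can only stay the same or grow as the allowed variation increases, which is the kind of relationship the project would need to quantify when choosing a threshold.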

Here are some sample projects.