
CPS 196.03: Information Management and Mining
(Spring 2009, Shivnath Babu)

Course information
Course schedule and notes
Assignments
Readings
Project
Details of the third programming project are available here. Revised deadline: noon on April 22 (Wednesday).

Sample datasets and input files for the second programming project can be found at Sample Datasets and Input Files.

Details of the second programming project are available here. This project contains three tasks. The zip file contains one subdirectory per task. Each subdirectory has a README file that describes the inputs for that task and the outputs that you have to generate. A toy dataset is given in data.txt. Example input files (input.txt) are shown in each subdirectory. The project is due on April 8.

The final project report should include:

  1. Brief description of your implementation of the three tasks. Justify any significant implementation choices that you made.
  2. The README file per task specifies additional results that you should include in the report.
Code documentation should be provided in the same way as for the first programming project.



Details of the first programming project are available here. This project contains three tasks. Task 1 is based on the Apriori algorithm for frequent itemsets. Task 2 is based on constraint-based mining which is described in Notes 5 and Section 5.5 of the textbook. Task 3 is an extension of your choice.
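
For orientation, here is a minimal Python sketch of the level-wise Apriori idea behind Task 1, with an optional pruning hook in the spirit of Task 2. It is an illustration under stated assumptions, not the required design: the function name apriori, the constraint parameter, and the in-memory list-of-lists input are all hypothetical, and the hook is only safe for anti-monotone constraints (if an itemset violates the constraint, so does every superset). The actual dataset file format and the constraints to support are defined by the project handout.

```python
from itertools import combinations


def apriori(transactions, min_support, constraint=None):
    """Return {frozenset: support fraction} for all frequent itemsets.

    min_support is a fraction in [0, 1].
    constraint (hypothetical hook) is a predicate on an itemset; it is
    assumed to be anti-monotone, so a failing itemset can be pruned
    together with all of its supersets.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    if n == 0:
        return {}
    keep = constraint if constraint is not None else (lambda s: True)

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items()
               if c / n >= min_support and keep(s)}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent,
        # and the candidate must satisfy the constraint.
        candidates = {
            c for c in candidates
            if keep(c)
            and all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Pass k: count the surviving candidates in one scan of the data.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c / n >= min_support}
        frequent.update(current)
        k += 1

    return {s: c / n for s, c in frequent.items()}
```

Calling apriori(purchases, 0.01) would correspond to a 1% minimum support threshold. The sketch favors clarity over speed: the candidate counting loop is quadratic in practice and would likely need a better data structure on the 50K dataset.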

A toy dataset is available here to give you an idea of the format in which the file containing each dataset will be provided.

A dataset with 10,000 purchases is available here. This dataset is called the 10K dataset.

A dataset with 50,000 purchases is available here. This dataset is called the 50K dataset.

The final project report should include:

  1. Brief description of your implementation of the three tasks. Justify any significant implementation choices that you made.
  2. For both the 10K and 50K datasets given, plot graphs that show the running time of your implementation of Task 1 for the following minimum support threshold values: 0.15%, 0.75%, 1%, 10%, 20%, 30%, 40%. Explain the nature of the plot that you get.
  3. For both the 10K and 50K datasets given, give the frequent itemsets and corresponding support values for the following minimum support threshold values: 1% and 10%.
  4. Based on your implementation of Task 2, give suitable results to show how your implementation of constraints is able to improve Apriori's performance. For example, you could show how a constraint reduces the overall running time and number of passes for a particular value of support on the 50K dataset.
  5. Give appropriate results to illustrate your implementation of Task 3.
The code documentation should enable a programmer to browse the code and understand the high-level picture quickly. One way to do this would be to write a page or so about the high-level organization of the code in the README file. Also give enough low-level documentation in the source code to bring out the flow, tricky details, assumptions, etc.
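
The running-time measurements requested in item 2 of the report list could be collected with a small harness like the one below. This is only a sketch: time_runs and count_frequent_items are hypothetical names, and the stand-in miner finds frequent single items only, so that the example runs on its own. In practice you would pass in your own Task 1 function, assumed here to take a list of transactions and a minimum support fraction.

```python
import time

# Minimum support thresholds from the report requirements, in percent.
THRESHOLDS_PERCENT = [0.15, 0.75, 1, 10, 20, 30, 40]


def time_runs(mine, transactions, thresholds_percent=THRESHOLDS_PERCENT):
    """Run mine(transactions, min_support) once per threshold.

    Returns a list of (threshold_percent, seconds, number_of_itemsets).
    """
    rows = []
    for pct in thresholds_percent:
        start = time.perf_counter()
        itemsets = mine(transactions, pct / 100.0)
        elapsed = time.perf_counter() - start
        rows.append((pct, elapsed, len(itemsets)))
    return rows


def count_frequent_items(transactions, min_support):
    """Stand-in miner (hypothetical): frequent single items only."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    return {item: c / n for item, c in counts.items() if c / n >= min_support}


if __name__ == "__main__":
    toy = [[1, 2], [1], [2, 3]]
    for pct, seconds, found in time_runs(count_frequent_items, toy):
        print(f"{pct}%\t{seconds:.6f}s\t{found} itemsets")
```

The resulting rows can be fed to any plotting tool. Because the thresholds span more than two orders of magnitude, a logarithmic axis for the threshold may make the plot easier to read.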