Details of the third programming project are available here.
Revised deadline: noon on April 22 (Wednesday).
Sample datasets and input files for the second programming project can be found at Sample Datasets and Input Files.
Details of the second programming project are available here. This project contains three
tasks. The zip file contains one subdirectory per task. Each subdirectory has a README file that describes
the inputs for that task and the outputs that you have to generate. A toy dataset is given in data.txt.
An example input file (input.txt) is provided in each subdirectory.
The project is due on April 8.
The final project report should include:
- Brief description of your implementation of the three tasks. Justify any
  significant implementation choices that you made.
- The README file per task specifies additional results that you should include in the report.
Code documentation should be provided in the same way as for the first programming project.
Details of the first programming project are available here. This project contains three tasks. Task 1 is based on the Apriori algorithm for frequent itemsets. Task
2 is based on constraint-based mining, which is described in Notes 5 and Section 5.5 of the textbook. Task 3 is an extension of your choice.
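As a rough illustration of the level-wise idea behind Task 1, the following is a minimal Python sketch of Apriori. It is not the required implementation. It assumes transactions are already loaded in memory as sets of item identifiers and that minimum support is given as a fraction; the function name `apriori` and the input representation are hypothetical, and the actual file format is the one shown in the toy dataset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support fraction} for every frequent itemset.

    Assumption: `transactions` is a list of sets and `min_support`
    is a fraction in (0, 1], e.g. 0.01 for 1%.
    """
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Pass k: count the surviving candidates against the data.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The sketch counts candidates with a plain nested loop, which is easy to read but slow on the 10K and 50K datasets; a real submission would likely need a more efficient counting structure.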
A toy dataset is available here to give you an idea of the format in which the file containing each
dataset will be provided.
A dataset with 10,000 purchases is available here. This dataset is called
the 10K dataset.
A dataset with 50,000 purchases is available here. This dataset
is called the 50K dataset.
The final project report should include:
- Brief description of your implementation of the three tasks. Justify any
  significant implementation choices that you made.
- For both the 10K and 50K datasets given, plot graphs that show the
  running time of your implementation of Task 1 for the following
  minimum support threshold values: 0.15%, 0.75%, 1%, 10%, 20%, 30%,
  40%. Explain the nature of the plot that you get.
- For both the 10K and 50K datasets given, give the frequent itemsets
  and corresponding support values for the following minimum support
  threshold values: 1% and 10%.
- Based on your implementation of Task 2, give suitable results
  to show how your implementation of constraints is able to
  improve Apriori's performance. For example, you could show
  how a constraint reduces the overall running time and number
  of passes for a particular value of support on the 50K dataset.
- Give appropriate results to illustrate your implementation of Task 3.
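One possible way to collect the measurements asked for above (running time per support threshold, and the effect of a constraint on running time and number of passes) is sketched below in Python. Everything here is an assumption for illustration: the names `mine` and `time_thresholds` are hypothetical, transactions are assumed to be in-memory sets, and the constraint is assumed to be anti-monotone (if an itemset violates it, so does every superset), which is what allows candidates to be discarded before they are counted.

```python
import time
from itertools import combinations

def mine(transactions, min_support, constraint=None):
    """Level-wise miner; returns ({itemset: count}, number of passes).

    Assumptions: `transactions` is a list of sets, `min_support` is a
    fraction, and `constraint` (optional) is an anti-monotone predicate.
    """
    ok = constraint or (lambda itemset: True)
    min_count = min_support * len(transactions)
    items = {i for t in transactions for i in t}
    candidates = {frozenset([i]) for i in items if ok(frozenset([i]))}
    result, passes, k = {}, 0, 1
    while candidates:
        passes += 1  # one scan of the data per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
        # Generate k-itemsets, dropping those that violate the constraint
        # or that have an infrequent (k-1)-subset.
        candidates = {
            a | b for a in frequent for b in frequent
            if len(a | b) == k and ok(a | b)
            and all(frozenset(s) in frequent for s in combinations(a | b, k - 1))
        }
    return result, passes

def time_thresholds(transactions, thresholds, constraint=None):
    """Return rows of (threshold, seconds, passes, number of itemsets)."""
    rows = []
    for s in thresholds:
        start = time.perf_counter()
        itemsets, passes = mine(transactions, s, constraint)
        rows.append((s, time.perf_counter() - start, passes, len(itemsets)))
    return rows
```

For the report, the thresholds would be passed as fractions, for example `[0.0015, 0.0075, 0.01, 0.1, 0.2, 0.3, 0.4]`, and the resulting rows plotted with any plotting tool. A constraint could be, say, a budget on the total of made-up item prices: `lambda s: sum(prices[i] for i in s) <= budget`, which is anti-monotone as long as prices are non-negative. Running `time_thresholds` with and without the constraint at the same support gives the comparison of running time and passes.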
The code documentation should enable a programmer to browse the
code and understand the high-level picture quickly. One way to do this
would be to write a page or so about the high-level organization of the
code in the README file. Also give enough low-level documentation in
the source code to bring out the flow, tricky details, assumptions,
etc.