Lab #7: Tweaking Classifiers

DUE: Tuesday 2/25 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command will download the files that we will use for this lab.

Next, create a working directory (lab07 under your home directory) and get ready for this lab:

cp -pr /opt/datacourse/assignments/lab07-template/ ~/lab07/
cd ~/lab07/
./prepare.py

The last command will download the 20-newsgroups dataset. Give it some time to finish. This dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.
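
If you are curious how this split looks programmatically, here is a minimal sketch using scikit-learn's fetch_20newsgroups. (prepare.py handles the download for the lab; this snippet is only an illustration and is not part of the lab scripts.)

from sklearn.datasets import fetch_20newsgroups

# Messages posted before the cutoff date form the training set,
# and messages posted after it form the test set.
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

print(len(train.data), len(test.data))   # roughly 11314 training and 7532 test posts
print(train.target_names)                # the 20 newsgroup names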

1. Classification Challenge

We use the textual contents of the messages to predict which newsgroups the messages were posted to. For the purpose of this exercise, we limit ourselves to six newsgroups on the following subjects:

We have many options for classification. Your job is to explore these options and find the best one for our task at hand. Our measure of classifier performance is the F-measure, or the harmonic mean of precision and recall. The higher the F-measure, the better. See Lecture #4 if you need a refresher.
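
If the definition helps, recall that for precision P and recall R the F-measure is F1 = 2PR / (P + R). A tiny sketch (the numbers are made up for illustration):

# Harmonic mean of precision and recall; both must be high for a high F1.
precision, recall = 0.50, 0.25
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))   # 0.333 -- dragged down by the low recall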

You can run the classification algorithm using the following command:

./classify.py

You will see the result of our default 1NN classifier. The f1-score (F-measure) is just 0.332. Not very impressive, is it?
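
For intuition only, here is a rough sketch of what a 1NN text classifier and a macro-averaged F-measure computation might look like in scikit-learn. It is not the lab's classify.py; the vectorizer settings and the two placeholder categories are assumptions made just for this example.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

cats = ['sci.space', 'rec.autos']           # placeholder categories, not the lab's six
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

vec = CountVectorizer()                     # plain bag-of-words features
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

clf = KNeighborsClassifier(n_neighbors=1)   # 1 nearest neighbor
clf.fit(X_train, train.target)
pred = clf.predict(X_test)
print(f1_score(test.target, pred, average='macro'))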

Luckily, you get to explore a lot of possible ways to improve this result. You do so by running classify.py with different command-line options, and by modifying another file, lab7.py, to specify your feature extractor and classifier. (You won't need to modify classify.py or prepare.py.) Here is a summary of what you can try:
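
As one purely illustrative example of the kind of change this involves (the actual interface of lab7.py may differ, and these particular choices are assumptions, not the intended answer), a different feature extractor and classifier could be wired together like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Swap bag-of-words counts for TF-IDF features and 1NN for a linear SVM.
# Both choices here are illustrative, not a prescribed solution.
model = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LinearSVC(),
)
# model.fit(train_texts, train_labels); model.predict(test_texts)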

GETTING CHECKED OFF: Once you get your F-measure above 0.65, raise your hands to get your result checked off by course staff.

EXTRA CREDIT: If you manage to get your F-measure above 0.79, you will receive extra credit worth 10% of a homework. The teams with the highest F-measures will receive extra credit worth 20% of a homework.