DUE: Tuesday 2/25 11:59pm
HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290
and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt
listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.
To get ready for this lab, fire up your virtual machine and enter the following command:
/opt/datacourse/sync.sh
This command will download the files which we are going to use for this lab.
Next, create a working directory (lab07
under your home directory) and get ready for this lab:
cp -pr /opt/datacourse/assignments/lab07-template/ ~/lab07/
cd ~/lab07/
./prepare.py
The last command will download the 20-newsgroup dataset. Give it some time to finish. This dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether messages were posted before or after a specific date.
We use the textual contents of the messages to predict which newsgroups the messages were posted to. For the purpose of this exercise, we limit ourselves to six newsgroups on the following subjects:
We have many options for classification. Your job is to explore these options and find the best one for our task at hand. Our measure of classifier performance is the F-measure, or the harmonic mean of precision and recall. The higher the F-measure, the better. See Lecture #4 if you need a refresher.
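As a quick refresher, the F-measure can be computed from precision and recall with a few lines of Python; this is a minimal sketch, not the lab's scoring code:

```python
# F-measure: the harmonic mean of precision and recall.
# A classifier scores well only when BOTH quantities are high.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is dominated by the smaller value:
print(f_measure(0.5, 0.25))  # 0.333..., well below the arithmetic mean of 0.375
```

This is why a classifier cannot game the F-measure by maximizing only one of precision or recall.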
You can run the classification algorithm using the following command:
./classify.py
You will see our default 1NN classifier. The f1-score (F-measure) is just 0.332. Not very impressive, is it?
Luckily, you get to explore a lot of possible ways to improve this result. You do so by running classify.py with different command-line options, and by modifying another file, lab7.py, to specify your feature extractor and classifier. (You won't need to modify classify.py or prepare.py.) Here is a summary of what you can try:
- In get_classifier() in lab7.py, you can change the classifier; we have some sample code and explanation there already. Feel free to try other classifiers on your own!
- Also in get_classifier() in lab7.py, we have an example of automatically tuning SVM hyperparameters using "grid search". Study the example code, so you can use grid search to systematically explore parameter tuning for the other two classifiers as well.
- In get_vectorizer() in lab7.py, you can extract as features either TFIDF scores or counts of terms.
- classify.py has a number of command-line options that you might find useful. You can use them in combination.
  - --report will print a detailed classification report, with the breakdown of precision and recall for each class.
  - --select=N will select only the top N features using the \(\chi^2\)-test.
  - --scale will scale the features, which may be useful to some classifiers, such as SVM with the RBF kernel. Scaling a big sparse matrix might make it dense and overflow your memory, so use it with caution or in combination with --select.
  - --top10 prints the ten most discriminative terms per class. This option only makes sense for certain classifiers, such as Naive Bayes.
GETTING CHECKED OFF: Once you get your F-measure above 0.65, raise your hands to get your result checked off by the course staff.
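To make these options concrete, here is a rough sketch of what get_vectorizer() and get_classifier() might return, assuming the lab uses scikit-learn. The function signatures, classifier choices, and parameter values below are illustrative guesses, not the actual contents of lab7.py and not tuned settings:

```python
# Sketch only: lab7.py's real sample code and signatures may differ.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def get_vectorizer(use_tfidf=True):
    # Features are either TFIDF scores or raw term counts.
    if use_tfidf:
        return TfidfVectorizer(stop_words="english")
    return CountVectorizer(stop_words="english")

def get_classifier(name="nb"):
    if name == "nb":
        # Naive Bayes: fast to train, and pairs well with --top10.
        return MultinomialNB()
    # "Grid search": try every combination in param_grid under
    # cross-validation and keep the best-scoring estimator.
    param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
    return GridSearchCV(LinearSVC(), param_grid, cv=5)

# --select=N corresponds to chi-squared feature selection like this:
def get_selector(n_features=1000):
    return SelectKBest(chi2, k=n_features)
```

The design point worth noting: grid search wraps any estimator, so once you understand the SVM example in lab7.py you can apply the same pattern to tune the other classifiers' hyperparameters.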
EXTRA CREDIT: If you manage to get your F-measure above 0.79, you will receive extra credit worth 10% of a homework. The teams with the highest F-measures will receive extra credit worth 20% of a homework.