DUE: Tuesday 2/25 11:59pm
HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290
and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt
listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.
To get ready for this lab, fire up your virtual machine and enter the following command:
/opt/datacourse/sync.sh
This command will download the files which we are going to use for this lab.
Next, create a working directory (lab07
under your home directory) and get ready for this lab:
cp -pr /opt/datacourse/assignments/lab07-template/ ~/lab07/
cd ~/lab07/
./prepare.py
The last command will download the 20-newsgroup dataset. Give it some time to finish. This dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether messages were posted before or after a specific date.
We use the textual contents of the messages to predict which newsgroups the messages were posted to. For the purpose of this exercise, we limit ourselves to six newsgroups on the following subjects:
We have many options for classification. Your job is to explore these options and find the best one for our task at hand. Our measure of classifier performance is the F-measure, or the harmonic mean of precision and recall. The higher the F-measure, the better. See Lecture #4 if you need a refresher.
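As a quick refresher, the F-measure can be computed from precision and recall with a few lines of Python; this is a minimal sketch, not the lab's scoring code:

```python
# F-measure: the harmonic mean of precision and recall.
# A classifier scores well only when BOTH quantities are high.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is dominated by the smaller value:
print(f_measure(0.5, 0.25))  # 0.333..., well below the arithmetic mean of 0.375
```

This is why a classifier cannot game the F-measure by maximizing only one of precision or recall.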
You can run the classification algorithm using the following command:
./classify.py
You will see our default 1NN classifier. The f1-score (F-measure) is just 0.332. Not very impressive, is it?
Luckily, you get to explore a lot of possible ways to improve this result. You do so by running classify.py with different command-line options, and by modifying another file, lab7.py, to specify your feature extractor and classifier. (You won't need to modify classify.py or prepare.py.) Here is a summary of what you can try:
- In get_classifier() in lab7.py, you can change the classifier; we have some sample code and explanation there already. Feel free to try other classifiers on your own!
- Also in get_classifier() in lab7.py, we have an example of automatically tuning SVM hyperparameters using "grid search". Study the example code, so you can use grid search to systematically explore parameter tuning for the other two classifiers as well.
- In get_vectorizer() in lab7.py, you can extract as features either TFIDF scores or counts of terms.
- classify.py has a number of command-line options that you might find useful. You can use them in combination.
  - --report will print a detailed classification report, with the breakdown of precision and recall for each class.
  - --select=N will select only the top N features using the \(\chi^2\)-test.
  - --scale will scale the features, which may be useful to some classifiers, such as SVM with the RBF kernel. Scaling a big sparse matrix might make it dense and overflow your memory, so use it with caution or in combination with --select.
  - --top10 prints the ten most discriminative terms per class. This option only makes sense for certain classifiers, such as Naive Bayes.
GETTING CHECKED OFF: Once you get your F-measure above 0.65, raise your hands to get your result checked off by the course staff.
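To make these options concrete, here is a rough sketch of what get_vectorizer() and get_classifier() might return, assuming the lab uses scikit-learn. The function signatures, classifier choices, and parameter values below are illustrative guesses, not the actual contents of lab7.py and not tuned settings:

```python
# Sketch only: lab7.py's real sample code and signatures may differ.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def get_vectorizer(use_tfidf=True):
    # Features are either TFIDF scores or raw term counts.
    if use_tfidf:
        return TfidfVectorizer(stop_words="english")
    return CountVectorizer(stop_words="english")

def get_classifier(name="nb"):
    if name == "nb":
        # Naive Bayes: fast to train, and pairs well with --top10.
        return MultinomialNB()
    # "Grid search": try every combination in param_grid under
    # cross-validation and keep the best-scoring estimator.
    param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
    return GridSearchCV(LinearSVC(), param_grid, cv=5)

# --select=N corresponds to chi-squared feature selection like this:
def get_selector(n_features=1000):
    return SelectKBest(chi2, k=n_features)
```

The design point worth noting: grid search wraps any estimator, so once you understand the SVM example in lab7.py you can apply the same pattern to tune the other classifiers' hyperparameters.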
EXTRA CREDIT: If you manage to get your F-measure above 0.79, you will receive extra credit worth 10% of a homework. The teams with the highest F-measures will receive extra credit worth 20% of a homework.