DUE: Monday 2/22 11:59pm
HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing the members of your team who are present at the lab. To earn extra credit for the lab challenge (Part 2), you must get your solutions to all parts of the lab checked off in class; there is no need to submit anything.
To get ready for this lab, fire up your virtual machine and enter the following command:
/opt/datacourse/sync.sh
This command will download the files that we are going to use for this lab.
Next, type the following commands to create a working directory for this lab. Here we use lab06 under your shared directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/lab06/ ~/shared/lab06/
cd ~/shared/lab06/
We are going to continue with Homework #6, Part 2, where we predict the party affiliation of Representatives based on the votes they cast in 2014. The subdirectory congress/ contains a complete implementation of a Bernoulli Naive Bayes classifier. To get ready, run:
cd ~/shared/lab06/congress/
python prepare.py
The feature extraction code has been extended to read in an SQL file pick.sql, which selects a (pretty arbitrary) subset of 10 votes as features to be used for prediction (instead of using all of the hundreds of votes cast in 2014). Run the classifier, and you will notice that it still performs pretty well with just these 10 features:
python classify.py
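To make the mechanics concrete, here is a minimal from-scratch sketch of Bernoulli Naive Bayes on made-up vote data. The data, function names, and smoothing details below are purely illustrative; the lab's actual implementation in congress/ is what you should study.

```python
# Hedged sketch: Bernoulli Naive Bayes on toy "vote" features.
# Features are 1 (yea) / 0 (nay); labels are party strings. All data
# here is made up for illustration.
from collections import defaultdict
import math

def train(X, y):
    """Estimate P(party) and P(vote_j = 1 | party) with Laplace smoothing."""
    counts = defaultdict(int)          # party -> number of members
    ones = defaultdict(lambda: None)   # party -> per-feature yea counts
    for votes, party in zip(X, y):
        counts[party] += 1
        if ones[party] is None:
            ones[party] = [0] * len(votes)
        for j, v in enumerate(votes):
            ones[party][j] += v
    priors = {p: counts[p] / len(y) for p in counts}
    likelihood = {p: [(ones[p][j] + 1) / (counts[p] + 2)  # Laplace smoothing
                      for j in range(len(ones[p]))]
                  for p in counts}
    return priors, likelihood

def predict(priors, likelihood, votes):
    """Pick the party maximizing the log-posterior."""
    best, best_score = None, float("-inf")
    for p, prior in priors.items():
        score = math.log(prior)
        for j, v in enumerate(votes):
            theta = likelihood[p][j]
            score += math.log(theta if v == 1 else 1 - theta)
        if score > best_score:
            best, best_score = p, score
    return best

X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]]
y = ["R", "R", "D", "D"]
priors, likelihood = train(X, y)
print(predict(priors, likelihood, [1, 0, 1]))  # votes like the first R member
```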
Your job is to modify pick.sql to select 10 votes as features such that the classifier performs very poorly with these 10 votes. The query must return 10 vote ids, and they must be for votes cast by the House in the 2014 session. (Whenever you modify pick.sql, you just need to rerun classify.py; there is no need to rerun prepare.py.)
Raise your hands to have your answer checked by the course staff when you get the classifier to perform poorly. Explain, intuitively, why your choice leads to poor accuracy. You get extra credit worth 5% of a homework grade if your classifier's accuracy is below 70%. The team that achieves the lowest accuracy the fastest will also win a prize.
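For intuition about what makes a feature set bad: if the votes you pick are ones on which both parties voted the same way, the features carry no party signal, and no classifier can do better than the class prior. A toy illustration with made-up data:

```python
# Toy illustration (made-up data): features identical across classes carry
# no signal, so any classifier degenerates to guessing the majority class.
from collections import Counter

# Every member, regardless of party, cast the same votes on these "features".
X = [[1, 1, 0]] * 6
y = ["R", "R", "R", "R", "D", "D"]

# With indistinguishable features, the best strategy collapses to the
# prior: always guess the most common party.
majority = Counter(y).most_common(1)[0][0]
accuracy = sum(label == majority for label in y) / len(y)
print(majority, accuracy)  # always guessing "R" is right only 4 times out of 6
```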
We use the textual contents of the messages to predict which newsgroups they were posted to. For the purpose of this exercise, we limit ourselves to six newsgroups on the following subjects:
We have several options for classification. Your job is to explore these options and find the best one for our task at hand. Our measure of classifier performance is the F-measure, or the harmonic mean of precision and recall. The higher the F-measure the better.
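Concretely, for a single class the F-measure combines precision P and recall R as 2PR / (P + R); a quick sketch:

```python
# F-measure: the harmonic mean of precision and recall. It is high only
# when both are high; a large imbalance drags it down.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.5, 0.25))  # 2 * 0.125 / 0.75 = 0.333...
```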
You can run the classification algorithm using the following command:
cd ~/shared/lab06/newsgroups/
python classify.py
You will see our default 1NN classifier. The f1-score (F-measure) is just 0.332. Not very impressive, is it?
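For reference, 1NN simply assigns each test message the label of its single closest training message. A minimal sketch with made-up feature vectors and newsgroup names (the lab's version works on vectorized message text):

```python
# Minimal sketch of 1-nearest-neighbor classification on made-up vectors.
def one_nn(train_X, train_y, x):
    """Return the label of the training point closest to x (Euclidean)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: dist2(train_X[i], x))
    return train_y[best]

train_X = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_y = ["rec.autos", "rec.autos", "sci.space"]
print(one_nn(train_X, train_y, [4.0, 4.5]))  # closest to [5, 5]
```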
Luckily, you get to explore a lot of possible ways to improve this result. You do so by running classify.py with different command-line options, and by modifying another file, lab.py, to specify your feature extractor and classifier. (You won't need to modify classify.py or prepare.py.) Here is a summary of what you can try:
- Modify get_classifier() in lab.py to change the classifier; we have some sample code and explanation there already. Feel free to try other classifiers on your own if you want to!
- In get_classifier() in lab.py, we also have an example of automatically tuning SVM hyperparameters using "grid search". Study the example code, so you can use grid search to systematically explore parameter tuning for the other two classifiers as well.
- In get_vectorizer() in lab.py, you can extract as features either TFIDF scores or counts of terms.
- classify.py has a number of command-line options that you might find useful. You can use them in combination:
  - --report will print a detailed classification report, with the breakdown of precision and recall for each class.
  - --select=N will select only the top N features using the \(\chi^2\)-test.
  - --scale will scale the features, which may be useful to some classifiers, such as SVM with the RBF kernel. Scaling a big sparse matrix might make it dense and overflow your memory, so use it with caution, or in combination with --select.
  - --top10 prints the ten most discriminative terms per class. This option only makes sense for certain classifiers, such as Naive Bayes.

Once you get your F-measure above 0.65, raise your hands to get your result checked off by the course staff. If you manage to get your F-measure above 0.79, you receive extra credit worth 5% of a homework grade. The team that achieves the highest accuracy the fastest will also win a prize.
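Before using the grid-search example in get_classifier(), it may help to see what the idea is under the hood: try every combination of hyperparameter values and keep the best-scoring one. The sketch below uses a made-up scoring function purely for illustration; a real grid search would score each combination by cross-validated F-measure.

```python
# Toy sketch of grid search: exhaustively try every combination of
# hyperparameter values and keep the best. toy_score() is a stand-in for
# a cross-validation score; it is made up so the example is self-contained.
from itertools import product

def toy_score(C, gamma):
    # Fake objective that peaks at C=10, gamma=0.1.
    return 1.0 / (1 + abs(C - 10) + abs(gamma - 0.1))

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = None, float("-inf")
for C, gamma in product(grid["C"], grid["gamma"]):
    score = toy_score(C, gamma)
    if score > best_score:
        best_params, best_score = {"C": C, "gamma": gamma}, score
print(best_params)  # {'C': 10, 'gamma': 0.1}
```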