Lab #4: Introducing Classification

DUE: Monday 2/9 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge (Part 3), you must get your solutions to all parts of the lab checked off in class, and submit the required SQL file (see below).

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command downloads the files we are going to use for this lab.

(We assume that you already have the movielens database set up from Homework #4. You can use psql movielens to check.)

Next, type the following commands to create a working directory for this lab. Here we use lab04 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/lab04/ ~/shared/lab04/
cd ~/shared/lab04/

Finally, run the following program to prepare the data for analysis:

python prepare.py

1. Train-Test Runs and the Mystery of \(A\)

There are three classifiers \(A\), \(B\), and \(C\), implemented in the three Python scripts classifyA.py, classifyB.py, and classifyC.py. All three attempt to predict a user's gender from the user's occupation and the set of movies the user has reviewed. \(B\) is the Naive Bayes classifier and \(C\) is the \(k\)NN classifier (with Euclidean distance and uniform weighting). You will soon find out what \(A\) does.
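
If you are curious what such classifiers look like in code, here is a minimal sketch using scikit-learn; the actual classifyB.py and classifyC.py may be implemented differently, and the feature matrix and labels below (X_users, y_gender) are made-up placeholders rather than the real movielens features.

# A minimal sketch of classifiers like B (Naive Bayes) and C (kNN); the real
# classifyB.py / classifyC.py may differ.  X_users is a hypothetical feature
# matrix (occupation + movies reviewed, one row per user) and y_gender holds
# placeholder gender labels.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_users = rng.randint(0, 2, size=(943, 50))   # placeholder 0/1 features
y_gender = rng.choice(['M', 'F'], size=943)   # placeholder labels

# B: Naive Bayes over the binary occupation/movie features.
nb = MultinomialNB()

# C: kNN with Euclidean distance and uniform weighting (the lab's defaults).
knn = KNeighborsClassifier(n_neighbors=1, metric='euclidean', weights='uniform')

for name, clf in [('Naive Bayes (B)', nb), ('1-NN (C)', knn)]:
    clf.fit(X_users[:850], y_gender[:850])            # train on the first 850 users
    acc = clf.score(X_users[850:], y_gender[850:])    # test on the rest
    print(name, 'accuracy:', acc)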

The dataset has 943 users (rows). If you run a classifier without specifying any additional parameters, like this:

python classifyA.py

The code performs 10-fold cross-validation on the given dataset and reports the classifier's accuracy for each of the 10 runs (together with the mean and 2*standard deviation). NOTE: If you are interested in how to code cross-validation, see eval() in prepare.py; focus on the case where no sqlfile is given. It's pretty easy!
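
For reference, here is a rough sketch of 10-fold cross-validation written with scikit-learn; eval() in prepare.py may be organized differently, and X and y below are placeholder arrays standing in for the real features and gender labels.

# A rough sketch of 10-fold cross-validation; eval() in prepare.py may be
# organized differently.  X and y are hypothetical placeholder arrays.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(943, 50)                    # placeholder features, one row per user
y = rng.choice(['M', 'F'], size=943)     # placeholder gender labels

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X[train_idx], y[train_idx])                       # train on 9 folds
    accuracies.append(clf.score(X[test_idx], y[test_idx]))    # test on the held-out fold

print('per-fold accuracies:', np.round(accuracies, 3))
print('mean:', np.mean(accuracies), '2*std:', 2 * np.std(accuracies))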

(A) Which classifier seems to work best?

(B) Study classifyA.py. What exactly does \(A\) do?

When you have answers to both questions above, raise your hand and have them checked by the course staff.

2. Tweaking \(k\)NN

\(C\), the \(k\)NN classifier, has a hyperparameter \(k\) that specifies how many neighbors to examine when classifying. The default setting is \(k=1\). Your task for this part of the lab is to study how \(k\) affects classification accuracy.

You can change \(k\) by specifying it as the last input parameter to classifyC.py; e.g., for 2NN, use the command:

python classifyC.py 2

To take a closer look at \(C\)'s performance for a particular train-test split, use commands like these:

python classifyC.py random90-10-repeatable.sql 1
python classifyC.py random90-10-repeatable.sql 2

The .sql file contains a SQL query specifying the exact train-test split (more on this later). In this case, the query randomly picks 90% of the data for training and 10% for testing. The output shows not only the classifier's accuracy on the test data, but also its accuracy on the training data (recall the "rookie mistake" mentioned in lab). The other two classifiers also work with a SQL query file as input.
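
If you want to organize your observations for (A) and (B) below, the following hedged sketch shows the kind of sweep you could do: for each \(k\), record accuracy on both the training data and the test data. The data here are again made-up placeholders; in the lab itself you would simply rerun classifyC.py with different values of \(k\).

# A sketch of sweeping k and comparing training vs. test accuracy.
# X and y are hypothetical placeholders for the real movielens features/labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(943, 50)
y = rng.choice(['M', 'F'], size=943)

# A repeatable 90%/10% split, analogous in spirit to random90-10-repeatable.sql.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

for k in [1, 2, 5, 10, 50, 200, 500]:
    clf = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='uniform')
    clf.fit(X_train, y_train)
    print('k=%3d  train acc=%.3f  test acc=%.3f'
          % (k, clf.score(X_train, y_train), clf.score(X_test, y_test)))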

(A) How does \(k\) affect \(C\)'s accuracy on test data? How about accuracy on training data? Would you prefer a bigger \(k\) or a smaller \(k\) for our classification problem at hand?

(B) How does \(C\)'s performance with a large \(k\) (say \(k=500\)) compare with \(A\)? Why?

When you have answers to both questions above, raise your hand and have them checked by the course staff.

3. [Challenge] The Evil SQL Splitters and Redemption of Naive Bayes

Recall from above that all three classifiers can take a file containing a SQL query (such as random90-10-repeatable.sql) that specifies the exact train-test split. The query is supposed to return a subset of users(id) values: the corresponding users will be included in the training data; all other users will be in the test data. Read random90-10-repeatable.sql and make sure you understand what it does. You will find other examples in the .sql files in the same directory.
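
To make the mechanism concrete, here is a hypothetical sketch of how a split file might be applied; the real logic lives in eval() in prepare.py and may differ. The idea is simply that the query's result is treated as the set of training user ids, and every other user goes into the test set. The function name and arguments below are made up for illustration.

# A hypothetical sketch of applying a SQL split file; the actual code in
# prepare.py may differ.  The query returns a set of users(id) values: those
# users form the training set, and everyone else forms the test set.
import psycopg2

def split_from_sqlfile(sqlfile, all_user_ids):
    with open(sqlfile) as f:
        query = f.read()
    conn = psycopg2.connect(dbname='movielens')
    with conn, conn.cursor() as cur:
        cur.execute(query)
        train_ids = {row[0] for row in cur.fetchall()}
    conn.close()
    train = [uid for uid in all_user_ids if uid in train_ids]
    test = [uid for uid in all_user_ids if uid not in train_ids]
    return train, test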

Your task for this challenge is to find (SQL queries that generate) train-test splits that screw up our classifiers (and in the process, learn how training data affects classification performance). You should create new .sql files and tweak them until you arrive at your answer. As soon as you get one answer below, raise your hand and have it checked by the course staff.

(A) Find a train-test split where our classifiers do extremely well on the training data but horribly on the test data. The percentage of data that goes to training is up to you, but don't make it lower than 60%.

Get your answer checked by the course staff before moving to the next part.

(B) \(B\), Naive Bayes, seems to suck compared with \(C\) (and even simple-minded \(A\)). Find a train-test split where \(B\) performs significantly better than both \(A\) and \(C\) (on test data). Again, the percentage of data that goes to training is up to you, but don't make it lower than 60%.

You get extra credit worth 5% of a homework grade if, for your split, \(B\)'s accuracy on the test data is higher than 60% and more than four times that of \(A\) and \(C\).

WHAT TO SUBMIT: The .sql file you use to achieve the split for (B).