Lab #5: Text Analytics

DUE: Monday 2/16 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab.

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command downloads the files we will use in this lab.

Next, type the following commands to create a working directory for this lab. Here we use lab05 under your shared directory, but feel free to use another location.

cp -pr /opt/datacourse/assignments/lab05/ ~/shared/lab05/
cd ~/shared/lab05/

1. Deciphering Reviews

There are two files in your directory. train.csv is a CSV file containing reviews for two types of products, laptops and mobile phones. The format should be self-explanatory. You can parse this file using Python to analyze the data therein, or you can use a database and psql (see Homework #5 for an example of how to set up such a database).
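
If you go the Python route, a minimal sketch of the parsing step might look like the following. It assumes the file has a header row, and the column names "type" and "review" are hypothetical placeholders; check the first line of train.csv and adjust them. The sample rows are made up for illustration and are not taken from the real data.

```python
import csv
import io
from collections import Counter

def load_reviews(f, label_col="type", text_col="review"):
    """Read (label, text) pairs from an open CSV file that has a header row.

    The default column names are assumptions; pass the real ones from train.csv.
    """
    reader = csv.DictReader(f)
    return [(row[label_col], row[text_col]) for row in reader]

# Tiny made-up sample standing in for train.csv (not real data).
sample = io.StringIO(
    'type,review\n'
    'laptop,"Great keyboard, and the battery lasts all day."\n'
    'mobile,"Camera is sharp, but the SIM tray feels flimsy."\n'
    'laptop,"Fan gets loud under load."\n'
)

reviews = load_reviews(sample)

# How many reviews of each product type do we have?
label_counts = Counter(label for label, _ in reviews)
print(label_counts)
```

For the real file, replace the sample with open("train.csv", newline="") and pass the actual column names. The csv module handles quoted fields that contain commas, which plain string splitting would break on.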

Your job for this lab is to explore train.csv and get a sense of which "features" might indicate the type of product a review is for. Then, write a program (or SQL queries) to infer the product type for each review in test.unlabeled.csv.
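
One possible starting point, using individual words as features, is sketched below. It is a small naive Bayes-style scorer with add-one smoothing, not a required or recommended method; the function names (tokenize, train, predict) and the toy reviews are made up for illustration. Words that never appear in the training data are simply skipped.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_reviews):
    """Count words per label and reviews per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_reviews:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def predict(text, word_counts, doc_counts):
    """Return the label with the highest smoothed log-probability score."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen (label, word) pairs do not zero out.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy reviews standing in for train.csv (not real data).
toy_train = [
    ("laptop", "The keyboard and trackpad are great"),
    ("laptop", "Screen is large but the fan is loud"),
    ("mobile", "Great camera and the phone fits my pocket"),
    ("mobile", "Calls drop and the SIM card slot is loose"),
]
word_counts, doc_counts = train(toy_train)
print(predict("the keyboard is loud", word_counts, doc_counts))
print(predict("sim card slot on my phone", word_counts, doc_counts))
```

Inspecting word_counts for each label (for example with most_common) is also a quick way to see which words show up far more often in one product type than the other.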

When your team is ready, raise your hands to get your answer checked. Note that you must use an automated procedure not specifically tailored to the data in test.unlabeled.csv.
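
Since test.unlabeled.csv has no labels, one way to estimate your accuracy before getting checked is to hold out part of train.csv and score your procedure on it. The sketch below assumes your data is a list of (label, text) pairs; the helper names and the keyword_baseline rule are hypothetical placeholders, and the sample rows are made up.

```python
import random

def split_holdout(labeled, holdout_fraction=0.2, seed=0):
    """Shuffle the labeled rows and split them into (train, holdout) lists."""
    rows = list(labeled)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_holdout = max(1, int(len(rows) * holdout_fraction))
    return rows[n_holdout:], rows[:n_holdout]

def accuracy(predict_fn, labeled):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for label, text in labeled if predict_fn(text) == label)
    return correct / len(labeled)

def keyword_baseline(text):
    """Placeholder rule: guess 'mobile' if the review mentions 'phone'."""
    return "mobile" if "phone" in text.lower() else "laptop"

# Made-up rows standing in for train.csv (not real data).
data = [
    ("laptop", "Solid keyboard and a bright screen"),
    ("laptop", "Heavier than my phone charger suggests"),
    ("mobile", "This phone takes great photos"),
    ("mobile", "The phone battery drains quickly"),
    ("mobile", "Signal drops in my kitchen"),
]

train_rows, holdout_rows = split_holdout(data)
print(len(train_rows), len(holdout_rows))
print(accuracy(keyword_baseline, holdout_rows))
```

Swap keyword_baseline for your own prediction function, fit it on train_rows only, and compare accuracy on holdout_rows as you try different features.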

The team that achieves the highest accuracy within the shortest amount of time will get a little prize.