Homework #5: Text Analysis

DUE: Sunday 2/15 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, open a VM shell and run the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw05 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/lab03/products/ ~/shared/hw05/
cd ~/shared/hw05/

We will be working with the same products dataset introduced in Lab #3, focusing on the Amazon products this time. You can use the programming language of your choice for this homework (e.g., SQL or Python). If you choose to use SQL, you can access the database using:

psql products

which should have been set up in Lab #3. If not, you can also run ./setup.sh to recreate the database.

If you plan to use another programming language, such as Python, for this homework, you can parse the file amazon.ascii.csv directly.
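As a rough starting point for the Python route, the sketch below reads product rows with the standard csv module and concatenates the text fields of each product. The column names id, title, description and manufacturer are assumptions based on the fields mentioned in this handout, and load_products is a hypothetical helper name; check the actual header of amazon.ascii.csv and adjust.

```python
import csv
import io

# Assumed column names; verify them against the header row of the real file.
TEXT_COLUMNS = ("title", "description", "manufacturer")

def load_products(fileobj):
    """Read product rows from an open CSV file into {product id: combined text}."""
    products = {}
    for row in csv.DictReader(fileobj):
        # Missing or empty fields are treated as empty strings.
        text = " ".join(row.get(col) or "" for col in TEXT_COLUMNS)
        products[row["id"]] = text
    return products

# Tiny in-memory stand-in for the real file, for illustration only.
sample = io.StringIO(
    "id,title,description,manufacturer\n"
    "p1,Photo Editor,Edit photos fast,Acme\n"
)
print(load_products(sample))
```

With the real data, you would pass an open file instead, for example open("amazon.ascii.csv", newline=""), possibly with an explicit encoding argument if the default does not work.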

1. Finding Salient Words

First, we are going to find salient words (appearing in the title, description, or manufacturer) for each product in the Amazon dataset. You will need to implement:

(A) a method for tokenizing strings to get words, and

(B) a method for computing TFIDF scores for the words.

You can implement any method you like for tokenizing strings (splitting on spaces, punctuation, etc.). We hope to use the results of this homework in Lab #5, and a good tokenizer might help you do better in the lab exercise.
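To make steps (A) and (B) concrete, here is a minimal Python sketch, not a reference solution. It assumes one common TFIDF variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); use the definition given in class if it differs. The function names tokenize and tfidf and the sample product texts are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return maximal runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(docs):
    """Map each doc id to {word: score}, where
    tf = count / document length and idf = log(N / document frequency).
    This is one common variant, assumed here for illustration."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for word_counts in counts.values():
        df.update(word_counts.keys())  # each document counts once per word
    n = len(docs)
    scores = {}
    for doc_id, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[doc_id] = {
            word: (count / total) * math.log(n / df[word])
            for word, count in word_counts.items()
        }
    return scores

# Made-up product texts, keyed by made-up ids.
docs = {"a": "Adobe Photoshop Elements", "b": "Adobe Acrobat"}
for doc_id, word_scores in tfidf(docs).items():
    print(doc_id, word_scores)
```

Note that under this variant a word appearing in every document (here "adobe") gets a score of 0, which is the intended effect: it does not help distinguish one product from another.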

HINT: If you are using SQL, you will need to use string functions, which are described in the PostgreSQL documentation. You can use regexp_split_to_table to split a string based on a regular expression. For example:

SELECT id, regexp_split_to_table(title, '\s+') AS word FROM amazon;

will split the title string of each product in the amazon table into substrings by whitespace, and output one row for each substring. (You will need to tweak this query to get better tokenization.)

WHAT TO SUBMIT: The code for tokenizing and computing TFIDF scores for words associated with each product.

2. Finding Similar Products

Now that we have quantified the saliency of words for each product, implement an algorithm that, given an Amazon product id as input, outputs the top 5 Amazon products in the database that are most similar to the input product.

For example, given product id 1931102953 as input, you might see (besides 1931102953 itself) b000bgpqos and b000ivhozk in the top 5 list.

HINT: Use the cosine similarity measure discussed in class.
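As a reminder of the standard definition, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The Python sketch below applies it to sparse vectors stored as {word: weight} dictionaries and ranks all products against a query by brute force. The names cosine and most_similar and the toy vectors are hypothetical; in your solution the weights would be the TFIDF scores from Problem 1.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def most_similar(query_id, vectors, k=5):
    """Return the k ids most similar to query_id (the query itself included)."""
    query = vectors[query_id]
    return heapq.nlargest(k, vectors, key=lambda pid: cosine(query, vectors[pid]))

# Toy vectors with made-up ids and weights.
vectors = {
    "x": {"photo": 1.0, "editor": 1.0},
    "y": {"photo": 1.0},
    "z": {"tax": 1.0},
}
print(most_similar("x", vectors, k=2))
```

The query product is left in the ranking on purpose, matching the example above where the input id appears in its own top 5 list. Comparing the query against every product is simple but linear in the size of the table, which should be acceptable for a dataset of this kind.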

WHAT TO SUBMIT: The code for computing the top 5 product ids that are most similar to an input product id.