DUE: Monday 2/3 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290
and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
To get ready for this assignment, fire up your virtual machine, open up a terminal, and type the following command:
/opt/datacourse/sync.sh
This command will download the files which we are going to use for this homework.
Next, type the following commands to create a working directory (hw04
under your home directory) for this homework:
cp -pr /opt/datacourse/assignments/hw04-template/ ~/hw04/
cd ~/hw04/
psql congress -c "CREATE EXTENSION fuzzystrmatch"
psql congress -c "CREATE EXTENSION pg_trgm"
The last 2 command will install extensions to Postgres which will allow you to use Levenshtein distance (levenshtein(string1, string2)), and Jaccard similarity (similarity(string1, string2)) that we discussed in class.
Once the above commands finish running, you should be able to access the database by issuing the following command:
psql congress
WHAT TO SUBMIT: For each of the questions (A, B) below, submit a plain-text .sql
file (1A.sql
, 1B.sql
). Note that you can enter comments in SQL using --
(anything following --
on the same line will be ignored as comments). For each query, in the comments report:
In the working directory you will find a file senator_tweets_parsed.csv
. This file contains the list of current senators and three different twitter ids (Official, staff and campaign) extracted from the Web (and current as of Nov 6, 2013). Our goal in this homework is to match the records in this file to the persons
table in the congress
database.
You will also find a file names nicknames.csv
which contains a list of firstnames and the common nicknames associated with them (which you will need). You will also find a script file create-load-nicknames.sql
which will allow you to create a table called nicknames
and copy the data from the csv file into this table. You can create this by running
\i create-load-nicknames.sql
inside psql
.
(A) Create a table called tweets
and populate it with the data from senator_tweets_parsed.csv
. (Hint: Write a SQL file similar to create-load-nicknames.sql
.) List all the records from the tweets
table that match a record in the persons
table on firstname
AND lastname
.
NOTE: If you made a mistake in your script or copied wrong data into a table, you can always use the SQL command
DROP TABLE tablename;
inside psql
to drop the table you created. Also, do not execute the \COPY
command more than once; it will just add a new set of duplicate rows from the csv file into the existing table.
(B) Create the nicknames
table (if you have not already done it). List all the records that result from the following matching: edit distance of at most 2 on last name, AND (first names are same OR first name is a nickname variant).
WHAT TO SUBMIT: Submit a plain-text 2.sql
file. In the comments report:
Using all the information available to you about this tweets dataset, find the best matching condition that returns as many of the correct matches as possible, and returns very few erroneous matches.