Homework #4: Record Linkage and Fuzzy Matching

DUE: Monday 2/3 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, fire up your virtual machine, open up a terminal, and type the following command:

/opt/datacourse/sync.sh

This command downloads the files we will use for this homework.

Next, type the following commands to create a working directory (hw04 under your home directory) and install the database extensions needed for this homework:

cp -pr /opt/datacourse/assignments/hw04-template/ ~/hw04/
cd ~/hw04/
psql congress -c "CREATE EXTENSION fuzzystrmatch"
psql congress -c "CREATE EXTENSION pg_trgm"

The last two commands install Postgres extensions that provide the Levenshtein distance (levenshtein(string1, string2)) and Jaccard similarity (similarity(string1, string2)) functions we discussed in class.

Once the above commands finish running, you should be able to access the database by issuing the following command:

psql congress
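
Inside psql, you can try the two new functions directly to confirm that the extensions installed correctly; levenshtein('kitten', 'sitting') should return 3, while similarity returns a score between 0 and 1 based on shared trigrams (the strings below are just examples):

SELECT levenshtein('kitten', 'sitting');  -- edit distance: 3
SELECT similarity('Bernard', 'Bernie');   -- trigram similarity, between 0 and 1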

1. Senator Twitter IDs

WHAT TO SUBMIT: For each of the questions (A, B) below, submit a plain-text .sql file (1A.sql, 1B.sql). Note that you can enter comments in SQL using -- (anything following -- on the same line will be ignored as comments). For each query, in the comments report:

  1. How many records were returned by your query?
  2. How many records do you think you matched correctly?
  3. If you did not match all rows correctly, explain why.

In the working directory you will find a file senator_tweets_parsed.csv. This file contains the list of current senators and three different Twitter IDs (official, staff, and campaign) extracted from the Web (current as of Nov 6, 2013). Our goal in this homework is to match the records in this file to the persons table in the congress database.

You will also find a file named nicknames.csv, which contains a list of first names and the common nicknames associated with them (you will need these for part B), along with a script file create-load-nicknames.sql that creates a table called nicknames and copies the data from the csv file into it. You can run the script with

\i create-load-nicknames.sql

inside psql.
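
After loading, a quick sanity check inside psql will show the table's columns (which you will need for part B) and confirm that the rows were copied:

\d nicknames
SELECT count(*) FROM nicknames;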

(A) Create a table called tweets and populate it with the data from senator_tweets_parsed.csv. (Hint: Write a SQL file similar to create-load-nicknames.sql.) List all the records from the tweets table that match a record in the persons table on firstname AND lastname.
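
If you are unsure where to start, the sketch below shows the general shape of such a create-load script. The column names and types are placeholders, not the actual header of senator_tweets_parsed.csv, so inspect the csv file first, adjust the column list accordingly, and drop the HEADER option if the file has no header row:

-- create-load-tweets.sql (sketch; column names below are assumptions)
CREATE TABLE tweets (
  firstname text,
  lastname text,
  official_twitter text,
  staff_twitter text,
  campaign_twitter text
);
\copy tweets FROM 'senator_tweets_parsed.csv' WITH (FORMAT csv, HEADER true)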

NOTE: If you make a mistake in your script or copy the wrong data into a table, you can always use the SQL command

DROP TABLE tablename;

inside psql to drop the table you created. Also, do not execute the \COPY command more than once; doing so simply appends another set of duplicate rows from the csv file to the existing table.

(B) Create the nicknames table (if you have not already done so). List all the records that result from the following matching: edit distance of at most 2 on last name, AND (the first names are the same OR one first name is a nickname variant of the other).
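
One possible shape for this matching condition is sketched below. The column names for persons and nicknames are assumptions (check them with \d persons and \d nicknames), and you may need to test the nickname relationship in both directions:

-- Sketch only: assumes persons(firstname, lastname), tweets(firstname, lastname),
-- and nicknames(firstname, nickname); verify the real column names first.
SELECT t.*, p.*
FROM tweets t
JOIN persons p
  ON levenshtein(t.lastname, p.lastname) <= 2
 AND (t.firstname = p.firstname
      OR EXISTS (SELECT 1 FROM nicknames n
                 WHERE n.firstname = p.firstname
                   AND n.nickname = t.firstname));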

2. What is the best you can do?

WHAT TO SUBMIT: Submit a plain-text 2.sql file. In the comments report:

  1. How many records were returned by your query?
  2. How many records do you think you matched correctly?
  3. If you did not match all rows correctly, explain why.

Using all the information available to you about this tweets dataset, find the best matching condition you can: one that returns as many of the correct matches as possible while returning very few erroneous matches.
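
One pattern worth experimenting with is to mix the two fuzzy predicates and tune their thresholds against the data. The sketch below is only a starting point (the 0.4 threshold and the column names are assumptions); compare its output against the equality and nickname matches from problem 1 to judge how many matches it gains or loses:

-- Sketch: tune the thresholds and predicates; column names are assumptions.
SELECT t.*, p.*
FROM tweets t
JOIN persons p
  ON similarity(t.lastname, p.lastname) > 0.4
 AND levenshtein(t.firstname, p.firstname) <= 2;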