Lab 4 - Record Linkage

DUE: Tuesday 2/4 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab.

CHALLENGES: You can earn extra credit in this lab, but it must be finished by the end of class. There is no submission required from this part, but the results must be checked off by an instructor to receive credit.

Points

Extra credit can be earned in this lab by having a very high f1-score as below:

5% for F1 score > 0.8

10% for F1 score > 0.82

20% for F1 score > 0.85

0. Setup

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, fire up your virtual machine, open up a terminal, and type the following command:

/opt/datacourse/sync.sh

This command will download the files which we are going to use for this homework.

Next, type the following commands to create a working directory (hw04 under your home directory) for this homework:

cp -pr /opt/datacourse/assignments/lab04-template/ ~/lab04/
cd ~/lab04/
./setup.sh

You might need to install the extensions on the new database:

psql restaurants -c "CREATE EXTENSION fuzzystrmatch"
psql restaurants -c "CREATE EXTENSION pg_trgm"

Now we have a new database containing a listing of restaurants:

psql restaurants

Linking Duplicate Restaurants

The restaurants table has been merged improperly! It contains data from two different restaurant listing services.

restaurants=# \d+ restaurants
                      Table "public.restaurants"
 Column |          Type          | Modifiers | Storage  | Description 
--------+------------------------+-----------+----------+-------------
 id     | integer                | not null  | plain    | 
 name   | character varying(200) | not null  | extended | 
 addr   | character varying(200) | not null  | extended | 
 city   | character varying(200) | not null  | extended | 
 type   | character varying(100) |           | extended | 

Your goal in the lab is to populate another table called matches with the id's of two matching restaurant entries. Here is an example to find matching restaurants and populate the matches table. (You will find this example in test.sql in your working directory lab04)

-- delete existing matches 
DELETE FROM matches;

-- insert new matches
INSERT INTO matches(left_id, right_id)
(SELECT A.id AS left_id, B.id AS right_id
FROM restaurants A, restaurants B WHERE
-- exact match on name and city
A.name = B.name and A.city = B.city
AND A.id != B.id AND A.id <= B.id
);

-- compute F1 score
SELECT cs_f1score();

Scoring

The setup script created three functions to gauge the performance of your solution:

  1. cs_precision
  2. cs_recall
  3. cs_f1score

After you have populated the match tables, you can call these functions to compute the quality of your matching as follows:

select cs_f1score();
    cs_f1score     
-------------------
 0.545454545454545
(1 row)

You can also list the set of true positives, false positives, and false negatives by using the following queries:

True Positives

SELECT * FROM restaurants r1, restaurants r2,
(SELECT matches.left_id, matches.right_id FROM matches INNER JOIN (SELECT * FROM official_matches) A
ON (matches.left_id = A.left_id and matches.right_id = A.right_id)) m
WHERE r1.id = m.left_id and r2.id = m.right_id;

(You can find this in truepositives.sql)

False Negatives

SELECT * FROM official_matches o, restaurants r1, restaurants r2
WHERE
o.left_id = r1.id and o.right_id = r2.id
AND NOT EXISTS
(SELECT * FROM matches
WHERE o.left_id = matches.left_id
AND o.right_id = matches.right_id)
order by r1.id;

(You can find this in falsenegatives.sql)

False Positives

SELECT * FROM matches m, restaurants r1, restaurants r2
WHERE
m.left_id = r1.id and m.right_id = r2.id
AND NOT EXISTS
(SELECT * FROM official_matches
WHERE m.left_id = official_matches.left_id
AND m.right_id = official_matches.right_id);                                         

(You can find this in falsepositives.sql)

Hint for the Challenge

To get high F1-scores, you might need to use string operations that the postgres database allows. You can find them here.