Homework #2: Relational Data Processing

DUE: Sunday 1/25 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, fire up your VM, get a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, we are going to get, from govtrack, YAML and JSON data files about all past and current members of the Congress as well as bills introduced and votes cast by the 113th Congress in 2013 and 2014. We are going download the files, parse them, transform them into the relational format, and load them into a database called congress:

/opt/datacourse/data/congress/raw/download.sh
/opt/datacourse/data/congress/db/setup.sh

The whole process will take quite a while. Once the above commands finish running, you should be able to access the database by issuing the following command:

psql congress

Before continuing with the rest of the homework, we recommend that you go through the simple tutorial on the course website to familiarize yourself with psql command-line interface.

Finally, type the following commands to create a working directory for this homework. Here we use hw02 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw02/ ~/shared/hw02/
cd ~/shared/hw02/

1. Querying the Congress Database

WHAT TO SUBMIT: For each of the questions (A, B, ...) below, submit a plain-text .sql file (1A.sql, 1B.sql, ...). Note that you can enter comments in SQL using -- (anything following -- on the same line will be ignored as comments). If a query does not run or returns an incorrect answer, please document it in the comments.

To get a sense of what's in the congress database, study the file /opt/datacourse/data/congress/db/create.sql (which specifies the database schema) as well as the .dat files under the load/ subdirectory (which contains the structured data files parsed from raw data to be loaded into the database---there is one for each table). These files can be opened and viewed as plain-text files (see Shell Basics for help on working with plain-text files).

Also feel free to explore the data in congress by issuing SQL queries inside psql. Write SQL queries to answer the following questions. (We recommend a two-window setup described in the PostgreSQL Tips.) Assuming that you have followed Part 0 correctly, you will find some example .sql queries in your working directory.

(A) List all persons (past and present members of the Congress) who are female and were born in or after 1980.

(B) List all North Carolina Democratic senators (past and present), together with their terms as a North Carolina senator. Output only first name, last name, birthday, as well as the start and end dates of the term. Order the results by start dates, most recent first. Here are some example output rows (the answer has more):

 first_name | last_name |  birthday  | start_date |  end_date  
------------+-----------+------------+------------+------------
 Kay        | Hagan     | 1953-05-26 | 2009-01-06 | 2015-01-03
 John       | Edwards   | 1953-06-10 | 1999-01-06 | 2004-12-09
 James      | Sanford   | 1917-08-20 | 1986-01-01 | 1992-10-09

(C) Let's see how the House Representatives voted in the 2013 session compared with 2014. Specifically, for each session (year) and for each possible vote record value (Aye, Nay, etc.), list the total number of such records. Here are some example output rows (the answer has more):

 session |     vote     | count 
---------+--------------+-------
    2013 | Colin Powell |     1
    2013 | Nay          | 34680
    2014 | Nay          | 28605

Investigate why there were vote record values that looked like names in 2013.

(D) One of the most important votes cast by the House in 2014 was the passage of a bill known as the Consolidated Appropriations Act of 2014, where the House approved (actually, concurred in the Senate amendments with an amendment) a $1.1 trillion budget to fund government activities through the end of the fiscal year. Find out which Democrats voted against it. Print out their first name, last name, state, and district.

You should be able to find this vote by looking for a votes row with category being 'passage' and question containing strings 'Concurring' and 'H R 3547'.

NOTE: In SQL, you can do simple string matching using the LIKE operator. For example, question LIKE '%H R 3547%' checks if question contains the string 'H R 3547' (% matches any sequence of zero or more characters). Alternatively, in PostgreSQL, you can do regular expression matching using the ~ operator. For example, question ~ '.*H R 3547.*' is effectively the same as question LIKE '%H R 3547%'.

HINT: There should be three Democrats in your answer.

(E) Write a SQL query using WITH to accomplish the following.

In the WITH clause, define a temporary view called person_party_votes(vote_id, person_id, party, vote), which basically augments the information in person_votes with the voter's party at the time of the vote. To ensure correct answer, you need to handle the case where the person is no longer a current member (e.g., John Kerry), as well as the case where the person's party affiliation has changed.

In the query that follows, compute the breakdown of votes by party for each vote_id. Here are some example output rows (the answer certainly has more rows):

       vote_id        |    party    |     vote     | count 
----------------------+-------------+--------------+-------
 s100-113.2013        | Democrat    | Nay          |    40
 s100-113.2013        | Democrat    | Yea          |    13
 s100-113.2013        | Independent | Nay          |     2
 s100-113.2013        | Republican  | Nay          |     1
 s100-113.2013        | Republican  | Yea          |    44

2. Detective Work on Vote Correlation

In Lecture #2, we showed how queries computing vote correlations for David Price (D-NC) with Nancy Pelosi (D-CA) revealed unexpected results. Can you explain why it seems that Price votes so infrequently with Pelosi (when compared with another Represetative from NC, Butterfield)?

For your convenience, the file price-pelosi.sql in your working directory for this homework contains the SQL query used in the class. You may edit this file to run variations of this query to help you debug.

WHAT TO SUBMIT: A plain-text file named price.txt. Give your explanation in this file, together with any SQL queries and results that support your explanation.

3. Making Claims and Checking Facts

Your job for this part is to find an interesting claim base on our congress database that may be "checked" by others in the lab on Monday. The following properties are desirable for a claim:

  1. The correctness of the claim can be verified by SQL queries over the congress database alone, without drawing from external data sources.

  2. The claim sounds interesting enough (that somebody might feel like tweeting it). (Okay, this is purely subjective, but try your best.)

  3. The claim may be technically correct, but nonetheless misleading. Others will need to come up with non-trivial couterarguments to refute what the claim implies.

The claim about Price vs. Pelosi in Part 2 of this homework meets (a) and (b) above, but doesn't meet (c) because it is technically incorrect.

One example of a claim meeting (a), (b), and to some extent (c), would be

The average age of U.S. Representatives in Alaska is a whopping 82!

True, but somewhat misleading---Alaska has only one Representative in the House.

Real-world examples of claims meeting (a), (b), and (c) can be found in the back-and-forth attacks in election campaigns. A TV ad in the 2010 elections claimed that Jim Marshall, a Democratic incumbent from Georgia "voted the same as Republican leaders 65 percent of the time." This comparison was made with Republican Leader John Boehner over the votes in 2010. If we look at the history since 2007, however, the number would have been only 56 percent, which is not very high considering the fact that even the Democratic Whip, Jim Clyburn, voted 44 percent of the time with Boehner during that period. In fact, this ad was in response to an earlier ad attacking Marshall for voting with Nancy Pelosi "almost 90 percent of the time," which, not surprisingly, also tailored the claim in ways to further its own argument. You can read more about such claims here at FactCheck.org.

Your job is to make an interesting claim from our congress database, which contains all historical members of the Congress but voting records for only 2013-2014 (the 113th Congress). If your claim is misleading at the same time, that's even better---in that case, please also provide a counterargument to your claim (similar to the counterarguments made in the two examples above).

WHAT TO SUBMIT: A plain-text file named claim.txt. Structure it in three sections: 1) The claim in English (and, if it is intended to mislead, what you are trying to imply). 2) One more more SQL queries over the congress database to establish the correctness of your claim. 3) A counterargument to your claim, together with supporting SQL queries, if your claim is intended to mislead.