Lab #1: Data Wrangling and Exploration

DUE: Wednesday 1/14 11:59pm

HOW TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select the correct course and lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn rewards for the lab challenge, raise your hand to get your answer checked by the course staff, and submit the required files for challenge problems (see WHAT TO SUBMIT below).

0. Getting Ready

(Before you start, make sure you have a correct VM setup. When in doubt, ask course staff.)

Get a shell in your VM, and issue the following commands:

/opt/datacourse/sync.sh
cp -r /opt/datacourse/assignments/lab01/ ~/shared/lab01/

Here, ~/shared/lab01/ is your working directory for this assignment.

1. Using OpenRefine to Clean up and Explore a Congressional Member Listing

In this part of the lab, we will continue with OpenRefine to do some wrangling and exploration of a congressional member listing from govtrack.us. We will start by reviewing Homework #1, Part 4, and then briefly explore how facets can be used for some simple analysis.

After identifying the problem of incomplete data with https://www.govtrack.us/api/v2/person, let's get the real list of current legistrators. Start a new OpenRefine project. Use the API call https://www.govtrack.us/api/v2/role?current=true&limit=599 (by the way, you can view the result of this API call in a browser). This call will return a list of "role" (i.e., person-term) objects. For now, we are just interested in the person subobject; tell OpenRefine to extract the contents of these objects.

Then, as we have done before, based on column "person - name", add a new column "party" using the following expression:

value.match(/.*\[([A-Za-z]).*/)[0]

Now, we are ready for some action. Our first task is study the gender balance (or rather, lack thereof) in the Congress across parties. Conceptually, we want to have the count of legislators by gender and party, like the following table:

Male Female
Democrats ? ?
Independents ? ?
Republicans ? ?

To this end, we will create two facets using "party" and "person - gender" (using the drop-down menu on these columns). Since we want to group rows by the text values of these two columns, we will select "Text facet" for both. OpenRefine will automatically display the total number of rows associated with each facet value in these boxes on the left:

To get the total number of female Democrats, highlight value "female" in the "person - gender" facet box and choose "include"; also highlight value "D" in the "party" facet box and choose "include". Now, you should see, on top of the data area, that there are 79 female Democrats in the Congress (out of a total of 542):

You can also combine multiple columns together into one facet; this method will allow you to compute the counts by party and gender directly. Delete the "person - gender" facet, and click "change" on the "party" facet to edit the facet expression to the following:

cells["person - gender"].value + ',' + value

Here, the second value still refers to party, but cells["person - gender"].value lets you use the value of the "person - gender" column in computing the facet value used for tallying. Once you make this change, you will see the more detailed results in the facet box:

You don't have to stop here. Now, transform the column "person - birthday" to date, and then add it as a timeline facet (in addition to the gender-party facet you created above). Then, when you click through different gender-party values, you see how the distribution of birthdays of the selected subgroup of legislators compares with the overall distribution:

ON YOUR OWN (with your team): Use OpenRefine to count the number of legistators by state. Be sure to check out those with no states. Who are they? Once you think you have the right answer, raise your hand and get it checked by the course staff.

That's it. OpenRefine is a pretty powerful tool with lots of additional features that you might find useful later on. Be sure to check out the pointer to documentation from our Help page, where we've highlighted particularly relevant portions of the documentation.

Next, let's look at another task---which gets closer to query and analysis---for which you might find OpenRefine somewhat lacking.

2. [Challenge] Finding The Longest-Serving Current Members of the Congress

As we saw in Homework #1, Part 4, data returned by https://www.govtrack.us/api/v2/person is incomplete. It only samples a subset of the members, which are not necessarily the current ones. Also, it just lists the last term served by an individual; it doesn't list all the terms served, which may not be continuous, and may alternate between the House and the Senate.

Your ultimate goal for this challenge is to find using the data available from govtrack.us, the 10 longest-serving members of the Congress among those who are currently serving.

(A) Your first task will be to determine which of the govtrack dataset and/or API call get you the data you need. Here are three possible starting points (feel to explore the website for more options):

Study the data provided by these possibilities carefully. For each possibility, does it provide enough data to answer our question? If not, what information is missing?

(B) Can you use OpenRefine alone to answer this question? If you were not able to obtain a complete solution, tell us how far you've gotten and why it is difficult. (The course staff were only able to find a very inefficient pure-OpenRefine solution.)

(C) Forget about OpenRefine. Let's use Python to come up with the answer. Python has good parsers for JSON (standar json module), CSV (standard csv module), and YAML (PyYAML) that you can use. Under your working directory for this assignment (~/shared/lab01/), you will find example Python code that uses these parsers to parse raw data. For example, once you have downloaded the JSON file legislators-current.json in this directory, you can run the JSON parser to parse the data:

cd ~/shared/lab01/
python parse-json.py legislators-current.json

If you want to save the output to a file so you can look at it later in an editor/viewer, you can "redirect" the output to a file:

python parse-json.py legislators-current.json > json-out.txt
nano json-out.txt

Or you can "pipe" the output to the less pager for viewing without creating a file:

python parse-json.py legislators-current.json | less

For more information about using shell commands, please refer to Shell Basics.

WHAT TO SUBMIT: For (A), submit a text file named A-README.txt with your answer to the question. For (B), submit B-README.txt with your answer to the question; also, at the end of this file, append your OpenRefine JSON operation history (which can be obtained by selecting "Extract..." under the "Undo/Redo" tab and copying and pasting the JSON output). For (C), submit your code and a file C-README.txt explaining how to run your code and listing the 10 longest-serving current members of the Congress.

REWARD: The first three teams that comes up with the correct answer for (C) will receive a little prize.