Lab #1: Data Wrangling and Exploration with OpenRefine

DUE: Tuesday 1/14 11:59pm

HOW TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge, submit the required files for challenge problems (see WHAT TO SUBMIT below).

1. Using OpenRefine to Clean up and Explore a Congressional Member Listing

In this part of the lab, we will continue with OpenRefine to do some wrangling and exploration of a congressional member listing from govtrack.us. We will start by reviewing Homework #1, Part 3, and then briefly explore how facets can be used for some simple analysis.

After identifying the problem of incomplete data with https://www.govtrack.us/api/v2/person, let's get the real list of current legistrators. Start a new OpenRefine project. Use the API call https://www.govtrack.us/api/v2/role?current=true&limit=599 (by the way, you can view the result of this API call in a browser). This call will return a list of "role" (i.e., person-term) objects. For now, we are just interested in the person subobject; tell OpenRefine to extract the contents of these objects.

Then, as we have done before, based on column "person - name", add a new column "party" using the following expression:

value.match(/.*\[([A-Za-z]).*/)[0]

Now, we are ready for some action. Our first task is study the gender balance (or rather, lack thereof) in the Congress across parties. Conceptually, we want to have the count of legislators by gender and party, like the following table:

Male Female
Democrats ? ?
Independents ? ?
Republicans ? ?

To this end, we will create two facets using "party" and "person - gender" (using the drop-down menu on these columns). Since we want to group rows by the text values of these two columns, we will select "Text facet" for both. OpenRefine will automatically display the total number of rows associated with each facet value in these boxes on the left:

To get the total number of female Democrats, highlight value "female" in the "person - gender" facet box and choose "include"; also highlight value "D" in the "party" facet box and choose "include". Now, you should see, on top of the data area, that there are 79 female Democrats in the Congress (out of a total of 542):

You can also combine multiple columns together into one facet; this method will allow you to compute the counts by party and gender directly. Delete the "person - gender" facet, and click "change" on the "party" facet to edit the facet expression to the following:

cells["person - gender"].value + ',' + value

Here, the second value still refers to party, but cells["person - gender"].value lets you use the value of the "person - gender" column in computing the facet value used for tallying. Once you make this change, you will see the more detailed results in the facet box:

You don't have to stop here. Now, transform the column "person - birthday" to date, and then add it as a timeline facet (in addition to the gender-party facet you created above). Then, when you click through different gender-party values, you see how the distribution of birthdays of the selected subgroup of legislators compares with the overall distribution:

ON YOU OWN: Now that you learned the trick of multi-column facets, try the following problem: count the number of Democrats and Republicans with and without Twitter accounts, and look at how they vary by their birthdays. Are Democratic legislators more likely to have Twitter accounts? Are older legisaltors less likely to have Twitter accounts?

2. [Challenge] Finding The Longest-Serving Current Members of the Congress

As we saw in Homework #1, Part 3, data returned by https://www.govtrack.us/api/v2/person is incomplete. It only samples a subset of the members, which are not necessarily the current ones. Also, it just lists the last term served by an individual; it doesn't list all the terms served, which may not be continuous, and may alternate between the House and the Senate.

Your ultimate goal for this challenge is to find, correctly, using the data available from govtrack.us, the 10 longest-serving members of the Congress among those who are currently serving.

(A) Your first task will be to determine which of the govtrack dataset and/or API call get you the data you need. To start, here are three possibilities:

Study the data provided by these possibilities carefully. For each possibility, does it provide enough data to answer our question? If not, what information is missing?

(B) Can you use OpenRefine alone to answer this question? If you were not able to obtain a complete solution, tell us how far you've gotten and why it is difficult. (The course staff were only able to find a very inefficient pure-OpenRefine solution.)

(C) Forget about OpenRefine. Use whatever programming tools you are comfortable with to come up with the answer. (HINT: Python has a reasonably good and straightforward YAML parser called PyYAML that you can use.)

WHAT TO SUBMIT: For (A), submit a text file named A-README.txt with your answer to the question. For (B), submit B-README.txt with your answer to the question; also, at the end of this file, append your OpenRefine JSON operation history (which can be obtained by selecting "Extract..." under the "Undo/Redo" tab and copying and pasting the JSON output). For (C), submit your code and a file C-README.txt explaining how to run your code and listing the 10 longest-serving current members of the Congress.

REWARD: If you answer at least two of (A), (B), and (C) correctly, you earn extra credit worth 10% of one homework; if you answer all three correctly, you earn 20%.