Homework #1: VM Setup, Self-Introduction, and OpenRefine

DUE: Tuesday 1/13 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

1. VM Setup

WHAT TO SUBMIT: You don't need to submit anything for this part of the homework.

In this course, we will use a standard virtual machine (VM) running the Ubuntu operating system. In this part of the homework, you will set up this VM to run from your laptop. Follow the instructions on the Help section of the course website, specifically:

It would also be useful to go through the following:

We expect you to bring your laptop (with battery fully charged!) to the Wednesday lab, with the VM properly set up. If you run into any problems, please ask on Piazza or visit office hours.

2. Self-Introduction

WHAT TO SUBMIT: Submit a plain-text file named intro.txt.

Write a few short paragraphs about yourself that tell us a bit about your background and goals for this course. Specifically, please address the following:

As explained on the course website, we do not expect you to be an expert in all of the disciplines related to data, and you will not be "graded" on your background. Feel free to answer "no" to some of the questions above. We are doing this survey to help us tailor the course to your background and interests, and to help you form project teams and ideas.

We recommend that you try editing this file inside the VM shell (e.g., using nano). Don't forget to read the tips on file access in Accessing VM.

3. OpenRefine Tutorial

WHAT TO SUBMIT: You don't need to submit anything for this part of the homework.

OpenRefine is a powerful data wrangling tool. Data often arrives messy: either people make mistakes when entering and collecting it, or the data you get is in the wrong format for what you want to do with it. OpenRefine was built to deal with these kinds of problems and to bring data into the shape you need.

(This tutorial originally comes from http://unurl.org/cbclean. It's been cleaned up and modified a bit.)

Installing and Starting OpenRefine

Log in to your VM through the GUI desktop. Open a browser in your VM, and download the "Linux kit" of OpenRefine from http://openrefine.org.

Get a shell on the VM desktop, and issue the following commands to install OpenRefine:

    cd ~/shared/
    tar xvfz ~/Downloads/google-refine-2.5-r2407.tar.gz

To start OpenRefine, type the following commands in a VM shell on the VM desktop:

    cd ~/shared/google-refine-2.5/
    ./refine

A browser tab will open up and give you the GUI for OpenRefine.

Downloading Data

Using your browser on the VM, download the CSV from http://data.okfn.org/community/mihi-tr/bosnia-tenders. (Note that by default Firefox saves it to your ~/Downloads/ folder.)

Once you’ve downloaded the file, let’s start.

Let’s first look at the dataset on the website. Fortunately, the website gives us a nice preview of the data. The data is a record of the tenders (proposals/bids) awarded in Bosnia and Herzegovina, scraped for the period of 01/04/2013 - 10/31/2013. There are several things wrong with this dataset. The amounts always include currency symbols and are formatted for human readers, which makes them hard for computers to parse as numbers. The date fields contain additional information and are not pure dates. Company names may be spelled inconsistently throughout the document, even for the same company.
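To make the three problems above concrete, here is a minimal Python sketch of the kind of cleanup each one calls for. It is only an illustration, not part of the tutorial: the sample strings, the function names, and the assumed formats (a currency code next to a number with comma thousands separators, a DD.MM.YYYY date followed by extra text) are all hypothetical, so check the actual CSV before relying on any of them.

```python
import re
from datetime import date

def parse_amount(text):
    """Drop currency symbols and thousands separators, return a float.

    Assumes '.' is the decimal point and ',' a thousands separator.
    """
    digits = re.sub(r"[^\d.]", "", text)  # keep only digits and '.'
    return float(digits)

def parse_date(text):
    """Pull the first DD.MM.YYYY date out of a longer string (assumed format)."""
    match = re.search(r"(\d{2})\.(\d{2})\.(\d{4})", text)
    if match is None:
        return None
    day, month, year = map(int, match.groups())
    return date(year, month, day)

def normalize_company(name):
    """Collapse whitespace and case so spelling variants compare equal."""
    return " ".join(name.split()).casefold()

# Hypothetical sample values, not taken from the real dataset.
print(parse_amount("KM 12,500.00"))
print(parse_date("14.05.2013 10:30"))
print(normalize_company("  ACME   d.o.o. ") == normalize_company("Acme D.O.O."))
```

OpenRefine offers the same kinds of operations (text transforms and clustering) through its GUI, which is what the rest of this tutorial uses.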

Importing

Cleaning Up the Data

4. Using OpenRefine to Clean up a Congressional Member Listing

NOTE: We encourage you to attempt this part, but it is optional. You don't need it to get a "V" (90%) on this homework, but you will need it to get an "E" (100%). While we will go through much of this exercise in the lab on Wednesday, doing this exercise by yourself will give you an advantage when tackling the "challenge" in the lab.

WHAT TO SUBMIT: Submit a plain-text file named congress.txt. Type (or copy and paste from OpenRefine) your answer to each question below into that file. Please clearly number and delineate the answers to different questions.

govtrack.us contains a wealth of data on the U.S. Congress. You can get a JSON feed of congressional members from https://www.govtrack.us/api/v2/person. Import the feed into OpenRefine, and use the data therein (not from any other sources!) to answer the following questions:

  1. For each possible party affiliation (R for Republicans, D for Democrats, and I for independents), list the number of members with that affiliation.
  2. How many members were born after 1950?
  3. Check your answers above against sources like Wikipedia. Are your answers correct with respect to reality? What does that say about the govtrack.us data source you used?
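If you want to sanity-check the counts you get from OpenRefine, the same tallies can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the expected solution: the records below are made up, and the field names (`name` ending in a bracketed tag such as `[R-TX1]`, and `birthday` as a `YYYY-MM-DD` string) are assumptions about the feed that you should verify against what you actually imported. "Born after 1950" is read here as a birth year greater than 1950.

```python
import re
from collections import Counter

# Hypothetical records, NOT real feed data; field names are assumed.
members = [
    {"name": "Rep. Alice Example [R-TX1]", "birthday": "1948-03-02"},
    {"name": "Sen. Bob Sample [D-NY]", "birthday": "1961-11-20"},
    {"name": "Sen. Carol Placeholder [I-VT]", "birthday": "1951-01-01"},
    {"name": "Rep. Dan Former", "birthday": None},
]

def party_of(record):
    """Return the party letter from a '[X-...]' tag in the name, or None."""
    match = re.search(r"\[([A-Z])-", record["name"])
    return match.group(1) if match else None

def tally_parties(records):
    """Count records per party letter; records without a tag count under None."""
    return Counter(party_of(r) for r in records)

def born_after(records, year):
    """Count records whose birth year is strictly greater than `year`."""
    count = 0
    for r in records:
        birthday = r.get("birthday")
        if birthday and int(birthday[:4]) > year:
            count += 1
    return count

print(tally_parties(members))
print(born_after(members, 1950))
```

Note how the sketch keeps a separate bucket for records with no party tag or no birthday; whether and how often such records occur in the real feed is worth keeping in mind for question 3.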