CPS 216: Advanced Database Systems
(Data-Intensive Computing Systems, Fall 2010)

Assignment 1

The first programming assignment has three parts: A, B, and C. The deadline for this assignment is September 17, 5:00 PM.

Part A

Page 20 (Chapter 2) of the handout on Hadoop gives a MapReduce program that computes the maximum temperature per year for the NCDC dataset. Write a program to compute the average temperature per year for the same dataset. Is there a simple way to add a combiner to your program?

You can use the NCDC data at /cps216/common/NCDC to test your program.
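One way to think about the combiner question is sketched below in plain Python (not the Hadoop API). Averages cannot be combined directly, because an average of partial averages is generally not the overall average, but partial (sum, count) pairs can be. The function names and the simplified (year, temperature) records are assumptions made for illustration; parsing the real NCDC record layout is left out.

```python
from collections import defaultdict

def map_record(year, temp):
    # Emit a partial (sum, count) pair rather than a bare temperature.
    return (year, (temp, 1))

def combine(pairs):
    # Add up partial (sum, count) pairs per year. The output has the same
    # shape as the input, so it is safe to apply zero, one, or many times.
    acc = defaultdict(lambda: [0, 0])
    for year, (s, c) in pairs:
        acc[year][0] += s
        acc[year][1] += c
    return [(year, (s, c)) for year, (s, c) in acc.items()]

def reduce_average(pairs):
    # Final step: merge the remaining partial pairs, then divide once.
    return {year: s / c for year, (s, c) in combine(pairs)}
```

In this sketch the combiner and the first half of the reducer share the same logic; only the final division is specific to the reducer.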

Part B

In this part, you will write a MapReduce program to do word count (i.e., count the number of occurrences of each word) for the datasets stored at /cps216/common/TPC-DS on the Duke Hadoop cluster. Note that the words in these datasets are separated by the delimiter "|" (the pipe character). For example, the first line of /cps216/common/TPC-DS/customer.dat is:

1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javier.Lewis@VFAxlnZEvOx.org|2452508|

The first word on this line is "1". The second word is "AAAAAAAABAAAAAAA". The third word is "980124", and so on.
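The tokenization described above can be sketched in plain Python as follows. This is an illustration of the splitting rule only, not a Hadoop program. The function names are hypothetical, and skipping empty fields (such as the one between "CHILE" and the e-mail address, or after the trailing pipe) is an assumption; the assignment does not say whether empty strings count as words.

```python
from collections import Counter

def tokenize(line):
    # Split on the literal pipe character and drop empty fields
    # (assumption: empty strings are not counted as words).
    return [w for w in line.rstrip("\n").split("|") if w]

def word_count(lines):
    # Single-process stand-in for the map and reduce phases.
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts
```

If you write the mapper in Java, keep in mind that String.split takes a regular expression, in which "|" is the alternation operator, so the pipe has to be escaped there.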

Part C

This part involves some fun MapReduce processing using the White House Visitor Log. You can find the dataset at:
http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0827.csv

First download this dataset and copy it to HDFS (e.g., in the /usr/research/home/USERNAME directory, where USERNAME is replaced with your user name). Use the copyFromLocal command described at:
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_shell.html

The attributes in this dataset are described at:
http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Key-1209.txt
Also, you can see a spreadsheet of the data at:
http://www.whitehouse.gov/briefing-room/disclosures/visitor-records

You are required to write efficient MapReduce programs to find the following information:

(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST, NAMEMID) to the White House.

(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.

(iii) The 10 most frequent visitor-visitee combinations.

(iv) Some other interesting statistic that you can think of.
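As a starting point for queries (i) to (iii), the grouping keys could be built from each record roughly as shown in this plain-Python sketch. It assumes the CSV has a header row that uses the attribute names mentioned above (NAMELAST, NAMEFIRST, NAMEMID, visitee_namelast, visitee_namefirst); check this against the key file. The function names and the whitespace and case normalization are illustrative choices, not requirements.

```python
import csv
import io

def _norm(value):
    # Assumption: treat names that differ only in case or padding as equal.
    return (value or "").strip().upper()

def visit_keys(row):
    # Build the three grouping keys for one visitor-log record.
    visitor = (_norm(row["NAMELAST"]), _norm(row["NAMEFIRST"]), _norm(row["NAMEMID"]))
    visitee = (_norm(row["visitee_namelast"]), _norm(row["visitee_namefirst"]))
    return visitor, visitee, visitor + visitee

def keys_from_csv(text):
    # csv.DictReader handles quoted fields that contain commas.
    return [visit_keys(row) for row in csv.DictReader(io.StringIO(text))]
```

Using a CSV parser rather than splitting on commas matters whenever a free-text field contains a comma inside quotes.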

Throughout this programming assignment, do not write programs that work only when run with a single reduce task.
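One possible way to compute a top-10 list with several reduce tasks is sketched below in plain Python, simulating the data flow rather than using Hadoop. Each key is routed to exactly one reducer, so every reducer sees the complete count for its keys and can emit a local top-k; a small final step then merges the local lists. The function names, the number of reducers, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def partition(key, num_reducers):
    # Stand-in for a hash partitioner: all records with the same key
    # go to the same reducer.
    return hash(key) % num_reducers

def local_top_k(keys, k):
    # One reducer: count its keys, keep only its k largest counts.
    counts = Counter(keys)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def top_k(records, k=10, num_reducers=4):
    # Shuffle: distribute the keys across the reducers.
    buckets = [[] for _ in range(num_reducers)]
    for key in records:
        buckets[partition(key, num_reducers)].append(key)
    # Merge step: at most k * num_reducers candidates remain.
    candidates = [c for bucket in buckets for c in local_top_k(bucket, k)]
    return sorted(candidates, key=lambda kv: (-kv[1], kv[0]))[:k]
```

The merge step handles very little data, so it could run as a second, small job or on the client after the first job finishes.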