CPS 216: Advanced Database Systems
Assignment 1
The first Programming Assignment has three parts: A, B, and C. The deadline for this assignment is Sept 17, 5:00 PM.
Part A
Page 20 (Chapter 2) of the handout on Hadoop gives a MapReduce program that computes the maximum temperature per year for the NCDC dataset. Write a program to compute the average temperature per year for the same dataset. Is there a simple way to add a combiner to your program?
You can use the NCDC data at /cps216/common/NCDC to test your program.
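One point worth keeping in mind for the combiner question: an average of averages is generally not the overall average, whereas partial sums and counts can be merged in any order. The following is a minimal plain-Python sketch of that idea, not a Hadoop program. The function names and the (year, temperature) input shape are assumptions made for illustration; parsing of the actual NCDC records is left out.

```python
from collections import defaultdict


def map_record(year, temperature):
    """Map step: emit one reading as a partial (sum, count) pair keyed by year."""
    return year, (temperature, 1)


def combine(pairs):
    """Add partial sums and counts component-wise.

    The operation is associative and commutative, so it gives the same
    result whether it runs zero, one, or many times before the reducer.
    """
    total, count = 0, 0
    for partial_sum, partial_count in pairs:
        total += partial_sum
        count += partial_count
    return total, count


def average_per_year(splits):
    """Simulate map tasks (one per split), a map-side combiner, and a reducer."""
    partials = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for year, temperature in split:
            key, value = map_record(year, temperature)
            local[key].append(value)
        # Combiner: collapse each map task's output to one pair per year.
        for key, values in local.items():
            partials[key].append(combine(values))
    # Reducer: merge the partial pairs, then divide exactly once.
    result = {}
    for year, values in partials.items():
        total, count = combine(values)
        result[year] = total / count
    return result
```

In this sketch the division happens only in the final reduce step, so the result does not depend on how the readings were spread across map tasks.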
Part B
In this part, you will write a MapReduce program to do word count (i.e., count the number of occurrences of each word) for the datasets stored at /cps216/common/TPC-DS on the Duke Hadoop cluster. Note that the words in these datasets are separated by the delimiter string "|" (pipe character). For example, the first line of /cps216/common/TPC-DS/customer.dat is:
The first word on this line is "1". The second word is "AAAAAAAABAAAAAAA". The third word is "980124", and so on.
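The pipe-based tokenization described above can be sketched in plain Python as follows. This is only an illustration of the splitting logic, not a Hadoop job; the function names are hypothetical, and skipping empty fields (for example, those produced by a trailing or doubled "|") is an assumption rather than a stated requirement.

```python
from collections import Counter


def map_line(line):
    """Map step: emit (word, 1) for every non-empty pipe-separated field."""
    for word in line.rstrip("\n").split("|"):
        if word:  # assumption: empty fields are not counted as words
            yield word, 1


def word_count(lines):
    """Reduce step, simulated in memory: sum the 1s emitted for each word."""
    counts = Counter()
    for line in lines:
        for word, one in map_line(line):
            counts[word] += one
    return dict(counts)
```

For a line that begins with the three words named above, such as the hypothetical fragment `1|AAAAAAAABAAAAAAA|980124|`, `map_line` would emit one pair for each of "1", "AAAAAAAABAAAAAAA", and "980124".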
Part C
This part involves some fun MapReduce processing using the White House Visitor Log. You can find the dataset at:
First download this dataset and copy it to HDFS (e.g., in the /usr/research/home/USERNAME directory, where USERNAME is replaced with your user name). Use the copyFromLocal command described at:
The attributes in this dataset are described at:
You are required to write efficient MapReduce programs to find the following information:
(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST, NAMEMID) to the White House.
(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.
(iii) The 10 most frequent visitor-visitee combinations.
(iv) Some other interesting statistic that you can think of.
Throughout this programming assignment, do not restrict your programs to running with a single reduce task.
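One possible way to compute a top-10 list without funnelling every record through a single reduce task is a two-phase approach: many reducers count their own keys and each keeps only its local top k, then a small final step merges those short lists. The plain-Python sketch below simulates that idea under stated assumptions. The function names, the use of Python's built-in hash as a stand-in for a partitioner, and the tie-breaking rule (ties broken by key order) are all illustrative choices, not part of the assignment.

```python
from collections import Counter


def rank(items, k):
    """Return the k (key, count) pairs with the highest counts, ties broken by key."""
    return sorted(items, key=lambda kv: (-kv[1], kv[0]))[:k]


def top_k(keys, k=10, num_reducers=4):
    """Simulate a two-phase top-k over a stream of keys (e.g., visitor name tuples)."""
    # Phase 1: every occurrence of a key goes to the same reducer,
    # so each reducer holds complete counts for the keys it owns.
    reducers = [Counter() for _ in range(num_reducers)]
    for key in keys:
        reducers[hash(key) % num_reducers][key] += 1
    # Each reducer emits only its local top k, at most k * num_reducers pairs in total.
    candidates = []
    for counts in reducers:
        candidates.extend(rank(counts.items(), k))
    # Phase 2: the global top k is always among the local top-k lists,
    # so a final merge over this small candidate set is sufficient.
    return rank(candidates, k)
```

Because each key is counted by exactly one reducer, the result does not depend on how keys happen to be partitioned; only the final merge sees data from all reducers, and its input is small.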