CPS 216: Advanced Database Systems
Assignment 2
The deadline for this assignment is Thursday, Oct 7, 5:00 PM.
The objective of the second programming assignment is to better understand how Hadoop works. For this assignment, you should know how to start a Hadoop cluster on Amazon EC2. You can use the software that we have provided for this purpose through the class git repository --- see the Extra Materials section on the class page.
You will copy a dataset into HDFS (the Hadoop Distributed File System) in the Hadoop cluster you started on Amazon EC2. Through the git repository, we will provide a data generator that can generate data like the /cps216/common/TPC-DS/customer.dat and /cps216/common/TPC-DS/web_sales.dat datasets you used in programming assignment 1. This data generator runs on the Hadoop cluster you started on Amazon EC2, and it copies the data in parallel into HDFS on the cluster. You are welcome to use any other dataset you like as long as it satisfies three properties: (a) it is at least 10GB in size, (b) you can meaningfully run WordCount and Sort on it, and (c) you do not have to copy the dataset back and forth between the Hadoop cluster on EC2 and your local machine at Duke every time (Amazon charges for network traffic between EC2 machines and machines outside EC2, such as your local machine at Duke, but it does not charge for traffic between EC2 machines and Amazon S3 storage).
You need two MapReduce programs for this assignment. Program 1 is WordCount. You can use the WordCount program that you wrote for assignment 1; just ensure that it works for the dataset you will be using. For example, datasets like /cps216/common/TPC-DS/customer.dat from assignment 1 use the "|" symbol as the separator between words. Program 2 has to sort the records in the input dataset on some attribute of your choice. It is recommended that you pick an integer attribute to sort on, and you should not pick an attribute on which the data is already sorted. For example, in /cps216/common/TPC-DS/customer.dat, the third attribute of each record (with value 980124 in the first record) is an integer attribute on which you could sort this dataset.
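As a minimal sketch in plain Java (outside Hadoop), here is the record-parsing step both programs need. One common pitfall: String.split takes a regular expression, so the "|" separator must be escaped. The sample record below is hypothetical, not an actual line from customer.dat:

```java
// Sketch of parsing a "|"-delimited record and extracting an integer
// sort key. The sample record in main() is made up for illustration.
public class RecordParse {
    // Split on "|"; the pipe must be escaped because split() takes a regex.
    static String[] fields(String record) {
        return record.split("\\|");
    }

    // Extract the third attribute (index 2) as an integer sort key.
    static int sortKey(String record) {
        return Integer.parseInt(fields(record)[2]);
    }

    public static void main(String[] args) {
        String record = "1|AAAAAAAABAAAAAAA|980124|7135";  // hypothetical record
        System.out.println(fields(record).length);  // 4
        System.out.println(sortKey(record));        // 980124
    }
}
```

The same splitting logic would go inside your map function; using an unescaped "|" would silently split the record into individual characters.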
The goal of this assignment is as follows: for each of the two programs, and an input dataset of size at least 10GB (gigabytes), run the program with different numbers of reducers on three Hadoop cluster sizes: 2 slave nodes, 4 slave nodes, and 8 slave nodes. For each program, plot the running times observed for each Hadoop cluster size in a graph similar to the one on the right side of the 25th slide (titled "Hadoop 75GB TeraSort") in the "How MapReduce Works (in Hadoop)" lecture slides. The horizontal axis of this graph shows the number of reducers, and the vertical axis shows the running time of the program. One way to get the running time is to use the JobTracker web interface, which by default runs on port 50030 on the master node of the Hadoop cluster. Keep in mind that on reboot the JobTracker loses information about the jobs it has run so far.
A natural question is what numbers of reducers you should try in each case. We leave that for you to find out, since the goal of this programming assignment is to better understand how Hadoop works. What we want to see is the trend as the number of reducers is varied in each case. For example, notice the trend in the right-side graph on the 25th slide of "How MapReduce Works (in Hadoop)": the running time starts off very high for a small number of reducers, then drops, then goes up, and then keeps coming down. You may not need to go up to 300 reducers, or run the programs for every single value of the number of reducers; choose your runs wisely to understand the trend. Each "*" symbol in that graph shows one run of the program, and only 11 runs were needed to understand the trend in this case.
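One reasonable way to cover a wide range of reducer counts with few runs is a doubling sweep, refining afterward near any interesting region. This is only an illustrative strategy, not a required choice; a sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generate a doubling sweep of reducer counts up to a maximum.
// One illustrative way to pick runs, not the assignment's required choice.
public class ReducerSweep {
    static List<Integer> sweep(int max) {
        List<Integer> counts = new ArrayList<>();
        for (int r = 1; r <= max; r *= 2) {
            counts.add(r);
        }
        return counts;
    }

    public static void main(String[] args) {
        // e.g. up to 64 reducers takes only 7 runs: 1, 2, 4, 8, 16, 32, 64
        System.out.println(sweep(64));
    }
}
```

After the coarse sweep reveals where the running time changes sharply, a few extra runs at intermediate values pin down the trend.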
Apart from your code, you should submit six graphs: one for each combination of program (WordCount, Sort) and Hadoop cluster size (2/4/8 slave nodes).
For your convenience, we will be providing you the software for an "experiment harness". This harness will simplify the task of running a MapReduce program on Hadoop repeatedly for different settings. The software will be provided through the class git repository. Usage instructions will be posted and the TA will demonstrate its use in the TA sessions.