Homework #10: MapReduce

DUE: Tuesday 3/31 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

Start up your VM and issue the following commands, which will install the mrjob Python package and create a working directory for this assignment:

/opt/datacourse/sync.sh
cp -pr /opt/datacourse/assignments/hw10/ ~/hw10/

Execute a MapReduce job locally

To get you started, we have provided a sample file tweets_sm.txt containing a fairly small number of tweets, and a simple program hashtag_count.py that counts the total number of tweets and the total number of hashtag uses.
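In case it helps to see the shape of such a job before opening the file, here is a minimal sketch in the mrjob style (this is not the provided code, and it assumes each input line holds one tweet's text; adapt it to the actual file format):

    from mrjob.job import MRJob

    class MRHashtagCount(MRJob):

        def mapper(self, _, line):
            # one counter for tweets, one for every hashtag use
            yield 'tweets', 1
            for word in line.split():
                if word.startswith('#'):
                    yield 'hashtag uses', 1

        def reducer(self, key, counts):
            # sum up the 1s emitted by all mappers for this key
            yield key, sum(counts)

    if __name__ == '__main__':
        MRHashtagCount.run()

Every mrjob program follows this pattern: a mapper that emits key-value pairs, and a reducer that combines all values sharing a key.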

You can execute the MapReduce program locally on your VM, without using a cluster, as follows:

python hashtag_count.py tweets_sm.txt

As a rule of thumb, always test and debug your MapReduce program locally on smaller datasets before you attempt it on a big cluster on Amazon---runs on Amazon will cost you money!
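For example, you can carve an even smaller sample out of the provided file for quick iterations (the file name tweets_tiny.txt is just an example):

head -n 1000 tweets_sm.txt > tweets_tiny.txt
python hashtag_count.py tweets_tiny.txt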

Signing up for Amazon AWS and setting up mrjob/EMR

  • If you haven't done so, follow the instructions here to sign up for Amazon AWS. You do not need to create new VM instances on Amazon at this time. You will keep programming on your own course VM; later, when you run your code, you can tell it to automatically make use of additional computing resources, remotely, on Amazon AWS (even if your course VM is itself hosted by Amazon, those resources are separate from it).

  • Next, you need to make sure that your VM can access the Amazon AWS key pair, which you obtained by following the instructions above. Make a copy of the file datacourse.pem in the assignment directory inside the VM, and run the following command in that directory (it removes all group and other permissions from the key file, which SSH requires):

    chmod go= datacourse.pem

  • Now, edit the file mrjob.conf in the assignment working directory on your VM. Replace the first two fields with your own access key id and secret access key. Make sure there is a space between each colon and the value that follows it in mrjob.conf; the parser is picky about that!
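    For reference, the credential lines at the top of mrjob.conf should end up looking something like this (placeholder values shown; note the space after each colon):

      runners:
        emr:
          aws_access_key_id: YOURACCESSKEYID
          aws_secret_access_key: YOURSECRETACCESSKEY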

Executing your MapReduce program on Amazon EMR

Now, you are ready to run a program on Amazon EMR (Elastic MapReduce, Amazon's hosted Hadoop service)! Just type the following command:

python hashtag_count.py -c mrjob.conf -r emr tweets_sm.txt

Tips:

  • If you get an invalid SSH key error, it might be a region mismatch: key pairs are bound to specific regions. Check the region shown in the upper right of the AWS Management Console and make sure it matches the one specified in mrjob.conf.

  • You will see lots of diagnostic info flying by. The program will actually take longer to run on Amazon than on your local machine! That's expected, because of the various overheads involved (provisioning the cluster, shipping code and data, and so on). MapReduce on a cluster is really meant for problems much, much larger than this one.

  • In mrjob.conf, you can optionally tweak:

    • The cluster size (num_ec2_instances, i.e., the number of Amazon machines) on which to run your program. For massive data you will need a larger cluster to finish in a reasonable amount of time, but don't go overboard: more machines mean more money charged to your account.
    • The cluster type (ec2_instance_type), which is currently set to m1.small. This basic machine type should suffice for us. Fancier machine types cost more.
    • The maximum number of concurrent map and reduce tasks per machine (under bootstrap_actions). Generally speaking, machines with more cores can run more concurrent tasks. You can leave this out and just trust EMR's default setting.
    • The total number of map and reduce tasks per job (mapred.map.tasks and mapred.reduce.tasks under jobconf). You can usually leave them unspecified; EMR/Hadoop will pick reasonable defaults based on what it considers a reasonable unit of work per task. Even if you do specify them, Hadoop may ignore them in some cases, so think of them as optimization "hints" only.
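    Put together, the tunable part of mrjob.conf might look something like this (the values are illustrative, not recommendations):

      runners:
        emr:
          num_ec2_instances: 4
          ec2_instance_type: m1.small
          jobconf:
            mapred.map.tasks: 20
            mapred.reduce.tasks: 5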

1. Finding the top 50 Twitter hashtags

In this exercise, you will write a MapReduce program to find the 50 most popular hashtags in a file containing approximately 3.5 million tweets (a couple of hours in the twitterverse).

We strongly recommend that you debug locally first before running on the full file on Amazon. You can make a copy of the example code:

cp hashtag_count.py topk.py

and edit topk.py to suit your needs. When running on Amazon, you can use the following files:

  • Small test file: s3://cs216-spring2015/twitter/tweets_sm.txt

  • Big file: s3://cs216-spring2015/twitter/tweets-full/*

You can give these file URLs to your Python program directly, e.g.:

python topk.py -c mrjob.conf -r emr s3://cs216-spring2015/twitter/tweets-full/*

Hint 1: Your approach should work on much bigger datasets than what we are using here. Keep in mind the tips from class about potential problems when computing on data in parallel.

Hint 2: You can create standard Python data structures within MRJob methods or as instance attributes. Keep in mind that these are stored in memory for each mapper or reducer task, so any such structure should be kept small.
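For instance, a bounded heap that never grows past K entries is consistent with this hint. Here is a sketch of that pattern (not necessarily the intended solution; reducer_init and reducer_final run once per reducer task):

    import heapq

    from mrjob.job import MRJob

    class MRTopK(MRJob):

        K = 50

        def reducer_init(self):
            # small in-memory structure: holds at most K entries
            self.top = []

        def reducer(self, tag, counts):
            total = sum(counts)
            heapq.heappush(self.top, (total, tag))
            if len(self.top) > self.K:
                heapq.heappop(self.top)  # evict the current smallest count

        def reducer_final(self):
            for total, tag in sorted(self.top, reverse=True):
                yield tag, total

    if __name__ == '__main__':
        MRTopK.run()

Note that each reducer task sees only a subset of the keys, so a sketch like this still leaves you to figure out how to combine the candidates from multiple tasks into a single global top 50 (recall the tips from class).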

WHAT TO SUBMIT: Submit a plain-text file named topk.txt with your results. You must submit the results for the big file (the small file is there in case you need to debug your program). Also submit your code (topk.py), including comments that explain it.