Homework #10: Graph and MapReduce

DUE: Monday 3/24 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

Start up your VM and issue the following commands, which will download the necessary files and create a working directory for your assignment:

/opt/datacourse/sync.sh
cp -pr /opt/datacourse/assignments/hw10-template/ ~/hw10/
cd ~/hw10/

1. PageRank in MapReduce

First, study the code in pagerank.py, which we went over in class. You can run this code, as is, on the Web graph of Stanford Web pages (from http://snap.stanford.edu/data/web-Stanford.html). To test the code locally on a tiny subgraph, run the command:

./pagerank.py tiny-web-Stanford.txt > tiny-web-Stanford-pageranks-local.txt

This command will save the results in tiny-web-Stanford-pageranks-local.txt. Each result line lists the page id, its PageRank, and then the list of page ids that it points to. The full dataset, web-Stanford.txt, has 281903 nodes, and will take much longer to run. See Scalability Experiments towards the end of this part if you wish to try.
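If you want to inspect the results programmatically (for example, to find the top-ranked pages), a small parser can help. The sketch below assumes each output line uses whitespace-separated fields in the order described above (page id, PageRank, then outlink ids); adjust the splitting if your output is formatted differently.

```python
def parse_result_line(line):
    # Assumed format (check against your actual output):
    # <page_id> <pagerank> <outlink_id> <outlink_id> ...
    fields = line.split()
    page_id = fields[0]
    rank = float(fields[1])
    outlinks = fields[2:]
    return page_id, rank, outlinks

# Example: sort all result lines by PageRank, highest first.
def top_pages(lines, k=20):
    parsed = [parse_result_line(line) for line in lines if line.strip()]
    parsed.sort(key=lambda rec: rec[1], reverse=True)
    return parsed[:k]
```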

Your job here is to retool pagerank.py to work on characters.tsv, the undirected social network of Marvel characters we created in Homework #9, so that we can use PageRank to rank these characters. Save your modified program as charank.py. Note that PageRank is ordinarily defined for directed graphs, so you will need to decide how to define it for an undirected graph.

HINT: Although you are free to change charank.py in any way you wish, only a small number of changes is required to make it work.

WHAT TO SUBMIT: Your charank.py file, as well as a plain-text file 1.txt. In 1.txt: 1) state your definition of PageRank for an undirected graph; 2) list the top 20 characters together with their PageRank values (you can just copy and paste from your charank.py's output); 3) comment on how this top 20 list compares with the top 20 characters in terms of degrees (which we saw using graph.py in Homework #9).
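One natural way to define PageRank for an undirected graph (this is a suggestion, not the required definition — you must state and justify your own in 1.txt) is to treat each undirected edge {u, v} as the two directed edges u→v and v→u. A minimal, non-MapReduce sketch under that assumption, useful for sanity-checking charank.py on a small example:

```python
DAMPING = 0.85  # standard damping factor; your charank.py may differ

def pagerank_undirected(edges, iterations=50):
    """Iterative PageRank where each undirected edge counts in both
    directions, so a node splits its rank among all its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    ranks = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        # Every node starts with the "random jump" share...
        new_ranks = {node: (1.0 - DAMPING) / n for node in adj}
        # ...and receives an equal share of each neighbor's rank.
        for node, neighbors in adj.items():
            share = DAMPING * ranks[node] / len(neighbors)
            for nb in neighbors:
                new_ranks[nb] += share
        ranks = new_ranks
    return ranks
```

Since there are no dangling nodes in a connected undirected graph under this definition, the ranks stay normalized (they sum to 1), which is a handy invariant to check in your own output.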

Scalability Experiments [Optional] We encourage you to try pagerank.py on the full Stanford Web graph. First, try it locally (and note how long it takes):

./pagerank.py web-Stanford.txt >| web-Stanford-pageranks-local.txt

Next, try it on Amazon EMR. For your convenience, we have put a copy of the input file on Amazon S3 here:

s3://cs290-spring2014/web-Stanford.txt

To run pagerank.py on Amazon EMR, you will need the two files datacourse.pem and mrjob.conf from Homework #8. Assuming that you have working versions of these files in your Homework #8 directory, you can just copy them to your working directory as follows:

cp -p ~/hw08/datacourse.pem ~/hw08/mrjob.conf ~/hw10/

Feel free to tweak mrjob.conf as you see fit (such as increasing num_ec2_instances). Then, you can just run pagerank.py as follows:

./pagerank.py -c mrjob.conf -r emr s3://cs290-spring2014/web-Stanford.txt >| web-Stanford-pageranks-emr.txt
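As an illustration of the kind of tweak mentioned above, the emr section of mrjob.conf might look like the following (the instance count and type here are purely illustrative; keep the rest of your Homework #8 configuration intact and use values appropriate for your AWS account):

```yaml
runners:
  emr:
    # More instances can speed up large jobs, but cost more.
    num_ec2_instances: 4
    # Illustrative instance type; check what EMR currently offers.
    ec2_instance_type: m1.medium
```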

If you complete this optional part, comment on the relative speeds of local vs. Amazon executions in 1.txt.

2. Two-Hop Reachability

Write a MapReduce program twohop.py that, given a directed graph such as web-Stanford.txt, counts the number of unique pairs of nodes (u, v) such that there exists a path of length exactly two from u to v.

The final output should be just a single total count. Since MapReduce requires everything to be key-value pairs, you can emit your final output as the pair (None, count).
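Before (or instead of) scaling up, it can help to have an independent way to check your answer. The brute-force, non-MapReduce counter below works on any small edge list; it assumes a pair (u, v) is counted once no matter how many two-step paths connect u to v, and note that, as written, it also counts (u, u) when a two-cycle u→v→u exists — make sure your twohop.py resolves these edge cases the same way you intend.

```python
def count_two_hop_pairs(edges):
    """Count unique pairs (u, v) with a directed path u -> m -> v
    of length exactly two. Brute force; only for small graphs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    pairs = set()
    for u, mids in adj.items():
        for m in mids:
            for v in adj.get(m, ()):
                pairs.add((u, v))  # set() deduplicates multiple paths
    return len(pairs)
```

For example, on the edges {1→2, 2→3, 2→4} the two-hop pairs are (1, 3) and (1, 4), so the count is 2 even though both paths start at the same node.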

Test your program on the tiny version:

    ./twohop.py tiny-web-Stanford.txt

WHAT TO SUBMIT: Your twohop.py file, as well as a plain-text file 2.txt. In 2.txt, report the final answer you got on the tiny graph.

Scalability Experiments [Optional] We encourage you to try twohop.py on the full Stanford Web graph. Try it both locally and remotely using Amazon EMR. Note the running times and answers.

If you complete this optional part, report the final answer you got on the full graph, and comment on the relative speeds of local vs. Amazon executions in 2.txt.