Lab #9: Graphs

DUE: Tuesday 3/18 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command will download the files which we are going to use for this lab.

Next, create a working directory (lab09 under your home directory) and get ready for this lab:

cp -pr /opt/datacourse/assignments/lab09-template/ ~/lab09/
cd ~/lab09/

1. Which is Which?

In the working directory of this lab, you will find a bunch of graphs in the GraphML format: 1.xml.gz, 2.xml.gz, ..., 6.xml.gz. These are compressed XML files. They are actually human-readable---you can use the command less to view them (e.g., type the command less 1.xml.gz). Their format is fairly self-explanatory, and you don't need to understand the details about this format for this lab.

These files correspond to the following graphs. Can you tell which one is which? Of course, the ordering of files has been scrambled. Also, in the case of real-world graphs below, the node ids have been permuted and don't correspond to the ids in the original datasets.

(Random-Sparse) A random graph with \(p < 1/50\). See Lecture #9, slides 11-14 for the definition of a random graph. Recall that \(p\) is the probability that an edge exists between a given pair of nodes.

(Random-Dense) A random graph with \(p > 1/4\).

(BA-1) A Barabasi-Albert graph (preferential attachment model) with \(m = 1\). See Lecture #9, slides 24-27 for the definition of Barabasi-Albert model of graph construction. Recall that nodes are added one at a time, and each node will try to preferentially attach itself to \(m\) existing nodes; those already with high degrees will be preferred.

(BA-3) A Barabasi-Albert graph with \(m = 3\).

(Congress) The nodes in this graph represent Representatives and roll calls in the House in 2013. If a Representative \(u\) voted "Yea" or "Aye" in a roll call \(v\), then we draw an edge between \(u\) and \(v\); otherwise, there is no edge between \(u\) and \(v\). To obtain a graph with \(N\) nodes, we pick them uniformly at random and consider edges among them. The data came from govtrack.us (in fact it's in our congress PostgreSQL database).

(Coauthor) The nodes in this graph represent authors in the field of data mining. There is an edge between two nodes if the two authors have written at least one paper together. To obtain a graph with \(N\) nodes, we pick \(N\) nodes with the highest degrees in the original graph, and then consider just these \(N\) nodes and the edges among them. The data came from the WSDM Data Challenge.

TOOL OF TRADE: To find out which file corresponds to which graph, you are free to use any tool and run any kind of analysis on the files. For your convenience, however, we've provided an improved version of the script graph.py that you used in Homework #9. (Hint: In fact, graph.py alone should suffice for this lab.)

To analyze 1.xml.gz, for example, just issue the command:

./graph.py 1.xml.gz

The two figures (degree and distance distributions) will be automatically saved in two PDF files (with prefix 1.xml). Alternatively, you can just run graph.py in a batch mode (which saves you the trouble of clicking on windows):

./graph.py --batch 1.xml.gz > 1.xml-report.txt

(Remember to change both occurrences of 1.xml above to match the file you want to analyze.) This command will save the two figures in PDF files and the textual output in the .txt file, which you can look at further.

GETTING CHECKED OFF: Once you feel that you have an answer, raise your hands to get your result checked off by course staff. To discourage random guessing:

WINNING: The teams will be ranked by the number of correct matches in their best answer. Ties are broken by the number of lifelines used. If there is still a tie, it will be broken by the time of the answer.

EXTRA CREDITS: If you do better than correctly identifying all but two graphs, you receive extra credits worth 10% of a homework. The top three teams will receive extra credits worth 15%, 10%, and 5% of a homework (in addition to the 10% mentioned earlier).