Homework #11: Graph Sampling

DUE: Monday 3/31 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

Start up your VM and issue the following commands, which will download the necessary files and create a working directory for your assignment:

/opt/datacourse/sync.sh
cp -pr /opt/datacourse/assignments/hw11-template/ ~/hw11/
cd ~/hw11/

1. Graph Sampling

In this exercise you will edit the sample.py scripts to implement different graph sampling routines. You will then use graph.py to study the sampling distributions of the following three statistics (see lecture notes for definitions).

  1. Average Degree
  2. Average Correlation Coefficient
  3. Average Betweenness

Calculating betweenness on the full, unsampled graph can take quite a while, so we've calculated the unsampled statistics for you:

Unsampled Statistics

Implement the following two sampling techniques discussed in class -- random_edge and random_jump. We have already implemented random_node sampling for you in sample.py. You can switch the sampling function used by editing graph.py as follows (as long as the sampling functions take the same required arguments):

from sample import random_jump as sample_function 

from sample import random_edge as sample_function

Note: Remember to run graph.py by using /usr/bin/python graph.py characters-fallback.tsv.

Random Edge Sampling

This is exactly as it sounds: we randomly select an edge from the graph, and repeat.

HINT: Your sample size will likely have to be much larger than random_node.

Random Jump Sampling

Random jump sampling is a variant on random walk sampling. The idea is:

  1. Uniformly at random pick a starting node \(S\) from graph \(G\)
  2. With probability (1-p), randomly select a node \(S_{neighbor}\) from \(S\).out_neighbors(), and continue
  3. With probability p, select a new node \(S'\) and random, and continue.

WHAT TO SUBMIT:

Your sample.py file with the above sampling techniques. In addition, a 1.txt file that answers the following questions:

  1. Which sampling algorithm best approximates the average degree? Why?
  2. Which sampling algorithm best approximates the average clustering coefficient? Why?
  3. Which sampling algorithm best approximates the average betweenness? Why?
  4. How does changing the probability (to a value less than 0.15) of random jump sampling impact the statistics? Or does it?
In []: