Choose data source for Network Construction assignment
Using Duke Scrobbler
Dametrious Peyton, Zach Marshall, and Beth Trushkowsky created Duke Scrobbler,
a set of online social network tools for use in our courses, based on
Facebook, UPenn's Lifester, and last.fm.
If you have a Facebook account, create a Scrobbler
account by logging in. You can delete your Scrobbler account and unlink it
from your Facebook account after the lab if you like. This system
also allows you to store your music listening profile. To
submit your listening history from your computer or iPod, you will use the Duke
Scrobbler Client, powered by
the Audioscrobbler
technology. More on this client later.
For now, you should just
download your Friends Graph. Right click on View My
Friends Graph and save the file as myFriends.graphml.
Using GUESS
Eytan Adar developed GUESS, a
tool for analyzing and visualizing networks. Ben Spain then adapted
it for our courses as DukeGUESS.
Follow these steps to open DukeGUESS.
Go to the Start menu.
Next go to Programs.
Inside that sub-menu, go to Programming Languages.
Inside that, go to DukeGuess.
Finally, choose the only available option: the DukeGuess folder.
This should open a directory window with all of the DukeGuess material. In this directory, double-click DukeGuess.jar to run the desired program.
You can
download the zipped
folder for use outside of lab.
We will use Duke GUESS to analyze and visualize networks in this course, so
you should follow the steps in the Duke GUESS tutorial.
Open the myFriends.graphml file that you saved earlier in
GUESS. Export the file to gdf format by choosing the Export
GDF... option in the File menu. Export to
myFriends.gdf. You will submit this file for this week's lab
and also use it in next week's lab.
Type
g.nodes
into the interpreter. What do you
see?
Instead of using your actual Facebook logins or IDs, each node in the
graph is denoted by a digital fingerprint like
isfGSN_i6zXA, generated by a hash function. Why do we use that instead of your actual ID?
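As a rough illustration of what a hash-based fingerprint looks like, the sketch below derives a short token from an account ID. The function name fingerprint, the salt value, the choice of SHA-256, and the 12-character length are all assumptions made for illustration; the scheme Duke Scrobbler actually uses is not described in this lab.

```python
import base64
import hashlib


def fingerprint(user_id: str, salt: str = "example-salt", length: int = 12) -> str:
    """Return a short, URL-safe token derived from user_id.

    Hypothetical helper: the salt, hash algorithm, and token length are
    illustrative assumptions, not Duke Scrobbler's real scheme.
    """
    # Hash the salted ID; the same input always yields the same digest.
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).digest()
    # Encode the raw bytes with a URL-safe alphabet (letters, digits, - and _).
    token = base64.urlsafe_b64encode(digest).decode("ascii")
    # Keep only a prefix so the node name stays short.
    return token[:length]
```

Calling fingerprint("12345") twice returns the same token, so the same person always maps to the same node name, while the token itself does not display the original ID.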
You should
identify a specific source of real-world data, the
precise definition of the network (vertices and edges) you plan to extract from this
data, and the methodology by which you will extract it.
We will be generous with the term "real-world", which could
include data from the domains of biology, sociology, economics and
finance, technology, etc. However, it must be a well-defined, objective
data source gathered by a third party.
An example of an entirely acceptable data
source is the recently released corpus of emails exchanged by
Enron executives, where it would be natural to examine the network of
who exchanged email with whom. An example of an unacceptable data
source and network would be "I wrote down a list of all my friends and
then connected any pair of them that I thought shared a lot of common
interests". This example is too subjective and the data is not gathered
by a third party.
To be sure there is some minimal level of complexity to your network, we
require that the number of vertices in the network be at least 12, and that the
total number of edges in the network be at least 12. However,
considerably more ambitious networks are encouraged.
By the "methodology" by which you will extract your network, we mean how
you plan to go from the raw data source and your defined network to an
actual representation of your network in our simple format (see below, but
essentially nothing more than a list of all the vertices in your network,
followed by a list of all those pairs of vertices that are connected
by an edge).
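Once you have extracted a list of vertices and a list of connected pairs, the final step of a methodology can be quite small. The following is a minimal sketch, assuming your vertex names contain no commas; the function name write_gdf and its parameters are hypothetical, and the header lines are the ones shown in the Data Format section.

```python
def write_gdf(path, vertices, edges, directed=True):
    """Write vertices and edges to path in the simple .gdf layout.

    Hypothetical helper. vertices is a list of names; edges is a list of
    (source, destination) pairs. Assumes names contain no commas.
    """
    flag = "true" if directed else "false"
    with open(path, "w") as out:
        # Vertex section: one name per line.
        out.write("nodedef> name\n")
        for name in vertices:
            out.write(f"{name}\n")
        # Edge section: one connected pair per line, plus the directed flag.
        out.write("edgedef> node1,node2,directed\n")
        for source, destination in edges:
            out.write(f"{source},{destination},{flag}\n")
```

For example, write_gdf("myNetwork.gdf", ["A", "B"], [("A", "B")]) produces a four-line file with one vertex section and one edge section.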
For this part, you should submit a brief write-up detailing the information
described above for your network. If your data source is online, please
provide the URLs for the source; feel free to include a small portion of
the raw data in your write-up if it would be helpful to do so. Be sure to
be as precise as possible in all aspects of your write-up, from network
definition to methodology. As an informal test, your write-up should
be sufficiently precise that a third party could independently create
the same network you will from your description.
You may do this section in pairs.
Data Format
Your network's description should be in a file called myNetwork.gdf in the GUESS .gdf file format. A gdf file consists of a section describing all of the vertices, followed by a section describing all of the edges. Consider simplegraph.gdf, a graph that has six nodes (A-F) and eight edges. The file is listed below and a visualization of the graph is to the right.
nodedef> name
A
B
C
D
E
F
edgedef> node1,node2,directed
A,B,true
A,C,true
B,C,true
B,D,true
C,D,true
D,C,true
E,F,true
F,C,true
The vertex section of the graph begins with the following line.
nodedef> name
This line indicates that each subsequent vertex definition line will have
the vertex name on it. You can define many other attributes for a vertex, but
only a name is required. The edge section of the graph begins with:
edgedef> node1,node2,directed
Edge definitions have a similar structure to vertex definitions. The only required
entries are the names of the two nodes to be connected by an
edge. The directed attribute indicates whether a
particular edge has directionality. False means that an edge is
undirected, and true means that the edge is directed, with the first
node being the source and the second node being the destination.
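Before submitting, it can help to read your file back and count what it contains. The sketch below is a simplified reader, not the parser GUESS itself uses: it assumes the name is the first column of each vertex line, that edge columns appear in the order node1,node2,directed, and that no value contains a comma. The function names read_gdf and meets_minimum are hypothetical; the default minimums are the ones required by this assignment.

```python
def read_gdf(path):
    """Return (vertices, edges) from a simple .gdf file.

    Simplified, hypothetical reader: assumes comma-separated columns with
    the vertex name first and edge columns ordered node1,node2,directed.
    """
    vertices, edges = [], []
    section = None
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith("nodedef>"):
                section = "nodes"
            elif line.startswith("edgedef>"):
                section = "edges"
            elif section == "nodes":
                vertices.append(line.split(",")[0].strip())
            elif section == "edges":
                fields = [field.strip() for field in line.split(",")]
                # Treat a missing third column as an undirected edge.
                directed = len(fields) > 2 and fields[2].lower() == "true"
                edges.append((fields[0], fields[1], directed))
    return vertices, edges


def meets_minimum(vertices, edges, min_vertices=12, min_edges=12):
    """Check the assignment's size requirement (at least 12 of each)."""
    return len(vertices) >= min_vertices and len(edges) >= min_edges
```

Run on simplegraph.gdf, read_gdf finds six vertices and eight edges, so meets_minimum reports that the example graph is too small to submit as your own network.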
Possible Data Sources
CAIDA data
:
data collected by this well-known Internet research consortium. Vertices
could be routers or autonomous systems. Edges could be network
links or peering agreements.
CERT data
: details on network and computer security vulnerabilities.
The Internet Movie Data Base:
Vertices could be actors in a particular class of movies and edges
would indicate that they were in the same movie together.
United Nations Statistics Division
: economic and trade data, demographic data, etc. Each vertex could be
a nation. Edges could be significant trade relations.
Bureau of Labor Statistics
: Labor-related data collected by the government. Vertices could be various quantities tracked over time by BLS, like
unemployment, consumer prices, productivity, wages, etc. There could be
an edge between two such quantities if historically they are significantly correlated.
Yahoo! finance
: historical data on financial markets. Vertices could be companies
listed in some index.
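For sources such as the Bureau of Labor Statistics, where an edge means that two quantities are significantly correlated, your write-up must pin down exactly what "significantly" means. One possible rule is sketched below: connect two series when the absolute value of their Pearson correlation is at least some threshold. The threshold of 0.8, the function names, and the use of Pearson correlation are assumptions for illustration, not requirements of the assignment.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series is treated as uncorrelated
    return covariance / sqrt(spread_x * spread_y)


def correlation_edges(series, threshold=0.8):
    """Return undirected edges between names whose |correlation| >= threshold.

    series maps a quantity name to its list of values over the same
    time periods. The threshold is an illustrative assumption.
    """
    edges = []
    for a, b in combinations(sorted(series), 2):
        if abs(pearson(series[a], series[b])) >= threshold:
            edges.append((a, b))
    return edges
```

The resulting pairs are undirected, so they would be written to the .gdf file with the directed attribute set to false.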
Submitting
You should submit the following files via the Lab 3 assignment on Blackboard.
myFriends.graphml and myFriends.gdf: Your friends from Facebook extracted from
Duke Scrobbler
lab3.txt: Answers for all of the questions from the "Using
GUESS" section in a text file
writeup.txt: A file for the "Building a Network" section
that lists the specific source of real-world data, the
precise definition of the network (vertices and edges) you
plan to extract from this data, and the methodology by
which you will extract it.
myNetwork.gdf: GUESS data description for the network you
specified in writeup.txt
For writeup.txt and myNetwork.gdf, you and your partner should both submit
the same file.