Algorithms for Big-Data Management

CompSci 590.02, Spring 2013

Instructor: Ashwin Machanavajjhala
When: 3:05 PM - 4:20 PM Tuesdays and Thursdays
Where Social Science 105
Office Hours: By appointment (send email to firstname@cs.duke.edu)

Overview: Many real world applications are moving from being computationally-bound to being data-bound -- we are seeing a wide variety of staggeringly large datasets. There are billions of emails and search queries, and millions of tweets and photos posted everyday, in addition to our every action being tracked online (via cookies) and in the physical world (e.g., through video cameras). Analyzing this data, especially by combining multiple datasets from different sources, has revolutionized e-commerce, advertising and health industries, and has facilitated advances in science and public policy (e.g. Google Flu).

This course will provide an introduction to algorithm design on such large datasets, with a specific focus on integrating information from multiple datasets. In this context, we will cover parallel algorithms (which partition computation across many machines), streaming algorithms (that never store the whole input in memory), large scale machine learning, and crowd-sourcing.

Lectures in this course will be based on one or more recent publications. There will be no exams. Students will be graded on (a) class participation (attendance, reading assigned papers, class discussion, paper presentation), (b) scribing notes for one or two lectures, and (c) the completion of a group project. Projects can focus on developing new theory/ algorithms, or on implementing/adapting known algorithms for new application domains.


Prerequisites: The course is open to interested graduate and undergraduate students with sufficient mathematical maturity. Students are expected to have completed Discrete Math (CS 230 or equivalent) and Algorithms (CS 330 or equivalent). Knowledge of basic probability will be assumed.

News:
Administrivia:

  Syllabus:

#
Date
Topic
Readings


Sampling

1
Thu, Jan 10
Reservoir Sampling (slides)
J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical
Software, 1985
2
Tue, Jan 15
Sampling from databases (slides)
N. Dalvi, R. Kumar, A. Machanavajjhala, V. Rastogi, "Sampling hidden objects using nearest-neighbor oracles", KDD 2011.
Optional: Frank Olken "Random Sampling from Databases", PhD Thesis
3
Thu, Jan 17
Monte Carlo Estimation, DNF counting (slides)
R. Karp, M. Luby, N. Madras, "Monte Carlo Estimation Algorithm for Enumeration Problems", Journal of Algorithms 10(3) 1989
4
Tue, Jan 22
Markov Chain Monte Carlo (slides) L. Lovasz "Random Walks on Graphs: A Survey", Combinatorics 1993
5
Thu, Jan 24
Coupling and Mixing Times (notes)
Assignment 1 is posted




Streaming Algorithms
6
Tue, Jan 29
Filtering & Estimating the number of distinct objects (slides)
Chapter 4: Section 4.3, 4.4,  Ullman & Rajaraman textbook
7
Thu, Jan 31
Frequency Moments (notes)
Chapter 3: Section 4.5,  Ullman & Rajaraman textbook
Alon, Mathias, Szegedy, "The space complexity of approximating the frequency moments", STOC 96
8
Tue, Feb 5
Heavy-Hitters & Count min sketch
(Andrew McGregor's slides)
Cormode, Muthukrishnan, "An improved data stream summary: The count min sketch and its applications"
9
Thu, Feb 7
Online Aggregation (slides)
Hellerstein, Haas, Wang, "Online Aggregation", SIGMOD '97
Haas, Hellerstein, "Ripple Joins", SIGMOD '99
Chaudhuri, Motwani, Narasayya, "On Random Sampling over Joins", SIGMOD'99
10
Tue, Feb 12 Weighted Majority & online learning (slides)


Parallel Architechtures

11
Thu, Feb 14
Map Reduce basics (slides)
Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook

Tue, Feb 19
NO CLASS

12
Thu, Feb 21
Evaluating Joins on Map Reduce (slides)
Project Proposal is Due
Okcan, Reideweld, "Processing Theta Joins using Map Reduce", SIGMOD '11
13
Tue, Feb 26
Graphs on Map Reduce (slides)
Rastogi, Machanavajjhala, Chitnis, Das Sarma, "Finding Connected Components in Map Reduce in Logarithmic Rounds", ICDE 2013
Kang, Tsourakis, Faloutsos, "Pegasus: a Peta Scale Graph Mining System", ICDM 2009
14
Thu, Feb 28
Graph Processing & Iterations (slides)
Assignment 2 is posted
Malewicz et al, "Pregel: A system for large scale graph processing", SIGMOD 2010
15
Tue, Mar 5
Asynchronous Processing (slides)
Low, Gonzales, Kyrola, Bickson, Guestrin, Hellerstein, "Distributed Graphlab", VLDB 2012
Wang, Xie, Demers, Gehrke, "Asynchronous Large Scale Graph Processing Made Easy", CIDR 2013
16
Thu, Mar 7
Feed Following (slides) Silberstein, Terrace, Cooper, Ramakrishnan, "Feeding Frency: Selectively Materializing User's Event Feeds", SIGMOD 2010
Silberstein, Machanavajjhala, Ramakrishnan, "Feed Following", DBSocial 2011

Tue, Mar 12
NO CLASS Springbreak


Thu, Mar 14
NO CLASS Springbreak

Thu, Mar 21
Buffer



Clustering & Deduplication
17
Tue, Mar 26
Clustering (slides)
Chapter 7:   Ullman & Rajaraman textbook
McCallum, Nigam, Ungar, "Efficient Clustering of High Dimensional Data Sets", KDD 2000
18
Thu, Mar 28
Deduplication  (slides) Chapter 3:   Ullman & Rajaraman textbook
19
Tue, Apr 2
Efficient identification of near duplicates & Locality Sensitive Hashing (slides) Chapter 3:   Ullman & Rajaraman textbook
20
Thu, Apr 4
Correlation Clustering (slides) Bansal, Chawla, Blum, "Correlation Clustering", ML 2004
Elsner, Schudy, "Bounding and comparing methods for correlation clustering", ILP-NLP 2009
21
Tue, Apr 9
Deduplication in Linked Data (slides) Bhattacharya, Getoor, "Collective Entity Resolution in relational data",  TKDD 2007

22
Thu, Apr 11
Collective Entity Resolution & Scalability (slides) Rastogi, Dalvi, Garofalakis, "Large Scale Collective Entity Matching", VLDB 2011
23
Tue, Apr 16
Active Learning Sarawagi, Bhamidipaty, "Iterative Deduplication using Active Learning", KDD 2002
Balcan, Beygelzimer, Langford, "Agnostic Active Learning", ICML 2006
24
Mon, Apr 29
Project Presentations