Algorithms for Big-Data Management

CompSci 590.04, Fall 2015

Instructor: Ashwin Machanavajjhala
TA: Xi He
When: 3:05 PM - 4:20 PM Wednesdays and Fridays
Where Social Science 124

Office Hours:
    Ashwin Machanavajjhala: By appointment (send email to firstname@cs.duke.edu)
    Xi He: Wednesdays 4:30 - 6 PM @ LSRC D309
 

Overview: Many real world applications are moving from being computationally-bound to being data-bound -- we are seeing a wide variety of staggeringly large datasets. There are billions of emails and search queries, and millions of tweets and photos posted everyday, in addition to our every action being tracked online (via cookies) and in the physical world (e.g., through video cameras). Analyzing this data, especially by combining multiple datasets from different sources, has revolutionized e-commerce, advertising and health industries, and has facilitated advances in science and public policy (e.g. Google Flu).

This course will provide an introduction to algorithm design on such large datasets. We will cover parallel algorithms (which partition computation across many machines), streaming algorithms (that never store the whole input in memory), and computing over noisy data (with applications to privacy and crowd-sourcing).

Lectures in this course will be based on one or more recent publications. There will be no exams. Students will be graded on (a) class participation (attendance, reading assigned papers, class discussion, paper presentation), (b) scribing notes for one or two lectures, and (c) the completion of a group project. Projects can focus on developing new theory/ algorithms, or on implementing/adapting known algorithms for new application domains.


Prerequisites: The course is open to interested graduate and undergraduate students with sufficient mathematical maturity. Students are expected to have completed Discrete Math (CS 230 or equivalent) and Algorithms (CS 330 or equivalent) or Databases (CS316). Knowledge of basic probability will be assumed.


Administrivia:

  Tentative Syllabus:

#
Date
Topic
Readings


Sampling

1
Wed, 8/26
Reservoir Sampling (slides)
J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical
Software, 1985
2
Fri, 8/28
Sampling from databases (slides) N. Dalvi, R. Kumar, A. Machanavajjhala, V. Rastogi, "Sampling hidden objects using nearest-neighbor oracles", KDD 2011.
Optional: Frank Olken "Random Sampling from Databases", PhD Thesis
3
Wed,9/2
Monte Carlo Estimation, DNF counting (slides) R. Karp, M. Luby, N. Madras, "Monte Carlo Estimation Algorithm for Enumeration Problems", Journal of Algorithms 10(3) 1989
4
Fri, 9/4
Markov Chain Monte Carlo (slides)

Assignment 1 posted

L. Lovasz "Random Walks on Graphs: A Survey", Combinatorics 1993
5
Wed, 9/9
Coupling and Mixing Times (slides) (notes)



Streaming Algorithms
6
Fri, 9/11
Filtering & Estimating the number of distinct objects (slides)

Assignment 1 DUE
Chapter 4: Section 4.3, 4.4,  Ullman & Rajaraman textbook
7
Wed, 9/16
Frequency Moments (notes)
Chapter 3: Section 4.5,  Ullman & Rajaraman textbook
Alon, Mathias, Szegedy, "The space complexity of approximating the frequency moments", STOC 96
8
Fri, 9/18
k-wise independence & Count min sketch
(Andrew McGregor's Slides)


Assignment 2 posted
Cormode, Muthukrishnan, "An improved data stream summary: The count min sketch and its applications"
9
Wed, 9/23
Count min sketch and Heavy Hitters

10
Fri, 9/25
Online Prediction and Decision Making (slides)

Assignment 2 DUE Monday 9/28



Parallel Architechtures

11
Wed, 9/30
Map Reduce basics (slides) Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook
12
Fri, 10/2
Map Reduce (continued)

Locality Sensitive Hashing (slides)

Project Proposal is Due

Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook

Chapter 3:   Ullman & Rajaraman textbook
13
Wed, 10/7
Graphs on Map Reduce (slides) Rastogi, Machanavajjhala, Chitnis, Das Sarma, "Finding Connected Components in Map Reduce in Logarithmic Rounds", ICDE 2013
Kang, Tsourakis, Faloutsos, "Pegasus: a Peta Scale Graph Mining System", ICDM 2009
14
Fri, 10/9
Graph Processing & Iterations (slides)



Malewicz et al, "Pregel: A system for large scale graph processing", SIGMOD 2010

Wed 10/14
NO CLASS

Assignment 3 posted

15
Fri, 10/16
Asynchronous Processing


Low, Gonzales, Kyrola, Bickson, Guestrin, Hellerstein, "Distributed Graphlab", VLDB 2012
Wang, Xie, Demers, Gehrke, "Asynchronous Large Scale Graph Processing Made Easy", CIDR 2013
16
Wed, 10/21
Resilient Distributed Datasets (slides)
Zaharia et al "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing", NSDI 2012
17
Fri, 10/23
Feed Following  (slides)


Silberstein, Terrace, Cooper, Ramakrishnan, "Feeding Frency: Selectively Materializing User's Event Feeds", SIGMOD 2010
Silberstein, Machanavajjhala, Ramakrishnan, "Feed Following", DBSocial 2011


Joins
18
Wed 10/28
Basic Join Algorithms (slides)

Assignment 3 DUE


19
Fri 10/30
Worst Case Optimal Join Algorithms (slides)

20
Wed 11/4
Record Linkage and Fuzzy Joins

Mid-term report DUE
Chapter 3:   Ullman & Rajaraman textbook


Computing with Error

21
Fri, 11/6
Motivation: Differential privacy
"Calibrating Noise to Sensitivity in Private Date Analysis"
C. Dwork, F. McSherry, K. Nissim, A.Smith TCC 2006
"Differential Privacy" C. Dwork, ICALP 2006
22
Wed, 11/11
Answering sets of counting queries


"Boosting the Accuracy of Differentially Private Histograms
Through Consistency
", M. Hay, V. Rastogi, G. Miklau, D. Suciu, PVLDB 2010
"Optimizing Linear Counting Queries Under Differential Privacy"
C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor PODS 2010
23
Fri, 11/13
Computing Histograms


"A Data- and Workload- aware Algorithm for Range Queries under Differential privacy", Li, Hay, Miklau, Wang, VLDB 2014
"A simple and practical algorithm for differentially private data release", F. McSherry, K. Ligett, M. Hardt, NIPS 2012
24
Wed 11/18
Sorting under Error "Computing with Noisy Information", Fiege, Raghavan, Peleg, Upfal, SIAM Journal of Computing 1994
25
Fri, 11/20
Project Presentations