Algorithms for Big-Data Management

CompSci 590.04, Fall 2015

Instructor: Ashwin Machanavajjhala
TA: Xi He
When: 3:05 PM - 4:20 PM Wednesdays and Fridays
Where: Social Science 124

Office Hours:
Ashwin Machanavajjhala: By appointment (send email to firstname@cs.duke.edu)
Xi He: Wednesdays 4:30 - 6 PM @ LSRC D309

Overview: Many real world applications are moving from being computationally-bound to being data-bound -- we are seeing a wide variety of staggeringly large datasets. There are billions of emails and search queries, and millions of tweets and photos posted everyday, in addition to our every action being tracked online (via cookies) and in the physical world (e.g., through video cameras). Analyzing this data, especially by combining multiple datasets from different sources, has revolutionized e-commerce, advertising and health industries, and has facilitated advances in science and public policy (e.g. Google Flu).

This course will provide an introduction to algorithm design on such large datasets. We will cover parallel algorithms (which partition computation across many machines), streaming algorithms (that never store the whole input in memory), and computing over noisy data (with applications to privacy and crowd-sourcing).

Lectures in this course will be based on one or more recent publications. There will be no exams. Students will be graded on (a) class participation (attendance, reading assigned papers, class discussion, paper presentation), (b) scribing notes for one or two lectures, and (c) the completion of a group project. Projects can focus on developing new theory/ algorithms, or on implementing/adapting known algorithms for new application domains.

Prerequisites: The course is open to interested graduate and undergraduate students with sufficient mathematical maturity. Students are expected to have completed Discrete Math (CS 230 or equivalent) and Algorithms (CS 330 or equivalent) or Databases (CS316). Knowledge of basic probability will be assumed.

Administrivia:

There are no exams.
Each lecture will be based on one or two recent publications. Students are expected to read the assigned papers before class and will be graded based on class participation.
There will be three asignments that will involve programming tasks.
Students must complete a class project in groups of size 3-4. Projects can focus on developing new theory/algorithms, or on implementing/adapting known algorithms to a real application setting. Details on projects are here.

Tentative Syllabus:

#	Date	Topic	Readings
		Sampling
1	Wed, 8/26	Reservoir Sampling (slides)	J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical Software, 1985
2	Fri, 8/28	Sampling from databases (slides)	N. Dalvi, R. Kumar, A. Machanavajjhala, V. Rastogi, "Sampling hidden objects using nearest-neighbor oracles", KDD 2011. Optional: Frank Olken "Random Sampling from Databases", PhD Thesis
3	Wed,9/2	Monte Carlo Estimation, DNF counting (slides)	R. Karp, M. Luby, N. Madras, "Monte Carlo Estimation Algorithm for Enumeration Problems", Journal of Algorithms 10(3) 1989
4	Fri, 9/4	Markov Chain Monte Carlo (slides) Assignment 1 posted	L. Lovasz "Random Walks on Graphs: A Survey", Combinatorics 1993
5	Wed, 9/9	Coupling and Mixing Times (slides) (notes)
		Streaming Algorithms
6	Fri, 9/11	Filtering & Estimating the number of distinct objects (slides) Assignment 1 DUE	Chapter 4: Section 4.3, 4.4, Ullman & Rajaraman textbook
7	Wed, 9/16	Frequency Moments (notes)	Chapter 3: Section 4.5, Ullman & Rajaraman textbook Alon, Mathias, Szegedy, "The space complexity of approximating the frequency moments", STOC 96
8	Fri, 9/18	k-wise independence & Count min sketch (Andrew McGregor's Slides) Assignment 2 posted	Cormode, Muthukrishnan, "An improved data stream summary: The count min sketch and its applications"
9	Wed, 9/23	Count min sketch and Heavy Hitters
10	Fri, 9/25	Online Prediction and Decision Making (slides) Assignment 2 DUE Monday 9/28
		Parallel Architechtures
11	Wed, 9/30	Map Reduce basics (slides)	Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook
12	Fri, 10/2	Map Reduce (continued) Locality Sensitive Hashing (slides) Project Proposal is Due	Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook Chapter 3: Ullman & Rajaraman textbook
13	Wed, 10/7	Graphs on Map Reduce (slides)	Rastogi, Machanavajjhala, Chitnis, Das Sarma, "Finding Connected Components in Map Reduce in Logarithmic Rounds", ICDE 2013 Kang, Tsourakis, Faloutsos, "Pegasus: a Peta Scale Graph Mining System", ICDM 2009
14	Fri, 10/9	Graph Processing & Iterations (slides)	Malewicz et al, "Pregel: A system for large scale graph processing", SIGMOD 2010
	Wed 10/14	NO CLASS Assignment 3 posted
15	Fri, 10/16	Asynchronous Processing	Low, Gonzales, Kyrola, Bickson, Guestrin, Hellerstein, "Distributed Graphlab", VLDB 2012 Wang, Xie, Demers, Gehrke, "Asynchronous Large Scale Graph Processing Made Easy", CIDR 2013
16	Wed, 10/21	Resilient Distributed Datasets (slides)	Zaharia et al "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing", NSDI 2012
17	Fri, 10/23	Feed Following (slides)	Silberstein, Terrace, Cooper, Ramakrishnan, "Feeding Frency: Selectively Materializing User's Event Feeds", SIGMOD 2010 Silberstein, Machanavajjhala, Ramakrishnan, "Feed Following", DBSocial 2011
		Joins
18	Wed 10/28	Basic Join Algorithms (slides) Assignment 3 DUE
19	Fri 10/30	Worst Case Optimal Join Algorithms (slides)
20	Wed 11/4	Record Linkage and Fuzzy Joins Mid-term report DUE	Chapter 3: Ullman & Rajaraman textbook
		Computing with Error
21	Fri, 11/6	Motivation: Differential privacy	"Calibrating Noise to Sensitivity in Private Date Analysis" C. Dwork, F. McSherry, K. Nissim, A.Smith TCC 2006 "Differential Privacy" C. Dwork, ICALP 2006
22	Wed, 11/11	Answering sets of counting queries	"Boosting the Accuracy of Differentially Private Histograms Through Consistency", M. Hay, V. Rastogi, G. Miklau, D. Suciu, PVLDB 2010 "Optimizing Linear Counting Queries Under Differential Privacy" C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor PODS 2010
23	Fri, 11/13	Computing Histograms	"A Data- and Workload- aware Algorithm for Range Queries under Differential privacy", Li, Hay, Miklau, Wang, VLDB 2014 "A simple and practical algorithm for differentially private data release", F. McSherry, K. Ligett, M. Hardt, NIPS 2012
24	Wed 11/18	Sorting under Error	"Computing with Noisy Information", Fiege, Raghavan, Peleg, Upfal, SIAM Journal of Computing 1994
25	Fri, 11/20	Project Presentations