CompSci 590.06: Advanced Topics in CS

Understanding Data: Theory and Applications

Fall 2015

Instructor: Sudeepa Roy
Email: sudeepa AT cs.duke.edu
Day/Time: Tuesdays and Thursdays, 11:45 am - 1:00 pm
Place: North 306
Office Hours: LSRC D325, Tuesdays 2:00 pm - 2:50 pm and by appointment

News

9/23/2015: We have a new classroom starting 09/24 (Thursday): North 306!

8/28/2015: The details of review questions have been included in the schedule (five reviews in total during the course). Also the selection of papers has been slightly modified.

8/20/2015: Welcome to 590.06! The course starts on 8/25 (Tuesday). See you all soon!

Overview

With the surge in the data routinely collected from diverse sources, there is a great need for tools and techniques that can help a range of users and data analysts understand data as they explore the available datasets. This course gives an overview of the traditional as well as the state-of-the art approaches studied in the database community toward this goal. Topics include OLAP data cube techniques for exploring multi-dimensional datasets, data provenance, causality and explanations for query answers, handling uncertain data (probabilistic, incomplete, and inconsistent databases), data analysis with humans in the loop, and systems for large-scale data analytics and visualization.

Prerequisites

This is an advanced graduate-level course, but undergraduate students with the necessary background and interest are also welcome. A basic understanding of Discrete Maths (CS 230/equivalent), Databases (CS 316/equivalent), and Algorithms (CS 330/equivalent) is expected. Please contact the instructor if you have not taken such courses but are interested in taking this course.

Logistics

The goals of this course are: (1) to learn a number of research topics that focus on theory, applications, and systems for data analysis in databases, (2) to practice reading and presenting the main ideas from existing research papers, and most importantly, (3) to have a hands-on experience in doing research in data analysis.

Lectures will cover one or more papers on each topic. There are no exams. Students are expected to actively participate in the discussions during the lectures.

The students will answer five review questions (0.5-1 page each) on five different topics covered in the course (see the schedule for details). The review questions ask for potential research directions on each topic that the student can think of.

In addition, the students will present one paper of their choice in the duration of the course and lead the discussion that day.

Students will do a semester-long research project, either individually or in groups of two. The project can be on a theoretical problem, or developing a prototype for data analysis for a particular domain or application (no survey please!). The instructor will send out potential project ideas, but the students are welcome and encouraged to select a topic close to their own research interests related to the theme of the course. There will be four deliverables for the course - a project proposal, a midterm progress report, the final project report, and a presentation and/or demonstration in the class. The aim should be to have a research paper or a prototype for demonstration (better, both!) that can be submitted to a conference/journal/workshop either in databases or in the respective research area of the student.

Grading

Research Project: 50%
Review Questions (five): 25%
Class Participation: 10%
Paper Presentation: 15%

Schedule

(subject to change)

	Day	Topic	Paper(s)	Slides	Comments
1	8/25	Introduction and Review of SQL		Lecture-1	As a guest lecturer
Classical Topics (OLAP Cube - Data Warehouse - Data Mining)
2	8/27	Data Cube Overview and Implementation (1)	Main reading: Agarwal-Agrawal-Deshpande-Gupta-Naughton-Ramakrishnan-Sarawagi, VLDB 1996 On the Computation of Multidimensional Aggregates (link) Optional reading: Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996 Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals (link)	Lecture-2	As a guest lecturer
3	9/1	Cube Implementation (2)	Main reading: Harinarayan-Rajaraman-Ullman, SIGMOD 1996 Implementing data cubes efficiently (link)	Lecture-3
4	9/3	Data warehousing and Iceberg Queries	Main reading: Chaudhuri-Dayal, SIGMOD Record 1997 An Overview of Data Warehousing and OLAP Technology (link) Fang-Shivakumar- Garcia-Molina -Motwani-Ullman, VLDB 1998 Computing Iceberg Queries Efficiently (link)	Lecture-4
5	9/8	Index for ROLAP and a MOLAP cube (array-based) implementation	Main reading: Gupta-Harinarayan-Rajaraman-Ullman, ICDE 1997 Index Selection for OLAP (link) Zhao-Deshpande-Naughton, SIGMOD 1997 An Array-Based Algorithm for Simultaneous Multidimensional Aggregates (link)	Lecture-5
6	9/10	Mining Association Rules	Main reading: Agrawal-Srikant, VLDB 1994 Fast Algorithms for Mining Association Rules in Large Databases (link)	Lecture 6	Presentation by the authors in VLDB'04 (Test-of-time award) slides
	9/13	Project Proposal Due Review Question-1: Review (0.5 to 1 page) due for "OLAP - Data Cube - Data Warehouse": Write a short review of one of the papers. or Can you think about a research direction related to your own research interests that will benefit from the Data Cube operator (provided by a DBMS or a new implementation)? Write about it. Mention the technical challenges, and if any of the approaches covered in the class can possibly help.
Provenance
7	9/15	Provenance Overview and Semirings	Main reading: Green-Karvounarakis-Tannen, PODS 2007 Provenance Semirings (link) Optional reading: Buneman-Khanna-Tan, ICDT 2001 Why and where: A Characterization of Data Provenance (link) Cheney-Chiticariu-Tan, Foundations and Trends in Databases 2009 Provenance in Databases: Why, How, and Where (link)		EDBT/ICDT 2010 keynote by Dr. Val Tannen (link)
8	9/17	Why-Not Queries (Query Based)	Main reading: Chapman-Jagadish, SIGMOD 2009 Why Not? (link) Optional reading: Tran-Chan, SIGMOD 2010 How to Conquer Why-Not Questions (link)	Lecture-8
8	9/22	Explanations for Databases ("Detour" lecture for projects)	Main reading: Roy-Suciu, SIGMOD 2014 A Formal Approach to Finding Explanations for Database Queries (link) Optional reading: Wu-Madden, PVLDB 2013 Scorpion: Explaining Away Outliers in Aggregate Queries (link)	Lecture-9
10	9/24	Deletion Propagation + Why-Not Queries (Data Based)	Main reading: Buneman-Khanna-Tan, PODS 2002 On Propagation of Deletions and Annotations through Views (link) Huang-Chen-Doan-Naughton, PVLDB 2008 On the Provenance of Non-Answers to Queries over Extracted Data (link) Optional reading: Kimelfeld-Vondrak-Williams, PODS 2011 Maximizing Conjunctive Views in Deletion Propagation (link) Herschel-Hernandez, PVLDB 2010 Explaining Missing Answers to SPJUA Queries (link)	Lecture-10
	9/27	Review Question-2: Review (0.5 to 1 page) due for "Provenance": Write a short review of one of the papers. or Write about a potential research direction that involves tracing provenance (of data or query answers) and is related to your own research interests. Also mention the technical challenges, and if any of the approaches covered in the class can possibly help.
Uncertain Data
11	9/29	Probabilistic Databases	Main reading: (Chapters 3-5) Suciu-Olteanu-Re-Koch, 2011 Book: Probabilistic Databases (link) Optional reading: Dalvi-Suciu, JACM 2012 The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries (link) Dalvi-Suciu, VLDB 2004 Efficient Query Evaluation on Probabilistic Databases (link)	Lecture-11
12	10/1	Probabilistic Databases (Continued)	Continued	Lecture-12
13	10/6	Incomplete Databases	Main reading: (Chapters 19) Abiteboul-Hull-Vianu Book: Foundations of Databases (link) Optional reading: Imielinski-Lipski, JACM 1984 Incomplete Information in Relational Databases (link)	Lecture-13
14	10/8	Inconsistent Databases	Reading: Chomicki, ICDT 2007 (Invited talk) Consistent Query Answering: Five Easy Pieces (link) Bertossi, SIGMOD Record 2006 Consistent Query Answers in Databases (link) Arenas-Bertossi-Chomicki, PODS 1999 Consistent Query Answers in Inconsistent Databases (link)	Lecture-14
	10/11	Review Question-3: Submit homework on uncertain data.
	10/13	Fall Break
Causality
15	10/15	Causality in AI Midterm Project Progress Report Due	Optional reading: Halpern-Pearl, UAI 2001 Causes and Explanations: A Structural-Model Approach - Part I : Causes (link) Pearl, UCLA Tech Report 1999 Probabilities of Causation: Three Counterfactual Interpretations and their Identification (link) Meliou-Roy-Suciu, VLDB 2014 Tutorial: Causality and Explanations in Databases (link)	Lecture-15
16	10/20	Causality in Databases	Main reading: Meliou-Gatterbauer-Moore-Suciu, PVLDB 2010 The Complexity of Causality and Responsibility for Query Answers and Non-Answers (link) Optional reading: Meliou-Gatterbauer-Nath-Suciu, SIGMOD 2011 Tracing Data Errors with View-Conditioned Causality (link)	Lecture-16
17	10/22	Causality in Statistics	Optional reading: Rubin, Journal of the American Statistical Association, 2005 Causal Inference Using Potential Outcomes: Design, Modeling, Decisions (link)	Lecture-17	Student presentations
		No review for this topic. Work on your project!
Data Analysis with Humans
18	10/27	Database Usability	Main reading: Jagadish-Chapman-Elkiss-Jayapandian-Li-Nandi-Yu, SIGMOD 2007 Making Database Systems Usable (link) Optional reading: Li-Chan-Maier, VLDB 2015 Query From Examples: An Iterative, Data-Driven Approach to Query Construction (link)	Lecture-18	Student presentations
19	10/29	Crowdsourcing Systems	Main reading: Franklin-Kossmann-Kraska-Ramesh-Xin, SIGMOD 2011 CrowdDB: Answering Queries with Crowdsourcing (link) Marcus-Wu-Karger-Madden-Miller, PVLDB 2011 Human-powered Sorts and Joins (link) Optional reading: Parameswaran-Park- Garcia-Molina -Polyzotis-Widom, CIKM 2012 Deco: Declarative Crowdsourcing (link)		Slides available online
	11/1	Review Question-4: Review (0.5 to 1 page) due for "Data Analysis with Humans": Write a short review of one of the papers. or Write about a potential research direction that includes data analysis and human interaction (either as a user or as the crowd), and is related to your own research interest. Also mention the technical challenges, and if any of the approaches covered in the class can possibly help.
20	11/3	Crowdsourcing Operators (Max)	Main reading: Guo-Parameswaran- Garcia-Molina, SIGMOD 2012 So Who Won? Dynamic Max Discovery with the Crowd (link) Davidson-Khanna-Milo-Roy, TODS 2014 Top-k and Clustering with Noisy Comparisons (link) Optional reading: Feige-Raghavan-Peleg-Upfal, SIAM Journal on Computing, 1994 Computing with Noisy Information (link)	Lecture-20	Slides available online
Systems for Data Analysis
21	11/5	Tools for ML in Databases	Overview (one as main): Hazy, MAD Skills, MLbase		Slides available online
22	11/10	Tools for Large-scale Analytics	Overview (one as main): Dremel, Shark, Spark		Student presentations
23	11/12	Tools for Visualization	Overview (one as main): Tableau, Graph Lab, Data Wrangler		Slides available online
	11/15	Review Question-5: Hands-on experience on one data-analytics system of your choice (listed or not-listed): Install the system, choose a dataset, run queries, write about your observation, and attach the graphs/tables (< 1 page). You can try to use it for your project. You may want to start early. This is the last review.
24	11/17	Feature Selection in Analytics	Main reading: Zhang-Kumar-Re, SIGMOD 2014 Materialization Optimizations for Feature Selection Workloads (link)	Lecture-24
25	11/19	Efficient Feature Selection Algorithms			Student presentations
Conclusions
26	11/24	Project Demonstrations and Presentations Final Project Report Due

Reading List

(This list contains papers that will be covered in the lectures as well as additional readings. This list will be updated regularly.)

Classical Topics

(OLAP Cube - Data Warehouse - Data Mining)

Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals (link)
Agarwal-Agrawal-Deshpande-Gupta-Naughton-Ramakrishnan-Sarawagi, VLDB 1996
On the Computation of Multidimensional Aggregates (link)
Harinarayan-Rajaraman-Ullman, SIGMOD 1996
Implementing data cubes efficiently (link)
Gupta-Harinarayan-Rajaraman-Ullman, ICDE 1997
Index selection for OLAP (link)
Agrawal-Srikant, VLDB 1994
Fast Algorithms for Mining Association Rules in Large Databases (link)
Chaudhuri-Dayal, SIGMOD Record 1997
An Overview of Data Warehousing and OLAP Technology (link)
Fang-Shivakumar- Garcia-Molina -Motwani-Ullman, VLDB 1998
Computing Iceberg Queries Efficiently (link)
Xin-Han-Li-Wah, VLDB 2003
Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration (link)
Ross-Srivastava, VLDB 1997
Fast Computation of Sparse Datacubes (link)
Li-Han-Gonzalez, VLDB 2004
High-Dimensional OLAP: A Minimal Cubing Approach (link)
Sathe-Sarawagi, VLDB 2001
Intelligent Rollups in Multidimensional OLAP Data (link)
Sarawagi-Sathe, SIGMOD 2000
i3: Intelligent, Interactive Investigaton of OLAP data cubes (link)

Provenance

Buneman-Khanna-Tan, ICDT 2001
Why and where: A Characterization of Data Provenance (link)
Green-Karvounarakis-Tannen, PODS 2007
Provenance Semirings (link)
Cui-Widom-Wiener. TODS 2000
Tracing the Lineage of View Data in a Warehousing Environment (link)
Amsterdamer-Deutch-Tannen, PODS 2011
Provenance for aggregate queries (link)
Cheney-Chiticariu-Tan, Foundations and Trends in Databases 2009
Provenance in Databases: Why, How, and Where (link)
Chapman-Jagadish, SIGMOD 2009
Why Not? (link)
Tran-Chan, SIGMOD 2010
How to Conquer Why-Not Questions (link)
Herschel-Hernandez, PVLDB 2010
Explaining Missing Answers to SPJUA Queries (link)
Herschel-Hernandez-Tan, VLDB 2009
Artemis: A System for Analyzing Missing Answers (link)
Huang-Chen-Doan-Naughton, PVLDB 2008
On the Provenance of Non-Answers to Queries over Extracted Data (link)
Buneman-Khanna-Tan, PODS 2002
On Propagation of Deletions and Annotations through Views (link)
Cong-Fan-Geerts-Luo, TKDE 2011
On the Complexity of View Update and its Applications to Annotation Propagation (link)
Kimelfeld-Vondrak-Williams, PODS 2011
Maximizing Conjunctive Views in Deletion Propagation (link)

Uncertain Data

Dalvi-Suciu, JACM 2012
The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries (link)
Dalvi-Suciu, VLDB 2004
Efficient Query Evaluation on Probabilistic Databases (link)
Cavallo-Pittarelli. VLDB 1987
The Theory of Probabilistic Databases (link)
Fuhr, VLDB 1990
A Probabilistic Framework for Vague Queries and Imprecise Information in Databases (link)
Zimanyi, TCS 1997
Query Evaluation in Probabilistic Relational Databases (link)
Re-Suciu, PVLDB 2008
Approximate Lineage for Probabilistic Databases (link)
Re-Dalvi-Suciu, ICDE 2007
Efficient Top-k Query Evaluation on Probabilistic Data (link)
Imielinski-Lipski, JACM 1984
Incomplete Information in Relational Databases (link)
Arenas-Bertossi-Chomicki, PODS 1999
Consistent Query Answers in Inconsistent Databases (link)
Lopatenko-Bertossi, ICDT 2007
Complexity of Consistent Query Answering in Databases under Cardinality-based and Incremental Repair Semantics (link)
Fagin-Kimelfeld-Kolaitis, PODS 2015
Dichotomies in the Complexity of Preferred Repairs (link)

Causality and Explanations

Rubin, Journal of the American Statistical Association, 2005
Causal Inference Using Potential Outcomes: Design, Modeling, Decisions (link)
Rosenbaum-Rubin, Biometrika, 1983
The Central Role of the Propensity Score in Observational Studies for Causal Effects (link)
Pearl, The International Journal of Biostatistics 2010
An Introduction to Causal Inference (link)
Halpern-Pearl, UAI 2001
Causes and Explanations: A Structural-Model Approach - Part I: Causes (link)
Meliou-Gatterbauer-Moore-Suciu, PVLDB 2010
The Complexity of Causality and Responsibility for Query Answers and Non-Answers (link)
Meliou-Gatterbauer-Nath-Suciu, SIGMOD 2011
Tracing Data Errors with View-Conditioned Causality (link)
Wu-Madden, PVLDB 2013
Scorpion: Explaining Away Outliers in Aggregate Queries (link)
Roy-Suciu, SIGMOD 2014
A Formal Approach to Finding Explanations for Database Queries (link)
Meliou-Roy-Suciu, VLDB 2014
Tutorial: Causality and Explanations in Databases (link)

Data Analysis with Humans

Feige-Raghavan-Peleg-Upfal, SIAM Journal on Computing, 1994
Computing with Noisy Information (link)
Davidson-Khanna-Milo-Roy, TODS 2014
Top-k and Clustering with Noisy Comparisons (link)
Guo-Parameswaran- Garcia-Molina, SIGMOD 2012
So Who Won? Dynamic Max Discovery with the Crowd (link)
Franklin-Kossmann-Kraska-Ramesh-Xin, SIGMOD 2011
CrowdDB: Answering Queries with Crowdsourcing (link)
Marcus-Wu-Karger-Madden-Miller, PVLDB 2011
Human-powered Sorts and Joins (link)
Parameswaran-Park- Garcia-Molina -Polyzotis-Widom, CIKM 2012
Deco: Declarative Crowdsourcing (link)
Jagadish-Chapman-Elkiss-Jayapandian-Li-Nandi-Yu, SIGMOD 2007
Making Database Systems Usable (link)
Tran-Chan-Parthasarathy, SIGMOD 2009
Query by Output (link)

Systems for Data Analysis

Xin-Rosen-Zaharia-Franklin-Shenker-Stoica, SIGMOD 2013
Shark: SQL and Rich Analytics at Scale (link)
Zaharia-Chowdhury-Franklin-Shenker-Stoica, HotCloud 2010
Spark: Cluster Computing with Working Sets (link)
Cohen-Dolan-Dunlap-Hellerstein-Welton, PVLDB 2009
MAD Skills: New Analysis Practices for Big Data (link)
Melnik-Gubarev-Long-Romer-Shivakumar-Tolton-Vassilakis, PVLDB 2010
Dremel: Interactive Analysis of Web-Scale Datasets (link)
Kumar-Niu-Re, CACM 2013
Hazy: Making it Easier to Build and Maintain Big-Data Analytics (link)
Feng-Kumar-Recht-Re, SIGMOD 2012
Towards a Unified Architecture for in-RDBMS Analytics (link)
Hellerstein-Re-Schoppmann-Wang-Fratkin-Gorajek-Ng-Welton-Feng-Li-Kumar, PVLDB 2012
The MADlib Analytics Library or MAD Skills (link)
Kraska-Talwalkar-Duchi-Griffith-Franklin-Jordan, CIDR 2013
MLBase: A Distributed Machine-learning System (link)
Tableau (link)
GraphLab/Dato (link)
Low-Bickson-Gonzalez-Guestrin-Kyrola-Hellerstein, PVLDB 2012
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud (link)
Data Wrangler (link)
Kandel-Paepcke-Hellerstein-Heer, CHI 2011
Wrangler: Interactive Visual Specification of Data Transformation Scripts (link)
Feng-Kumar-Recht-Re, SIGMOD 2012
Towards a Unified Architecture for in-RDBMS Analytics (link)

CompSci 590.06: Advanced Topics in CS

Understanding Data: Theory and Applications

Fall 2015

News

Overview

Prerequisites

Logistics

Grading

Schedule

Classical Topics

(OLAP Cube - Data Warehouse - Data Mining)

Provenance

Uncertain Data

Causality

Data Analysis with Humans

Systems for Data Analysis

Conclusions