CompSci 590.06: Advanced Topics in CS

Understanding Data: Theory and Applications

Fall 2015


Instructor: Sudeepa Roy
Email: sudeepa AT cs.duke.edu
Day/Time: Tuesdays and Thursdays, 11:45 am - 1:00 pm
PlaceNorth 306
Office Hours: LSRC D325, Tuesdays 2:00 pm - 2:50 pm and by appointment



News

  • 9/23/2015:   We have a new classroom starting 09/24 (Thursday): North 306!
  • 8/28/2015:   The details of review questions have been included in the schedule (five reviews in total during the course). Also the selection of papers has been slightly modified.
  • 8/20/2015:   Welcome to 590.06! The course starts on 8/25 (Tuesday). See you all soon!


Overview

With the surge in the data routinely collected from diverse sources, there is a great need for tools and techniques that can help a range of users and data analysts understand data as they explore the available datasets. This course gives an overview of the traditional as well as the state-of-the art approaches studied in the database community toward this goal. Topics include OLAP data cube techniques for exploring multi-dimensional datasets, data provenance, causality and explanations for query answers, handling uncertain data (probabilistic, incomplete, and inconsistent databases), data analysis with humans in the loop, and systems for large-scale data analytics and visualization.

Prerequisites

This is an advanced graduate-level course, but undergraduate students with the necessary background and interest are also welcome. A basic understanding of Discrete Maths (CS 230/equivalent), Databases (CS 316/equivalent), and Algorithms (CS 330/equivalent) is expected. Please contact the instructor if you have not taken such courses but are interested in taking this course.

Logistics

The goals of this course are: (1) to learn a number of research topics that focus on theory, applications, and systems for data analysis in databases, (2) to practice reading and presenting the main ideas from existing research papers, and most importantly, (3) to have a hands-on experience in doing research in data analysis.

Lectures will cover one or more papers on each topic. There are no exams. Students are expected to actively participate in the discussions during the lectures.

The students will answer five review questions (0.5-1 page each) on five different topics covered in the course (see the schedule for details). The review questions ask for potential research directions on each topic that the student can think of.

In addition, the students will present one paper of their choice in the duration of the course and lead the discussion that day.

Students will do a semester-long research project, either individually or in groups of two. The project can be on a theoretical problem, or developing a prototype for data analysis for a particular domain or application (no survey please!). The instructor will send out potential project ideas, but the students are welcome and encouraged to select a topic close to their own research interests related to the theme of the course. There will be four deliverables for the course - a project proposal, a midterm progress report, the final project report, and a presentation and/or demonstration in the class. The aim should be to have a research paper or a prototype for demonstration (better, both!) that can be submitted to a conference/journal/workshop either in databases or in the respective research area of the student.

Grading

  • Research Project: 50%
  • Review Questions (five): 25%      
  • Class Participation: 10%
  • Paper Presentation: 15%

Schedule

(subject to change)

  Day Topic Paper(s) Slides Comments
1 8/25 Introduction and
Review of SQL
Lecture-1 As a guest lecturer

Classical Topics

(OLAP Cube - Data Warehouse - Data Mining)

2 8/27 Data Cube
Overview and Implementation (1)
Main reading:
Agarwal-Agrawal-Deshpande-Gupta-Naughton-Ramakrishnan-Sarawagi, VLDB 1996
On the Computation of Multidimensional Aggregates (link)

Optional reading:
Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals (link)
Lecture-2 As a guest lecturer
3 9/1 Cube Implementation (2) Main reading:
Harinarayan-Rajaraman-Ullman, SIGMOD 1996
Implementing data cubes efficiently (link)
Lecture-3
4 9/3 Data warehousing
and Iceberg Queries
Main reading:
Chaudhuri-Dayal, SIGMOD Record 1997
An Overview of Data Warehousing and OLAP Technology (link)
Fang-Shivakumar- Garcia-Molina -Motwani-Ullman, VLDB 1998
Computing Iceberg Queries Efficiently (link)
Lecture-4
5 9/8 Index for ROLAP
and a MOLAP cube (array-based) implementation
Main reading:
Gupta-Harinarayan-Rajaraman-Ullman, ICDE 1997
Index Selection for OLAP (link)
Zhao-Deshpande-Naughton, SIGMOD 1997
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates (link)
Lecture-5
6 9/10 Mining Association Rules Main reading:
Agrawal-Srikant, VLDB 1994
Fast Algorithms for Mining Association Rules in Large Databases (link)
Lecture 6 Presentation by the authors in VLDB'04
(Test-of-time award) slides
9/13 Project Proposal Due
Review Question-1: Review (0.5 to 1 page) due for "OLAP - Data Cube - Data Warehouse":
Write a short review of one of the papers.
or
Can you think about a research direction related to your own research interests that will benefit from the Data Cube operator (provided by a DBMS or a new implementation)? Write about it. Mention the technical challenges, and if any of the approaches covered in the class can possibly help.

Provenance

7 9/15 Provenance
Overview and Semirings
Main reading:
Green-Karvounarakis-Tannen, PODS 2007
Provenance Semirings (link)

Optional reading:
Buneman-Khanna-Tan, ICDT 2001
Why and where: A Characterization of Data Provenance (link)
Cheney-Chiticariu-Tan, Foundations and Trends in Databases 2009
Provenance in Databases: Why, How, and Where (link)
EDBT/ICDT 2010 keynote
by Dr. Val Tannen
(link)
8 9/17 Why-Not Queries
(Query Based)
Main reading:
Chapman-Jagadish, SIGMOD 2009
Why Not? (link)

Optional reading:
Tran-Chan, SIGMOD 2010
How to Conquer Why-Not Questions (link)
Lecture-8
8 9/22 Explanations for Databases
("Detour" lecture for projects)
Main reading:
Roy-Suciu, SIGMOD 2014
A Formal Approach to Finding Explanations for Database Queries (link)

Optional reading:
Wu-Madden, PVLDB 2013
Scorpion: Explaining Away Outliers in Aggregate Queries (link)
Lecture-9
10 9/24 Deletion Propagation +
Why-Not Queries
(Data Based)
Main reading:
Buneman-Khanna-Tan, PODS 2002
On Propagation of Deletions and Annotations through Views (link)
Huang-Chen-Doan-Naughton, PVLDB 2008
On the Provenance of Non-Answers to Queries over Extracted Data (link)

Optional reading:
Kimelfeld-Vondrak-Williams, PODS 2011
Maximizing Conjunctive Views in Deletion Propagation (link)
Herschel-Hernandez, PVLDB 2010
Explaining Missing Answers to SPJUA Queries (link)
Lecture-10
9/27 Review Question-2: Review (0.5 to 1 page) due for "Provenance":
Write a short review of one of the papers.
or
Write about a potential research direction that involves tracing provenance (of data or query answers) and is related to your own research interests. Also mention the technical challenges, and if any of the approaches covered in the class can possibly help.

Uncertain Data

11 9/29 Probabilistic Databases Main reading:
(Chapters 3-5) Suciu-Olteanu-Re-Koch, 2011
Book: Probabilistic Databases (link)

Optional reading:
Dalvi-Suciu, JACM 2012
The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries (link)
Dalvi-Suciu, VLDB 2004
Efficient Query Evaluation on Probabilistic Databases (link)
Lecture-11
12 10/1 Probabilistic Databases
(Continued)
Continued Lecture-12
13 10/6 Incomplete Databases Main reading:
(Chapters 19) Abiteboul-Hull-Vianu
Book: Foundations of Databases (link)

Optional reading:
Imielinski-Lipski, JACM 1984
Incomplete Information in Relational Databases (link)
Lecture-13
14 10/8 Inconsistent Databases Reading:
Chomicki, ICDT 2007 (Invited talk)
Consistent Query Answering: Five Easy Pieces (link) Bertossi, SIGMOD Record 2006
Consistent Query Answers in Databases (link) Arenas-Bertossi-Chomicki, PODS 1999
Consistent Query Answers in Inconsistent Databases (link)
Lecture-14
10/11 Review Question-3: Submit homework on uncertain data.
  10/13 Fall Break

Causality

15 10/15 Causality in AI

Midterm Project Progress Report Due
Optional reading:
Halpern-Pearl, UAI 2001
Causes and Explanations: A Structural-Model Approach - Part I : Causes (link)
Pearl, UCLA Tech Report 1999
Probabilities of Causation: Three Counterfactual Interpretations and their Identification (link)
Meliou-Roy-Suciu, VLDB 2014
Tutorial: Causality and Explanations in Databases (link)
Lecture-15
16 10/20 Causality in Databases Main reading:
Meliou-Gatterbauer-Moore-Suciu, PVLDB 2010
The Complexity of Causality and Responsibility for Query Answers and Non-Answers (link)

Optional reading:
Meliou-Gatterbauer-Nath-Suciu, SIGMOD 2011
Tracing Data Errors with View-Conditioned Causality (link)
Lecture-16
17 10/22 Causality in Statistics

Optional reading:
Rubin, Journal of the American Statistical Association, 2005
Causal Inference Using Potential Outcomes: Design, Modeling, Decisions (link)
Lecture-17 Student presentations
No review for this topic. Work on your project!

Data Analysis with Humans

18 10/27 Database Usability Main reading:
Jagadish-Chapman-Elkiss-Jayapandian-Li-Nandi-Yu, SIGMOD 2007
Making Database Systems Usable (link)

Optional reading:
Li-Chan-Maier, VLDB 2015
Query From Examples: An Iterative, Data-Driven Approach to Query Construction (link)
Lecture-18 Student presentations
19 10/29 Crowdsourcing Systems Main reading:
Franklin-Kossmann-Kraska-Ramesh-Xin, SIGMOD 2011
CrowdDB: Answering Queries with Crowdsourcing (link)
Marcus-Wu-Karger-Madden-Miller, PVLDB 2011
Human-powered Sorts and Joins (link)

Optional reading:
Parameswaran-Park- Garcia-Molina -Polyzotis-Widom, CIKM 2012
Deco: Declarative Crowdsourcing (link)
Slides available online
11/1 Review Question-4: Review (0.5 to 1 page) due for "Data Analysis with Humans":
Write a short review of one of the papers.
or
Write about a potential research direction that includes data analysis and human interaction (either as a user or as the crowd), and is related to your own research interest. Also mention the technical challenges, and if any of the approaches covered in the class can possibly help.
20 11/3 Crowdsourcing Operators
(Max)
Main reading:
Guo-Parameswaran- Garcia-Molina, SIGMOD 2012
So Who Won? Dynamic Max Discovery with the Crowd (link)
Davidson-Khanna-Milo-Roy, TODS 2014
Top-k and Clustering with Noisy Comparisons (link)

Optional reading:
Feige-Raghavan-Peleg-Upfal, SIAM Journal on Computing, 1994
Computing with Noisy Information (link)
Lecture-20 Slides available online

Systems for Data Analysis

21 11/5 Tools for ML in Databases Overview (one as main):
Hazy, MAD Skills, MLbase
Slides available online
22 11/10 Tools for Large-scale Analytics Overview (one as main):
Dremel, Shark, Spark
Student presentations
23 11/12 Tools for Visualization Overview (one as main):
Tableau, Graph Lab, Data Wrangler
Slides available online
11/15 Review Question-5:
Hands-on experience on one data-analytics system of your choice (listed or not-listed): Install the system, choose a dataset, run queries, write about your observation, and attach the graphs/tables (< 1 page). You can try to use it for your project. You may want to start early. This is the last review.
24 11/17 Feature Selection in Analytics Main reading:
Zhang-Kumar-Re, SIGMOD 2014
Materialization Optimizations for Feature Selection Workloads (link)
Lecture-24
25 11/19 Efficient Feature Selection Algorithms Student presentations

Conclusions

26 11/24 Project Demonstrations
and Presentations

Final Project Report Due



Reading List

(This list contains papers that will be covered in the lectures as well as additional readings. This list will be updated regularly.)

  • Classical Topics

    (OLAP Cube - Data Warehouse - Data Mining)

    • Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996
      Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals (link)
    • Agarwal-Agrawal-Deshpande-Gupta-Naughton-Ramakrishnan-Sarawagi, VLDB 1996
      On the Computation of Multidimensional Aggregates (link)
    • Harinarayan-Rajaraman-Ullman, SIGMOD 1996
      Implementing data cubes efficiently (link)
    • Gupta-Harinarayan-Rajaraman-Ullman, ICDE 1997
      Index selection for OLAP (link)
    • Agrawal-Srikant, VLDB 1994
      Fast Algorithms for Mining Association Rules in Large Databases (link)
    • Chaudhuri-Dayal, SIGMOD Record 1997
      An Overview of Data Warehousing and OLAP Technology (link)
    • Fang-Shivakumar- Garcia-Molina -Motwani-Ullman, VLDB 1998
      Computing Iceberg Queries Efficiently (link)
    • Xin-Han-Li-Wah, VLDB 2003
      Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration (link)
    • Ross-Srivastava, VLDB 1997
      Fast Computation of Sparse Datacubes (link)
    • Li-Han-Gonzalez, VLDB 2004
      High-Dimensional OLAP: A Minimal Cubing Approach (link)
    • Sathe-Sarawagi, VLDB 2001
      Intelligent Rollups in Multidimensional OLAP Data (link)
    • Sarawagi-Sathe, SIGMOD 2000
      i3: Intelligent, Interactive Investigaton of OLAP data cubes (link)
  • Provenance

    • Buneman-Khanna-Tan, ICDT 2001
      Why and where: A Characterization of Data Provenance (link)
    • Green-Karvounarakis-Tannen, PODS 2007
      Provenance Semirings (link)
    • Cui-Widom-Wiener. TODS 2000
      Tracing the Lineage of View Data in a Warehousing Environment (link)
    • Amsterdamer-Deutch-Tannen, PODS 2011
      Provenance for aggregate queries (link)
    • Cheney-Chiticariu-Tan, Foundations and Trends in Databases 2009
      Provenance in Databases: Why, How, and Where (link)
    • Chapman-Jagadish, SIGMOD 2009
      Why Not? (link)
    • Tran-Chan, SIGMOD 2010
      How to Conquer Why-Not Questions (link)
    • Herschel-Hernandez, PVLDB 2010
      Explaining Missing Answers to SPJUA Queries (link)
    • Herschel-Hernandez-Tan, VLDB 2009
      Artemis: A System for Analyzing Missing Answers (link)
    • Huang-Chen-Doan-Naughton, PVLDB 2008
      On the Provenance of Non-Answers to Queries over Extracted Data (link)
    • Buneman-Khanna-Tan, PODS 2002
      On Propagation of Deletions and Annotations through Views (link)
    • Cong-Fan-Geerts-Luo, TKDE 2011
      On the Complexity of View Update and its Applications to Annotation Propagation (link)
    • Kimelfeld-Vondrak-Williams, PODS 2011
      Maximizing Conjunctive Views in Deletion Propagation (link)
  • Uncertain Data

    • Dalvi-Suciu, JACM 2012
      The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries (link)
    • Dalvi-Suciu, VLDB 2004
      Efficient Query Evaluation on Probabilistic Databases (link)
    • Cavallo-Pittarelli. VLDB 1987
      The Theory of Probabilistic Databases (link)
    • Fuhr, VLDB 1990
      A Probabilistic Framework for Vague Queries and Imprecise Information in Databases (link)
    • Zimanyi, TCS 1997
      Query Evaluation in Probabilistic Relational Databases (link)
    • Re-Suciu, PVLDB 2008
      Approximate Lineage for Probabilistic Databases (link)
    • Re-Dalvi-Suciu, ICDE 2007
      Efficient Top-k Query Evaluation on Probabilistic Data (link)
    • Imielinski-Lipski, JACM 1984
      Incomplete Information in Relational Databases (link)
    • Arenas-Bertossi-Chomicki, PODS 1999
      Consistent Query Answers in Inconsistent Databases (link)
    • Lopatenko-Bertossi, ICDT 2007
      Complexity of Consistent Query Answering in Databases under Cardinality-based and Incremental Repair Semantics (link)
    • Fagin-Kimelfeld-Kolaitis, PODS 2015
      Dichotomies in the Complexity of Preferred Repairs (link)
  • Causality and Explanations

    • Rubin, Journal of the American Statistical Association, 2005
      Causal Inference Using Potential Outcomes: Design, Modeling, Decisions (link)
    • Rosenbaum-Rubin, Biometrika, 1983
      The Central Role of the Propensity Score in Observational Studies for Causal Effects (link)
    • Pearl, The International Journal of Biostatistics 2010
      An Introduction to Causal Inference (link)
    • Halpern-Pearl, UAI 2001
      Causes and Explanations: A Structural-Model Approach - Part I: Causes (link)
    • Meliou-Gatterbauer-Moore-Suciu, PVLDB 2010
      The Complexity of Causality and Responsibility for Query Answers and Non-Answers (link)
    • Meliou-Gatterbauer-Nath-Suciu, SIGMOD 2011
      Tracing Data Errors with View-Conditioned Causality (link)
    • Wu-Madden, PVLDB 2013
      Scorpion: Explaining Away Outliers in Aggregate Queries (link)
    • Roy-Suciu, SIGMOD 2014
      A Formal Approach to Finding Explanations for Database Queries (link)
    • Meliou-Roy-Suciu, VLDB 2014
      Tutorial: Causality and Explanations in Databases (link)
  • Data Analysis with Humans

    • Feige-Raghavan-Peleg-Upfal, SIAM Journal on Computing, 1994
      Computing with Noisy Information (link)
    • Davidson-Khanna-Milo-Roy, TODS 2014
      Top-k and Clustering with Noisy Comparisons (link)
    • Guo-Parameswaran- Garcia-Molina, SIGMOD 2012
      So Who Won? Dynamic Max Discovery with the Crowd (link)
    • Franklin-Kossmann-Kraska-Ramesh-Xin, SIGMOD 2011
      CrowdDB: Answering Queries with Crowdsourcing (link)
    • Marcus-Wu-Karger-Madden-Miller, PVLDB 2011
      Human-powered Sorts and Joins (link)
    • Parameswaran-Park- Garcia-Molina -Polyzotis-Widom, CIKM 2012
      Deco: Declarative Crowdsourcing (link)
    • Jagadish-Chapman-Elkiss-Jayapandian-Li-Nandi-Yu, SIGMOD 2007
      Making Database Systems Usable (link)
    • Tran-Chan-Parthasarathy, SIGMOD 2009
      Query by Output (link)
  • Systems for Data Analysis

    • Xin-Rosen-Zaharia-Franklin-Shenker-Stoica, SIGMOD 2013
      Shark: SQL and Rich Analytics at Scale (link)
    • Zaharia-Chowdhury-Franklin-Shenker-Stoica, HotCloud 2010
      Spark: Cluster Computing with Working Sets (link)
    • Cohen-Dolan-Dunlap-Hellerstein-Welton, PVLDB 2009
      MAD Skills: New Analysis Practices for Big Data (link)
    • Melnik-Gubarev-Long-Romer-Shivakumar-Tolton-Vassilakis, PVLDB 2010
      Dremel: Interactive Analysis of Web-Scale Datasets (link)
    • Kumar-Niu-Re, CACM 2013
      Hazy: Making it Easier to Build and Maintain Big-Data Analytics (link)
    • Feng-Kumar-Recht-Re, SIGMOD 2012
      Towards a Unified Architecture for in-RDBMS Analytics (link)
    • Hellerstein-Re-Schoppmann-Wang-Fratkin-Gorajek-Ng-Welton-Feng-Li-Kumar, PVLDB 2012
      The MADlib Analytics Library or MAD Skills (link)
    • Kraska-Talwalkar-Duchi-Griffith-Franklin-Jordan, CIDR 2013
      MLBase: A Distributed Machine-learning System (link)
    • Tableau (link)
    • GraphLab/Dato (link)
    • Low-Bickson-Gonzalez-Guestrin-Kyrola-Hellerstein, PVLDB 2012
      Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud (link)
    • Data Wrangler (link)
    • Kandel-Paepcke-Hellerstein-Heer, CHI 2011
      Wrangler: Interactive Visual Specification of Data Transformation Scripts (link)
    • Feng-Kumar-Recht-Re, SIGMOD 2012
      Towards a Unified Architecture for in-RDBMS Analytics (link)