Master's Defense

Efficient Algorithms for Querying Large, Complex Data

Speaker:Stavros Sintos
ssintos at cs.duke.edu
Date: Tuesday, October 10, 2017
Time: 10:00am - 12:00pm
Location: D344 LSRC, Duke

Abstract

Query processing is an important problem in many research fields including database systems, data mining, and geometric computing. The goal is to preprocess input data into a data structure to efficiently answer queries. With the data sets becoming increasingly large and complex, queries are also becoming complex, therefore new challenging problems have emerged in the area of query processing. We focus on two broad themes.

First, we investigate many problems related to answering queries in the presence of uncertainty in data. Motivated by applications in sensor network, data cleaning, and data integration, there has been a growing interest in managing uncertain data. Uncertainty is typically captured using stochastic data models, and querying data requires either statistics about the probabilistic behavior of the underlying data, or data cleaning to reduce the uncertainty of the query answer. We present efficient data structures for answering range-max queries on uncertain data. Given a collection of uncertain points in the d-dimensional space, where each point is associated with a value, the goal is to preprocess the points to a data structure such that given a query rectangle R, compute the expected maximum or the most likely maximum of the values of the points that lie in R. Next, we study data cleaning problems. Given a function f over the uncertain objects of a database, we propose algorithms for choosing a subset of objects to clean with cost at most C to i) minimize the variance of f or ii) maximize the surprise of f, in average over all user queries.

The second theme is computing data summaries to answer queries quickly. We present algorithms to create a small representation Q of a much larger data set P such that for every user preference, there is always an object in Q whose preference score is not much worse than the score of the k-th most preferred object in P. Our method can be used to answer efficiently top-k queries without storing the full data set P.

We conclude by discussing current research and future research directions.

Advisor(s): Pankaj Agarwal
Committee: Kamesh Munagala, Cynthia Rudin, Jun Yang