Durability Queries on Temporal Data
Temporal data is ubiquitous in everyday life, but it tends to be noisy and often exhibits transient patterns. To make better decisions with data, we must avoid jumping to conclusions based on a few particular query results or observations. Instead, a useful perspective is to consider "durability", or, intuitively speaking, to look for results that are robust and stand "the test of time". This thesis studies durability queries on temporal data that return durable results efficiently and effectively.
The focus of this thesis is two-fold: (1) design meaningful and practical notions of durability (and corresponding queries) on different types of temporal data, and (2) develop efficient techniques for durability query processing. We first study sequence-based temporal datasets, where each temporal object has a series of values indexed by time. Durability queries ask for objects whose (snapshot) values were among the top k for at least some fraction of the times during a given time interval; e.g., "from 2013 to 2016, United Airlines had the highest stock price among American-based airline companies for at least 80% of the time." Second, we consider instant-stamped temporal datasets, where each data record is stamped by a time instant. Here, durability queries look for records that stand out among nearby records (defined by a time window) and retain their supremacy for a long period of time; e.g., "On January 22, 2006, Kobe Bryant dropped 81 points against the Toronto Raptors, a scoring record that has yet to be broken." Finally, going beyond analyzing historical data, we investigate the notion of durability into the future, where durability needs to be predicted by performing stochastic simulation of temporal models.
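To make the first query type concrete, the following is a minimal brute-force sketch of a durable top-k query over sequence-based data. It is an illustration of the definition only, not the algorithms developed in the thesis. The function name durable_top_k, the parameter tau (the required fraction of time steps), and the sample values are hypothetical, and time is assumed to be discretized into integer steps.

```python
from typing import Dict, List, Sequence


def durable_top_k(
    series: Dict[str, Sequence[float]],
    k: int,
    start: int,
    end: int,
    tau: float,
) -> List[str]:
    """Return the objects ranked in the top k for at least a fraction
    tau of the time steps in the closed interval [start, end].

    Naive baseline: re-ranks every object at every time step.
    Ties are broken by the insertion order of `series` (an assumption
    made only for this sketch).
    """
    if k < 1 or start < 0 or end < start:
        raise ValueError("require k >= 1 and 0 <= start <= end")

    steps = end - start + 1
    counts = {name: 0 for name in series}

    for t in range(start, end + 1):
        # Rank all objects by their snapshot value at time t, highest first.
        ranked = sorted(series, key=lambda name: series[name][t], reverse=True)
        for name in ranked[:k]:
            counts[name] += 1

    # Keep objects whose top-k count meets the durability threshold.
    return sorted(name for name, c in counts.items() if c >= tau * steps)


# Hypothetical sample data: three objects observed at five time steps.
prices = {
    "A": [5, 6, 7, 3, 8],
    "B": [4, 7, 6, 9, 2],
    "C": [1, 2, 3, 4, 5],
}

top1_half = durable_top_k(prices, k=1, start=0, end=4, tau=0.5)
top2_most = durable_top_k(prices, k=2, start=0, end=4, tau=0.75)
```

This baseline sorts all objects at every time step, so its cost grows with both the interval length and the number of objects; avoiding that kind of exhaustive scan is what motivates the indexing and approximation techniques described next.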
To answer durability queries across these problem settings, we apply principled approaches to design fast, scalable algorithms and indexing methods. Our solutions broadly combine geometric, statistical, and approximate query processing techniques to provide a meaningful balance between query efficiency and result quality, along with theoretical worst-case (or average-case) guarantees.
Zoom link: https://duke.zoom.us/j/7668913573