Algorithms for Analyzing Spatio-temporal Data

Ph. D. Defense
Speaker Name
Abhinandan Nath
Date and Time

In today's age, huge data sets are becoming ubiquitous. In addition to their size, most of these data sets are often noisy, have outliers, and are incomplete. Hence, analyzing such data is challenging. We look at applying geometric techniques to tackle some of these challenges, with an emphasis on designing provably efficient algorithms.

With the massive amounts of data available today, it is common to store and process data using multiple machines. Parallel programming frameworks such as MapReduce and its variants are becoming popular for handling such large data. We present the first provably efficient algorithms to compute, store, and query data structures for range queries and approximate nearest neighbor queries in a popular parallel computing abstraction that captures the salient features of MapReduce and other massively parallel communication (MPC) models. Our algorithms are competitive in terms of running time and workload to their classical counterparts.

We propose parallel algorithms in the MPC model for processing large terrain elevation data (represented as a 3D point cloud) that are too big to fit on one machine. In particular, we present a simple randomized algorithm to compute the Delaunay triangulation of the $xy$-projections of the input points. Next we describe an efficient algorithm to compute the contour tree (a topological descriptor that succinctly encodes the contours of a terrain) of the resulting triangulated terrain.

We then look at comparing real-valued functions, by computing a distance function between their merge trees (a small-sized descriptor that succinctly captures the sublevel sets of a function). Merge trees are robust to noise in the data, and can be used effectively as proxies for the data sets themselves in some cases. We also use it to compute the Gromov-Hausdorff distance, a natural way to measure distance between two metric spaces. We give the first proof of hardness and the first non-trivial approximation algorithm for computing the Gromov-Hausdorff distance between geodesic metrics in trees.

Finally we look at the problem of capturing shared portions between large number of input trajectories. We formulate it as a subtrajectory clustering problem - the clustering of subsequences of trajectories. We propose a new model for clustering subtrajectories. Each cluster of subtrajectories is represnted as a \emph{pathlet}, a sequence of points that is not necessarily a subsequence of an input trajectory. We present a single objective function for finding the optimal collection of pathlets that best represents the trajectories taking into account noise and other artifacts of the data. We show that the subtrajectory clustering problem is NP-hard and present fast approximation algorithms. We further improve the running time of our algorithm if the input trajectories are ``well-behaved". We also present experimental results on both real and synthetic data sets.

Advisor: Pankaj Agarwal, Committee: Kamesh Munagala, Rong Ge, Yusu Wang