next up previous
Next: Theses Up: Jeff Vitter's Curriculum Vitæ Previous: Research Funding

Research Interests

In my research I seek to exploit the rich interdependence between theory and practice. My work in algorithm design and analysis spans several application areas. The challenge in each case is to design algorithms that are both provably efficient and practical to implement. I am also interested in the complementary field of computational complexity, which considers the inherent difficulty of problems and thus allows us to determine when algorithms are optimal.

Efficient algorithms for external memory. My primary interest is the design and analysis of algorithms that are optimal in terms of the number of input/output operations (I/Os) between internal (primary) memory and external (secondary) storage. My recent book serves as a reference for much of the work in the field. We have developed models of parallel block transfer that apply to the new high-performance disk and hierarchical memory systems being developed. Our approach using duality provides the state-of-the-art approach to sorting using parallel disks. Practical algorithms have been developed for a variety of information processing applications, such as sorting, FFT, matrix operations, range searching, and a variety of computational geometry and combinatorial problems. Work in geographic information systems is described below. We can often prove that the algorithms are optimal, in terms of both I/O and internal processing. In terms of algorithm engineering, we have worked on programming environments such as TPIE (Transparent Parallel I/O programming Environment) that allow users to implement efficient external memory algorithms in a high-level manner.

Compressed text indexes and data structures. Exhaustive search through large collections of data is prohibitively slow. Efficient indexing of massive documents is critical. Good indexes for textual and sequence data should be efficient to construct, allow fast access and queries on the text, and be compact and space-efficient. The wavelet tree data structure I introduced with colleagues (not to be confused with wavelets discussed below) is a practical and elegant structure for coding sequences of characters from a multicharacter alphabet, and it is a key tool in modern text indexing. Until about 12 years ago, the best known data structures that supported fast pattern matching required several times more space than the data being indexed! Our work was the first to achieve provably linear-space indexes. We have since developed the first sublinear-space text indexes that provably require only as much space as entropy-compressed text, while still allowing fast and powerful searches. Moreover, the original text is not needed and can be reconstructed from the index. Empirical studies of our data structures show superior performance in practice compared with other methods. We are currently looking at compressed data structures for a variety of other problems.

Data compression, which deals with developing efficient methods for compressing source data so as to reduce space consumption. In the online setting, compression has important uses in network communication. Efficient variants of Huffman and arithmetic coding algorithms have been developed and analyzed. In text compression we have developed algorithms that improve the speed of statistical methods such as PPMC as well as algorithms that are comparable with Ziv-Lempel methods in speed but better in compression. We have developed the first provably good compressed search indexes for text and sequence data, and we are currently examining further improvements. Image compression work includes both lossless image compression and lossy vector quantization; often the algorithms we develop are progressive and thus facilitate browsing with no added overhead. Our FELICS method has been implemented in hardware for the Mars Reconnaissance Orbiter. We have also worked on video compression methods, specifically motion compensation algorithms for low bit rate applications like teleconferencing and rate control algorithms for high bit rate scenarios like MPEG. We developed the paradigm of minimizing the combined measure of rate plus distortion; this rate-distortion optimization has since been incorporated into the H.264/MPEG-4 AVC standard's reference encoder, used widely in the computing and communications industry (e.g., Tandberg Television MPEG-2 EN8100 and SD & HD MPEG-4 iPlex, Ateme H.264 encoder, Grass Valley ViBE encoders, libavcodec, Mainconcept H.264 encoder, Microsoft VC-1 encoder, Theora 1.1-alpha1, x264 H.264 encoder, and Xvid MPEG-4 ASP encoder).

Prediction for caching and prefetching, which deals with developing prediction methods for optimizing system performance. Caching (or paging) usually refers to the process in which the system decides which page to remove from cache when a new page that the user requests is fetched into cache. Prefetching is the active process in which the system uses the user's idle time to predict which page the user will access next and then moves the likely pages into cache in anticipation of the user. We have applied the predictive aspects of optimal data compression methods to develop provably optimal prefetching methods. The methods predict as well in the limit as special-purpose methods tuned to the characteristics of the sequence of page requests. In one case, for sequences generated by Markov sources, the prediction accuracy converges to the intrinsic fault rate. In another, accuracy is optimal in the worst case, compared with all finite-state predictors.

Database access and data mining, which deals with the search for and extraction of useful patterns among data in large depositories, such as in commercial databases, image databases, data warehouses, the World Wide Web, and genomics applications. A theme in our work is the development of methods that are simultaneously efficient in terms of their processing time (often dominated by I/O accesses) and the quality of the results they produce. Much of our I/O work on spatial databases and GIS is described above in the computational geometry section. One key data mining task is learning how to construct rules to classify data items automatically. Prediction, applied above for caching and prefetching, is useful for determining likely future events based upon the past data stored in a database. We have worked on how to cluster data based upon measures of similarity and how to do similarity search, as in image databases. We have introduced wavelets to the database community as a key technique for summarizing, approximating, and predicting data. Wavelets have since become heavily used in database optimization, indexing, data warehousing, image processing, and data mining. We have successfully applied wavelets constructed in an I/O-efficient manner to the problems of selectivity estimation in databases and finding approximate solutions to OLAP (on-line analytic processing) queries. I received the 2009 ACM SIGMOD Test of Time award, for co-authoring the SIGMOD paper from 10 years earlier that has had the most impact in the following decade in terms of research, products, and methodology.

Machine learning and neural networks, which deals with the design and quantitative evaluation of algorithms for learning concepts from examples -- an important component of artificial intelligence. The goal is to construct algorithms or neural networks that can ``learn'' a target concept after seeing a reasonable number of typical examples. Emphasis is put on the efficiency of such algorithms, in terms of the degree of correctness of the hypothesis produced and of the number of examples and storage space required. Theoretical studies have led to new algorithms, quantitative models, and bounds on performance, especially in the realm of parallel learning algorithms, neural networks, map learning for robotics, memory-based systems, and program testing methods.

Computational geometry, which deals with the design and analysis of algorithms for manipulating geometric quantities, such as lines, planes, rectangles, and convex objects. One problem considered deals with efficient methods for eliminating ``hidden'' lines and surfaces in two-dimensional graphical displays. We have investigated efficient algorithms in terms of a new object model of complexity that can better model graphics hardware. We have worked on efficient new algorithms for binary space partitions and geometric clustering, with applications to data compression, machine learning, and graphics. We have considered point location and important subcases of graph problems in planar and euclidean graphs. An important area of investigation is to develop I/O-efficient indexing methods for the massive amounts of geometric data that can be stored in geographic information systems (GIS) and other spatial databases. We are developing efficient algorithms for search, clustering, accessing data in triangulated irregular networks, contour-line extraction, terrain processing, and map overlay.

Parallel processing. My work includes the design of fast parallel combinatorial algorithms and development of new complexity models for parallel computation. We have developed algorithms for transitive closure, shortest paths, and graph drawing in planar structures, cooperative search, and unification. Results have been obtained that demonstrate interesting tradeoffs between running time and amount of parallelism, with applications to VLSI and the utilization of supercomputers. We have introduced complexity classes for general parallel computation that capture a practical notion of ``parallelizability.'' Parallel algorithms for unification and other complete problems in the class $\cal P$ have been developed.

Incremental (or dynamic) computation, which deals with how to update the solution to a problem when the input is changed incrementally. Such techniques are especially important when the problem is of large scale. Complexity models have been developed, and several problems have been classified according to degree of incremental complexity. We have discovered interesting relationships between a problem's potential for parallelization and its potential for dynamization. Efficient incremental algorithms have been developed for a variety of graph problems in restricted domains, including reachability, connectivity, shortest path, maximum flow, and various problems whose general formulations are $\cal NP$-complete.

Random sampling and order statistics, which deals with finding novel and optimum methods for generating various random quantities, such as a random sample of n records from a file containing N records, or a random variate whose distribution depends upon a set of N dynamically changing weights. These methods have important applications in databases, data mining, simulations, statistical surveys, and probabilistic algorithms.

Analysis of hashing methods, concentrating on the study of coalesced hashing and an algorithm for processing sweep line information called hashing with lazy deletion. Related questions in queueing theory deal with the distribution of the maximum queue size in stochastic processes and dynamic data structures.

Design and analysis of data structures and graph algorithms. This includes the mathematical analysis of fundamental combinatorial structures such as families of trees, the design and analysis of priority queue algorithms that are both theoretically optimal and practical to use, and the development of algorithms for unification and its variants. Work on hashing and Euclidean graph algorithms is described earlier.


next up previous
Next: Theses Up: Jeff Vitter's Curriculum Vitæ Previous: Research Funding
Jeff Vitter
2009-11-16