Research Overview

Research Interests

My research interests are in large-scale data processing systems and database systems. In particular, my work focuses on ease of use, manageability, and automated tuning of both centralized and distributed data-intensive computing systems. In addition, I am interested in applying database techniques in other areas such as scientific computing, bioinformatics, and numerical analysis.

I believe research is needed not only to acquire data and formulate theories, but also to apply those theories and use that data in the real world. Hence, I always strive to get involved in research that can lead to practical applications and have a major impact outside the academic community.

Research Projects

Starfish: A Self-tuning System for Big Data Analytics

Timely and cost-effective analytics over "Big Data" is now a key ingredient for success in businesses and scientific disciplines. The Hadoop MapReduce platform is a popular choice for big data analytics. Unfortunately, Hadoop's performance out of the box leaves much to be desired, causing suboptimal use of resources, time, and money. We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop and adapts to system workloads and user needs to provide good performance automatically, without requiring users to understand and manipulate the many tuning knobs in the Hadoop platform. Read more...

Query Optimization Techniques for Partitioned Tables

Table partitioning has evolved into a powerful mechanism to improve the overall performance and manageability of database systems, but is not utilized effectively during query optimization. We have developed new techniques to generate efficient plans for SQL queries involving multiway joins over partitioned tables. Our techniques are designed for easy incorporation into bottom-up query optimizers that are in wide use today. We have prototyped these techniques in PostgreSQL. Read more...
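One reason partitioning matters to the optimizer can be shown with a small sketch. This is not the technique developed in this project, only a minimal illustration of the general idea: for an equi-join on the partitioning key, two range partitions need to be joined only if their key ranges overlap. The `Partition` type, the `child_joins` helper, and the example ranges are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Partition(NamedTuple):
    """Hypothetical range partition: keys k with lo <= k < hi."""
    name: str
    lo: int
    hi: int


def overlaps(a: Partition, b: Partition) -> bool:
    # Two half-open ranges overlap when each starts before the other ends.
    return a.lo < b.hi and b.lo < a.hi


def child_joins(r_parts: List[Partition],
                s_parts: List[Partition]) -> List[Tuple[str, str]]:
    # Keep only the partition pairs that can produce matching rows
    # for an equi-join on the partitioning key.
    return [(r.name, s.name)
            for r in r_parts
            for s in s_parts
            if overlaps(r, s)]


# Assumed example layout: R and S are partitioned on the join key
# with different range boundaries.
r_parts = [Partition("R1", 0, 100), Partition("R2", 100, 200)]
s_parts = [Partition("S1", 0, 50), Partition("S2", 50, 150),
           Partition("S3", 150, 200)]

pairs = child_joins(r_parts, s_parts)
print(pairs)
```

In this assumed layout only four of the six possible partition pairs need to be joined; an optimizer that treats each table as a single unit cannot exploit that.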

Automating the Process of SQL Tuning

SQL tuning, the attempt to improve a poorly-performing execution plan produced by the database query optimizer, is a critical aspect of database performance tuning. For example, the optimizer may pick a poor join order in the plan, overlook an important index, use a nested-loop join when a hash join would have done better, or cause an expensive, but avoidable, sort to happen. Today, some form of manual intervention is needed to correct these mistakes. Our goal is to fully automate the process of SQL tuning using an experiment-driven approach. A nontrivial challenge is in choosing a minimal set of experiments to conduct in order to reach a satisfactory plan quickly. A new system has been prototyped using PostgreSQL. Read more...
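To make one of the mistakes listed above concrete, the sketch below compares a nested-loop join with a hash join on the same inputs. It is a toy in-memory illustration, not code from the prototype; the function names and sample rows are hypothetical. Both functions return the same result, but the nested-loop version compares every pair of rows, while the hash version builds a lookup table once and probes it.

```python
from collections import defaultdict


def nested_loop_join(left, right, key_l, key_r):
    # Compares every left row with every right row: len(left) * len(right) checks.
    return [(l, r) for l in left for r in right if l[key_l] == r[key_r]]


def hash_join(left, right, key_l, key_r):
    # Build phase: index the right input by its join key.
    table = defaultdict(list)
    for r in right:
        table[r[key_r]].append(r)
    # Probe phase: one lookup per left row.
    return [(l, r) for l in left for r in table.get(l[key_l], [])]


# Hypothetical sample data.
orders = [
    {"id": 1, "cust": 10},
    {"id": 2, "cust": 20},
    {"id": 3, "cust": 10},
    {"id": 4, "cust": 30},
]
customers = [
    {"cust": 10, "name": "a"},
    {"cust": 20, "name": "b"},
]

nl_result = nested_loop_join(orders, customers, "cust", "cust")
hj_result = hash_join(orders, customers, "cust", "cust")
print(len(nl_result), nl_result == hj_result)
```

Which join is cheaper in a real system depends on input sizes, available memory, and indexes, which is why a plan that looks reasonable to the optimizer can still be the wrong choice.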

RIOT: I/O Efficient Numerical Computing without SQL

RIOT-DB is a system that enables efficient processing of scientific programs written in R that manipulate large amounts of data, without requiring the data to reside completely in memory. The novelty in RIOT-DB lies in automatic query generation and storage management while using a relational database system as a backend. The next generation of RIOT will make further gains in I/O efficiency for numerical computing with a specialized storage engine, algorithms, and multi-query optimization strategies. Read more...

Dynamic Scheduling of Query Mixes in QShuffler

QShuffler, a query scheduler, takes into consideration the interaction among concurrent queries in order to minimize the completion time of business intelligence workloads. The challenges include modeling of interactions, experiment design for learning and maintaining the models, and algorithms for interaction-aware scheduling. QShuffler is being prototyped in IBM DB2. Read more...

Software Releases

License Agreement Overview

All software projects are distributed under Duke University's Software License Agreement for academic and research (non-commercial) purposes. You may also license the software under a specially negotiated non-exclusive commercial use license. The term "commercial use" is defined broadly: if the software is used for commercial gain or to further any commercial purpose, a commercial use license is required. If you have any questions about whether your use would be considered commercial, or if you would like to negotiate a non-exclusive commercial use license, please contact me at: hero at cs dot duke dot edu.

Starfish: A Self-tuning System for Big Data Analytics

Starfish can be employed by users to (a) get a deep understanding of a MapReduce program's behavior during execution, (b) ask hypothetical questions on how the program's behavior will change when parameter settings, cluster resources, or data properties change, and (c) ultimately optimize the program. Get the source code here.

Xplus: A SQL-Tuning-Aware Query Optimizer

Xplus is a new SQL-tuning-aware query optimizer that goes beyond the traditional plan-first-execute-next approach; Xplus is able to run some (sub)plans proactively, collect monitoring data from the runs, and iterate. Get the source code here.

Publications