Duke Database Research Group

Welcome to the Duke Database Research Group! We are broadly interested in database and information systems as well as their applications. The group was established in 2001 and is currently supported by the National Science Foundation, the National Institutes of Health, Duke University, and IBM Corporation.

Our recent research focuses on derived data maintenance. Derived data is the result of applying some transformation, structural or computational, to base data. The use of derived data to facilitate access to base data is a recurring technique in many areas of computer science. Used in hardware and software caches, derived data speeds up access to base data. Used in replicated systems, it improves reliability and performance of applications in a wide-area network. Used as index structures, it provides fast alternative access paths to base data. Used as materialized views in databases or data warehouses, it improves the performance of complex queries over base data. Used as synopses, it provides fast, approximate answers to queries or statistics needed for cost-based optimization.

Derived data may vary in complexity: it can be a simple copy of base data, in the cases of caching and replication, or it can be the result of complex transformations, in the cases of indexes and materialized views. Derived data may also vary in accuracy: caches and materialized views are usually exact, while synopses are approximate. Regardless of the varying forms, purposes, complexity, and accuracy of derived data, it must be maintained when base data is updated. Thus, derived data maintenance is a fundamental problem in computer science. It is also an evolving problem: existing techniques are constantly challenged by the explosive growth in data volume and number of data producers and consumers, and by increasing diversity in data formats.

Traditionally, derived data maintenance has been tackled separately in different contexts, e.g., index updates and materialized view maintenance in databases, cache coherence and replication protocols in distributed systems. Although they share the same underlying theme, these techniques have been developed and applied largely disjointly. Newer and more complex data management tasks, however, call for creative combinations of the traditionally separate ideas.
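To make the maintenance problem concrete, here is a minimal Python sketch of one kind of derived data: a materialized per-key sum that is updated incrementally as base rows are inserted and deleted, rather than recomputed from scratch. The class `SumView` and its methods are hypothetical names chosen for illustration; this is not code from the group's systems, only one simple instance of the general idea.

```python
class SumView:
    """Hypothetical materialized view: sum of amounts per key (illustration only)."""

    def __init__(self):
        self.base = []     # base data: list of (key, amount) rows
        self.sums = {}     # derived data: key -> sum of amounts
        self.counts = {}   # auxiliary derived data: key -> number of rows

    def insert(self, key, amount):
        """Apply a base insertion and patch the view in O(1)."""
        self.base.append((key, amount))
        self.sums[key] = self.sums.get(key, 0) + amount
        self.counts[key] = self.counts.get(key, 0) + 1

    def delete(self, key, amount):
        """Apply a base deletion and patch the view incrementally."""
        self.base.remove((key, amount))   # raises ValueError if the row is absent
        self.sums[key] -= amount
        self.counts[key] -= 1
        if self.counts[key] == 0:         # last row for this key is gone
            del self.sums[key]
            del self.counts[key]

    def recompute(self):
        """Rebuild the view from base data; the slow baseline for comparison."""
        fresh = {}
        for key, amount in self.base:
            fresh[key] = fresh.get(key, 0) + amount
        return fresh
```

After any sequence of `insert` and `delete` calls, `view.sums` should equal `view.recompute()`. The row count per key is kept alongside the sum so that the view can tell when a key has disappeared from the base data, which the sum alone cannot reveal.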

We are actively investigating techniques and applications of derived data maintenance in the following contexts:

  • XML data processing. Because of its portability and flexibility, XML is rapidly becoming a standard format for exchanging data over the Internet. However, the theoretical and practical foundation for XML processing is still not as strong as that of the relational model. Although many XML operations can be implemented on top of relational databases, doing so incurs the overhead of converting XML to and from its relational representation. Our goal is to develop efficient native XML processing techniques, using a combination of derived data including caching, indexing, and materialized views. We are also interested in I/O-efficient handling of disk-resident XML data. The main challenge is how to handle the rich and flexible structure of XML data, which consists of semi-structured trees or graphs rather than tables with prescribed formats.
  • Continuous query processing. In contrast to a traditional query, which runs once against a snapshot of the database, a continuous query is a standing query that continuously generates new results (or changes to results) as database updates continue to arrive in a stream. In this sense, a continuous query is a form of derived data as well. We are building an efficient and scalable continuous query system featuring an expressive SQL-based subscription language, which allows users to control precisely when they want to get notified of updates to their continuous queries. An example query would be "the top ten most heavily loaded machines in the cluster and their current load" with the notification condition "the average load across the cluster has changed by more than 10% since the last notification." Challenges in building this system include scaling to a large number of complex subscriptions, "stateful" subscription processing, and interfacing data processing servers with the network for scalable dissemination of notifications.
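
The example subscription in the continuous query item can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the group's system or its SQL-based subscription language: the class `TopLoadSubscription`, its parameters, the choice to always notify on the first update, and the choice to measure change relative to the average at the last notification are all assumptions made here.

```python
import heapq


class TopLoadSubscription:
    """Hypothetical standing query: top-k loaded machines, with a notification condition."""

    def __init__(self, k=10, threshold=0.10):
        self.k = k
        self.threshold = threshold   # relative change in average load that triggers a notification
        self.loads = {}              # current state: machine -> load
        self.total = 0.0             # running sum, so the average is maintained incrementally
        self.last_avg = None         # average load at the last notification

    def update(self, machine, load):
        """Apply one update from the stream; return the top-k list if a notification fires, else None."""
        self.total += load - self.loads.get(machine, 0.0)
        self.loads[machine] = load
        avg = self.total / len(self.loads)
        if self.last_avg is not None and not self._changed(avg):
            return None              # condition not met: stay silent
        self.last_avg = avg          # remember state as of this notification
        return heapq.nlargest(self.k, self.loads.items(), key=lambda kv: kv[1])

    def _changed(self, avg):
        """True if the average moved by more than the threshold since the last notification."""
        if self.last_avg == 0:
            return avg != 0
        return abs(avg - self.last_avg) / abs(self.last_avg) > self.threshold
```

The subscription has to remember the average as of its last notification, which is one small example of the "stateful" processing mentioned above: whether an update triggers a notification depends on the subscription's own history, not only on the current database state.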

In addition to work on derived data maintenance, we have tackled a number of other problems, such as bibliographic data extraction and cleansing, incorporating Web search techniques in relational databases, temporal database implementation, and query optimization over heterogeneous data sources. Please refer to our publications for details.

Last updated Mon Aug 21 14:27:49 EDT 2006