Research Projects
RIOT: I/O Efficient Numerical Computing without SQL
R is a numerical computing environment that is widely popular for statistical data analysis. Like many such environments, R performs poorly on large datasets whose sizes exceed that of physical memory. I/O efficiency and query optimization, by contrast, are standard features of modern database systems. In an attempt to bridge the gap between these two very different fields, we developed RIOT-DB, an initial R package prototype that uses a relational database system as a backend. Through automatic query generation and careful use of the relational database, we were able to manipulate large amounts of data without requiring them to reside completely in memory.
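To make the idea of automatic query generation concrete, here is a minimal sketch in Python (not R, and not RIOT-DB's actual code). It assumes a hypothetical layout in which each vector is stored as a table of (index, value) rows in SQLite, and the helper names `store_vector` and `elementwise_sql` are invented for illustration. An element-wise operation is turned into a join query that the database executes, so the program never has to hold both operands in memory as native arrays.

```python
import sqlite3


def store_vector(conn, name, values):
    """Store a vector as a table of (i, v) rows; this layout is an assumption."""
    conn.execute(f"CREATE TABLE {name} (i INTEGER PRIMARY KEY, v REAL)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?)", enumerate(values, start=1)
    )


def elementwise_sql(op, left, right):
    """Generate a query that computes `left op right` element by element."""
    if op not in {"+", "-", "*", "/"}:
        raise ValueError(f"unsupported operator: {op}")
    return (
        f"SELECT a.i, a.v {op} b.v AS v "
        f"FROM {left} AS a JOIN {right} AS b ON a.i = b.i "
        f"ORDER BY a.i"
    )


conn = sqlite3.connect(":memory:")
store_vector(conn, "x", [1.0, 2.0, 3.0])
store_vector(conn, "y", [10.0, 20.0, 30.0])

# The user-level expression x + y becomes a generated query; the database
# streams the result rows back instead of materializing an array up front.
query = elementwise_sql("+", "x", "y")
result = [v for _, v in conn.execute(query)]
print(result)
```

In a transparent system the user would simply write `x + y`; generating and running the query would happen behind overloaded operators.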
Despite the overhead and inadequacy of generic database systems in handling array data and numerical computation, RIOT-DB significantly outperforms R in many large-data scenarios, thanks to a suite of high-level, inter-operation optimizations that integrate seamlessly into R. Unlike previous approaches, which require users to learn new languages and rewrite their programs to interface with a database, RIOT-DB insulates its users from anything database-related.
Our experience implementing RIOT-DB has provided many insights, the most important of which is that I/O efficiency cannot be achieved without high-level, inter-operation optimizations. Even though the generic database system used by RIOT-DB carries enormous overhead in storing and processing arrays, its pipelined execution model and query optimizer are able to turn the tide in its favor in many cases. With a specialized storage engine, algorithms, and database-style optimization strategies tailored towards numerical computing, we expect the next generation of RIOT to make further gains in I/O efficiency. Another pleasant surprise from this experience is that transparency is indeed possible, using ideas such as deferred evaluation and clever mechanisms to implement them within the confines of R. As future work, we plan to investigate how to apply our techniques to other language environments beyond those intended for numerical computing.
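Deferred evaluation can be illustrated with a small Python sketch (again an illustration under assumptions, not RIOT's implementation in R). The classes `Expr`, `Vec`, `Const`, and `BinOp` are hypothetical: arithmetic on vectors only builds an expression tree, and work happens when an element is actually requested. Because the whole expression is visible at that point, no intermediate vector such as `x + y` is ever materialized, and only the elements that are needed get read.

```python
import operator


class Expr:
    """Base class: arithmetic builds a tree instead of computing a result."""

    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, wrap(other))

    def __getitem__(self, i):
        # Evaluation is triggered only when a value is requested.
        return self.at(i)


class Vec(Expr):
    """Leaf node wrapping stored data; `reads` counts element accesses."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def at(self, i):
        self.reads += 1
        return self.data[i]


class Const(Expr):
    """Leaf node for a scalar that is broadcast to every position."""

    def __init__(self, value):
        self.value = value

    def at(self, i):
        return self.value


class BinOp(Expr):
    """Interior node: applies `fn` to one element from each child."""

    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def at(self, i):
        return self.fn(self.left.at(i), self.right.at(i))


def wrap(value):
    """Turn plain numbers into Const nodes; leave expressions unchanged."""
    return value if isinstance(value, Expr) else Const(value)


x = Vec([float(i) for i in range(1000)])
y = Vec([1.0] * 1000)

z = (x + y) * 2        # builds a tree; no element has been read yet
reads_before = x.reads
value = z[5]           # reads exactly one element from x and one from y
print(value, reads_before, x.reads, y.reads)
```

A real system would evaluate in blocks rather than one element at a time and could rewrite the tree before running it; the sketch only shows why deferring work exposes the whole expression to optimization.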
Publications
- Y. Zhang, H. Herodotou, and J. Yang. RIOT: I/O-Efficient Numerical Computing without SQL. In Proc. of the Fourth Biennial Conf. on Innovative Data Systems Research (CIDR '09), January 2009.