next up previous
Next: Dynamic Rank/Select Dictionaries with Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: A Cache-Oblivious Index for

   
The SBC-tree: An Index for Run-Length Compressed Sequences

M. Y. Eltabakh, W.-K. Hon, R. Shah, W. Aref, and J. S. Vitter. ``The SBC-tree: An Index for Run-Length Compressed Sequences,'' in preparation. An extended abstract appears in Proceedings of the 11th International Conference on Extending Database Technology (EDBT '08), Nantes, France, March 2008, published in Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany.

Full text (Adobe pdf format)

Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological se- quence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m takes $O(m \log_B (N+m))$ I/O operations. Substring matching, prefix matching, and range search execute in an optimal $O(\log_B N +
(\vert p\vert+T)/B)$ I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. We present also two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.


next up previous
Next: Dynamic Rank/Select Dictionaries with Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: A Cache-Oblivious Index for
Jeff Vitter
2008-04-02