Run-Length-Encoding (RLE) is a data compression technique
that is used in various applications, e.g., biological se-
quence databases, multimedia, and facsimile transmission.
One of the main challenges is how to operate, e.g.,
indexing, searching, and retrieval, on the compressed data
without decompressing it. In this paper, we present the
String B-tree for Compressed sequences, termed the SBC-tree,
for indexing and searching RLE-compressed sequences of
arbitrary length. The SBC-tree is a two-level index
structure based on the well-known String B-tree and a
3-sided range query structure. The SBC-tree supports
substring as well as prefix matching, and range search
operations over RLE-compressed sequences. The SBC-tree has
an optimal external-memory space complexity of O(N/B) pages,
where N is the total length of the compressed sequences, and
B is the disk page size. The insertion and deletion of all
suffixes of a compressed sequence of length m takes
I/O operations. Substring matching, prefix
matching, and range search execute in an optimal
I/O operations, where |p| is the length of the
compressed query pattern and T is the query output size. We
present also two variants of the SBC-tree: the SBC-tree that
is based on an R-tree instead of the 3-sided structure, and
the one-level SBC-tree that does not use a two-dimensional
index.
These variants do not have provable worst-case theoretical
bounds for search operations, but perform well in practice.
The SBC-tree index is realized inside PostgreSQL in the
context of a biological protein database
application. Performance results illustrate that using the
SBC-tree to index
RLE-compressed sequences achieves up to an order of
magnitude reduction in storage, up to 30% reduction in I/Os
for
the insertion operations, and retains the optimal search
performance achieved by the String B-tree over the uncompressed sequences.