next up previous
Next: When Indexing Equals Compression: Up: DATA COMPRESSION Previous: Compressed Suffix Arrays and

   
High-Order Entropy-Compressed Text Indexes

R. Grossi, A. Gupta, and J. S. Vitter. ``High-Order Entropy-Compressed Text Indexes,'' Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '03), Baltimore, MD, January 2003, 841-850.

Full text (Adobe pdf format) of the SODA 2003 conference paper

We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet $\Sigma$, where each symbol is encoded by $\log \vert\Sigma\vert$ bits. We show that compressed suffix arrays use just $nH_h + \smash{O(n \log \log n/
\log_{\vert\Sigma\vert} n)}$ bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in $\smash{O(m \log
\vert\Sigma\vert + {\mathop{\rm polylog}}(n))}$ time. The term $H_h \leq \log \vert\Sigma\vert$ denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that Hn = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. We also report further results and tradeoffs on high-order entropy-compressed text indexes.


next up previous
Next: When Indexing Equals Compression: Up: DATA COMPRESSION Previous: Compressed Suffix Arrays and
Jeff Vitter
2008-04-02