next up previous
Next: SBC-tree: Efficient Indexing for Up: DATA COMPRESSION Previous: Compressed Dictionaries: Space Measures,

   
An Algorithmic Framework for Compression and Text Indexing

R. Grossi, A. Gupta, and J. S. Vitter. ``An Algorithmic Framework for Compression and Text Indexing,'' being submitted to journal. This work is an extension of two separate pieces of research in conference form: ``High-Order Entropy-Compressed Text Indexes,'' Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '03), Baltimore, MD, January 2003, 841-850; and ``A Nearly Tight Analysis of the Burrows Wheeler Transform,'' Proceedings of the 5th Workshop on Analytical Algorithmics and Combinatorics (ANALCO '08), San Francisco, CA, January 2008.

Full text (Adobe pdf format)

We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text T of n symbols drawn from an alphabet $\Sigma$, our bounds are stated in terms of the hth-order empirical entropy of the text, Hh. In particular, we provide a tight analysis of the Burrows-Wheeler transform (BWT) establishing a bound of $nH_h + M(T,\Sigma,h)$ bits, where $M(T,\Sigma,h)$ denotes the asymptotical number of bits required to store the empirical statistical model for contexts of order up to h appearing in T. Using the same framework, we also obtain an implementation of the compressed suffix array (CSA) which achieves $nH_h + M(T,\Sigma,h) + O(n \lg\lg
n/\lg_{\vert\Sigma\vert}
n)$ bits of space while still retaining competitive full-text indexing functionality.

The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds since $M(T,\Sigma,h) \leq \min \{ g'_h \lg (n/g'_h+1), H_h^* n +
\lg n + g''_h
\}$, where $g'_h = O(\vert\Sigma\vert^{h+1})$ and $g''_h =
O(\vert\Sigma\vert^{h+1} \lg
\vert\Sigma\vert^{h+1})$ do not depend on the text length n, while $H_h^* \geq
H_h$ is the modified hth-order empirical entropy of T. Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.


next up previous
Next: SBC-tree: Efficient Indexing for Up: DATA COMPRESSION Previous: Compressed Dictionaries: Space Measures,
Jeff Vitter
2008-06-14