next up previous
Next: Space-Efficient Framework for Top-k Up: DATA COMPRESSION Previous: On Searching Compressed String

   
On Entropy-Compressed Text Indexing in External Memory

W.-K. Hon, R. Shah, S. V. Thankachan, and and J. S. Vitter. ``On Entropy-Compressed Text Indexing in External Memory,'' Proceedings of the 16th International Conference on String Processing and Information Retrieval (SPIRE '09), Saariselkä, Finland, August 2009, published in Lecture Notes in Computer Science, 5721 Springer-Verlag, Berlin, Germany, 75-89.

Full text (Adobe pdf format)

A new trend in the field of pattern matching is to design indexing data structures which take the space very close to that required by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes, namely the FM-index of Ferragina and Manzini and the CSA of Grossi and Vitter, achieve this goal by exploiting the Burrows-Wheeler transform (BWT). However, due to the intricate permutation structure of BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. gave an alternative text index which is based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text T of length n drawn from a $\sigma$-sized alphabet set, they achieved an $O(n \log \sigma)$-bit index for T and showed that this index can preserve locality in pattern matching and hence is amenable to be used in external-memory settings. We improve upon this index and show how to apply entropy compression to reduce index space. Our index takes $O(n(H_k+1)) + o(n\log \sigma)$ bits of space where Hk is the kth-order empirical entropy of the text. This is achieved by creating variable length blocks of text using arithmetic coding.


next up previous
Next: Space-Efficient Framework for Top-k Up: DATA COMPRESSION Previous: On Searching Compressed String
Jeff Vitter
2009-11-09