next up previous
Next: High-Order Entropy-Compressed Text Indexes Up: DATA COMPRESSION Previous: Efficient Algorithms for MPEG

   
Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching

R. Grossi and J. S. Vitter. ``Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching,'' SIAM Journal on Computing, 35(2), 2005, 378-407. This paper combines two pieces of work: An extended abstract of the first appears in Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC '00), Portland, OR, May 2000, 397-406. The second line of work is being developed further into a conference publication entitled ``A new analysis of the Burrows-Wheeler transform.''

Full text (gzip-compressed postscript)

Full text (Adobe pdf format)

Slides for talk (gzip-compressed postscript)

Slides for talk (Adobe pdf format)

The proliferation of online text, such as on the World Wide Web and in databases, motivates the need for space-efficient index methods that support fast search. Consider a text T of n binary symbols to index. Given any query pattern P of m binary symbols, the goal is to search for P in T quickly, with T being fully scanned only once, namely, when the index is created. All indexing schemes published in the last thirty years support searching in $\Theta(m)$ worst-case time and require $\Theta(n)$ memory words (or $\Theta(n \log n )$ bits), which is significantly larger than the text itself. In this paper we provide a breakthrough both in searching time and index space under the same model of computation as the one adopted in previous work. Based upon new compressed representations of suffix arrays and suffix trees, we construct an index structure that occupies only O(n) bits and compares favorably with inverted lists in space. We can search any binary pattern P, stored in $O(m/\log n)$ words, in only o(m) time. Specifically, searching takes O(1) time for $m = o(\log n)$, and $O(m/\log n + \log^\epsilon n) = o(m)$ time for $m = \Omega(\log n)$ and any fixed $0 < \epsilon < 1$. That is, we achieve optimal $O(m/\log n)$ search time for sufficiently large $m =
\Omega(\log^{1+\epsilon} n)$. We can list all the $\mathit{occ}$ pattern occurrences in optimal $O(\mathit{occ})$ additional time when $m =
\Omega({\mathop{\rm polylog}}(n))$ or when $\mathit{occ} = \Omega(n^\epsilon)$; otherwise, listing takes $O(\mathit{occ} \, \log^\epsilon n)$ additional time.


next up previous
Next: High-Order Entropy-Compressed Text Indexes Up: DATA COMPRESSION Previous: Efficient Algorithms for MPEG
Jeff Vitter
2008-04-02