Full text (Adobe pdf format) of the SODA 2003 conference paper
We present a novel implementation of compressed suffix arrays exhibiting
new tradeoffs between search time and space occupancy for a given text
(or sequence) of n symbols over an alphabet
,
where each
symbol is encoded by
bits. We show that compressed
suffix arrays use just
bits,
while retaining full text indexing functionalities, such as searching
any pattern sequence of length m in
time. The term
denotes the hth-order empirical
entropy of the text, which means that our index is nearly optimal in
space apart from lower-order terms, achieving asymptotically the
empirical entropy of the text (with a multiplicative constant 1).
If the text is highly compressible so that
Hn = o(1) and the alphabet
size is small, we obtain a text index with o(m) search time that
requires only o(n) bits. We also report further results and tradeoffs
on high-order entropy-compressed text indexes.