We present a unified algorithmic framework to obtain nearly
optimal space
bounds for text compression and compressed text indexing,
apart from
lower-order terms. For a text T of n symbols drawn from
an
alphabet
,
our bounds are stated in terms of the
hth-order
empirical entropy of the text, Hh. In particular, we
provide a tight
analysis of the Burrows-Wheeler transform (BWT)
establishing a bound of
bits, where
denotes
the
asymptotical number of bits required to store the empirical
statistical
model for contexts of order up to h appearing
in T. Using the same
framework, we also obtain an implementation of the
compressed suffix array
(CSA) which achieves
bits of space while still retaining competitive
full-text indexing
functionality.
The novelty of the proposed framework lies in its use of the
finite set model instead of the empirical probability model
(as in previous work), giving us new insight into the design
and analysis of our algorithms. For example, we show that
our analysis gives improved bounds since
,
where
and
do not depend on the
text length n, while
is the modified
hth-order empirical entropy of T. Moreover, we show a
strong relationship between a compressed full-text index and
the succinct dictionary problem. We also examine the
importance of lower-order terms, as these can dwarf any
savings achieved by high-order entropy. We report further
results and tradeoffs on high-order entropy-compressed text
indexes in the paper.