Full text (gzip-compressed postscript)
Slides for talk (gzip-compressed postscript)
Slides for talk (Adobe pdf format)
The proliferation of online text, such as on the World Wide Web and
in databases, motivates the need for space-efficient index methods
that support fast search. Consider a text T of n binary symbols
to index. Given any query pattern P of m binary symbols, the
goal is to search for P in T quickly, with T being fully
scanned only once, namely, when the index is created. All indexing
schemes published in the last thirty years support searching in
worst-case time and require
memory words (or
bits), which is significantly larger than the
text itself. In this paper we provide a breakthrough both in
searching time and index space under the same model of computation
as the one adopted in previous work. Based upon new compressed
representations of suffix arrays and suffix trees, we construct an
index structure that occupies only O(n) bits and compares
favorably with inverted lists in space. We can search any binary
pattern P, stored in
words, in only o(m) time.
Specifically, searching takes O(1) time for
,
and
time for
and any fixed
.
That is, we achieve optimal
search time for sufficiently large
.
We can list all the
pattern occurrences in optimal
additional time when
or when
;
otherwise, listing takes
additional time.