Given a set
of D strings of
total length n, our task is to report the ``most
relevant'' strings for a given query pattern P. This
involves somewhat more advanced query functionality than the
usual pattern matching, as some notion of ``most relevant''
is involved. In information retrieval literature, this task
is best achieved by using inverted indexes. However,
inverted indexes work only for some predefined set of
patterns. In the pattern matching community, the most
popular pattern-matching data structures are suffix trees
and suffix arrays. However, a typical suffix tree search
involves going through all the occurrences of the pattern
over the entire string collection, which might be a lot more
than the required relevant documents.
The first formal framework to study such kind of retrieval
problems was given by Muthukrishnan. He considered two
metrics for relevance: frequency and proximity. He took a
threshold-based approach on these metrics and gave data
structures taking
words of space. We study
this problem in a slightly different framework of reporting
the top k most relevant documents (in sorted order) under
similar and more general relevance metrics. Our framework
gives linear space data structure with optimal query times
for arbitrary score functions. As a corollary, it improves
the space utilization for the problems considered by
Muthukrishnan while maintaining optimal query performance.
We also develop compressed variants of these data structures
for several specific relevance metrics.