next up previous
Next: DATA COMPRESSION Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: Algorithms and Data Structures

   
On Searching Compressed String Collections Cache-Obliviously

P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. ``On Searching Compressed String Collections Cache-Obliviously,'' Proceedings of the 27th Annual ACM Symposium on Principles of Database Systems (PODS '08), Vancouver, Canada, June 2008.

Full text (Adobe pdf format)

Current data structures for searching large string collections are limited in that they either fail to achieve minimum space or they cause too many cache misses. In this paper, we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and storable in compressed space. The widely known frontcoding scheme is one example of linearization; it is at the core of Prefix B-trees and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support.

In this paper, we introduce new insights on front-coding and other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. The second contribution of this paper engineers these linearizations to design a novel dictionary encoding scheme that achieves nearly optimal space, oŽers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries and obtain a succinct variant, whose space is close to the information-theoretic minimum.


next up previous
Next: DATA COMPRESSION Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: Algorithms and Data Structures
Jeff Vitter
2009-05-24