next up previous
Next: I/O-Efficient Algorithms for Contour-line Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: I/O-Efficient Algorithms and Environments

On Sorting Strings in External Memory

L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter.  ``On Sorting Strings in External Memory,'' Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC '97), El Paso, TX, May 1997, 540-548, and in ``Sequence Sorting in Secondary Storage," Proceedings of the 1997 International Conference on Compression and Complexity of Sequences (SEQUENCES '97), Positano, Italy, June 1997.

Full text (gzip-compressed postscript)

Full text (Adobe pdf format)

In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is $\Theta(K\log_2 K+N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is $\Theta(\frac{K}{B}\log_{M/B}
\frac{K}{B}+\frac{N}{B})$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is $\Theta(\frac{N_1}{B}\log_{M/B}
\frac{N_1}{B} + K_2\log_{M/B} K_2+\frac{N}{B})$, where N1 is the total length of all strings shorter than B and K2 is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.


next up previous
Next: I/O-Efficient Algorithms for Contour-line Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: I/O-Efficient Algorithms and Environments
Jeff Vitter
2008-04-02