Full text (gzip-compressed postscript)
In this paper we address for the first time the I/O complexity of the
problem of sorting strings in external memory, which is a fundamental
component of many large-scale text applications. In the standard
unit-cost RAM comparison model, the complexity of sorting K strings of
total length N is
.
By analogy, in the external
memory (or I/O) model, where the internal memory has size M and the
block transfer size is B, it would be natural to guess that the I/O
complexity of sorting strings is
,
but the known algorithms do not come even
close to achieving this bound. Our results show, somewhat
counterintuitively, that the I/O complexity of string sorting depends
upon the length of the strings relative to the block size. We first
consider a simple comparison I/O model, where one is not allowed to break
the strings into their characters, and we show that the I/O complexity of
string sorting in this model is
,
where N1 is the total
length of all strings shorter than B and K2 is the number of strings
longer than B. We then consider two more general I/O comparison models
in which string breaking is allowed. We obtain improved algorithms and in
several cases lower bounds that match their I/O bounds. Finally, we
develop more practical algorithms without assuming the comparison model.