"NeXSort: Sorting XML in External Memory
Speaker:Adam Silberstein
(03/24/2004)
Abstract
XML plays an important role in delivering data over the Internet, and
the need to store and manipulate XML in its native format has become
increasingly relevant. This growing need necessitates work on
developing native XML operators, especially for one as fundamental as
sort. In this paper we present NeXSort, an algorithm that leverages
the hierarchical nature of XML to efficiently sort an XML document in
external memory. In a fully sorted XML document, children of every
non-leaf element are ordered according to a given sorting criterion.
Among NeXSort's uses is in combination with structural merge as the
XML version of sort-merge join, which allows us to merge large XML
documents using only a single pass once they are sorted.
The hierarchical structure of an XML document limits the number of
possible legal orderings among its elements, which means that sorting
XML is fundamentally ``easier'' than sorting a flat file. We prove
that the I/O lower bound for sorting XML in external memory is
Θ(max{n, n logm(k/B)}), where n is the number of blocks
in the input XML document, m is the number of main memory blocks
available for sorting, B is the number of elements that can fit in
one block, and k is the maximum fan-out of the input document tree.
We show that NeXSort performs within a constant factor of this
theoretical lower bound. In practice we demonstrate, even with a
naive implementation, NeXSort significantly outperforms a regular
external merge sort of all elements by their key paths, unless the XML
document is nearly flat, in which case NeXSort degenerates
essentially to external merge sort.
Return to the SPIDER schedule
Jaidev Patwardhan
Last modified: Tue Jan 27 15:26:39 EST 2004