Full text (gzip-compressed postscript)
In this paper, we examine the spatial join problem. In particular, we focus on the case when neither of the inputs is indexed. We present a new algorithm, Scalable Sweep-based Spatial Join (SSSJ), that is based on the distribution-sweeping technique recently proposed in computational geometry, and that is the first to achieve theoretically optimal bounds on internal computation time as well as I/O transfers. We present experimental results based on an efficient implementation of the SSSJ algorithm, and compare it to the state-of-the-art Partition-Based Spatial-Merge (PBSM) algorithm of Patel and DeWitt.
Our SSSJ algorithm performs an initial sorting step along the vertical
axis, after which we use the distribution-sweeping technique to partition
the input into a number of vertical strips, such that the data in each
strip can be efficiently processed by an internal-memory sweepline
algorithm. A key observation that allowed us to greatly improve the
practical performance of our algorithm is that in most sweepline algorithms
not all input data is needed in main memory at the same time. In our
initial experiments, we observed that on real-life two-dimensional spatial
data sets of size N, the internal-memory sweepline algorithm requires
only
memory space. This behavior (also known as the
square-root rule in the VLSI literature) implies that for real-life
two-dimensional data sets, we can bypass the vertical partitioning step and
directly perform the sweepline algorithm after the initial external sorting
step. We implemented SSSJ such that partitioning is only done when it is
detected that that the sweepline algorithm exhausts the internal
memory. This results in an algorithm that not only is extremely efficient
for real-life data but also offers guaranteed worst-case bounds and
predictable behavior on skewed and/or bad input data: Our experiments show
that SSSJ performs at least
better than PBSM on real-life data sets,
and that it robustly handles skewed data on which PBSM suffers a serious
performance degeneration.
As part of our experimental work we experimented with a number of different
techniques for performing the internal sweepline. By using an efficient
partitioning heuristic, we were able to speed up the internal sweeping used
by PBSM by a factor of over 4 on the average for real-life data sets. The
resulting improved PBSM then performs approximately
better than SSSJ
on the real-life data we used, and it is thus a good choice of algorithm
when the data is known not to be too skewed.