next up previous
Next: Competitive Analysis of Buffer Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: Theory and Practice of

   
Scalable Sweep-Based Spatial Join

L. A. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter. ``Scalable Sweep-Based Spatial Join,'' Proceedings of the 1998 International Conference on Very Large Databases (VLDB '98), New York, August 1998, 570-581. This work forms the basis of two patent applications currently filed and pending with the patent office.

Full text (gzip-compressed postscript)

Full text (Adobe pdf format)

In this paper, we examine the spatial join problem. In particular, we focus on the case when neither of the inputs is indexed. We present a new algorithm, Scalable Sweep-based Spatial Join (SSSJ), that is based on the distribution-sweeping technique recently proposed in computational geometry, and that is the first to achieve theoretically optimal bounds on internal computation time as well as I/O transfers. We present experimental results based on an efficient implementation of the SSSJ algorithm, and compare it to the state-of-the-art Partition-Based Spatial-Merge (PBSM) algorithm of Patel and DeWitt.

Our SSSJ algorithm performs an initial sorting step along the vertical axis, after which we use the distribution-sweeping technique to partition the input into a number of vertical strips, such that the data in each strip can be efficiently processed by an internal-memory sweepline algorithm. A key observation that allowed us to greatly improve the practical performance of our algorithm is that in most sweepline algorithms not all input data is needed in main memory at the same time. In our initial experiments, we observed that on real-life two-dimensional spatial data sets of size N, the internal-memory sweepline algorithm requires only $O(\sqrt{N})$ memory space. This behavior (also known as the square-root rule in the VLSI literature) implies that for real-life two-dimensional data sets, we can bypass the vertical partitioning step and directly perform the sweepline algorithm after the initial external sorting step. We implemented SSSJ such that partitioning is only done when it is detected that that the sweepline algorithm exhausts the internal memory. This results in an algorithm that not only is extremely efficient for real-life data but also offers guaranteed worst-case bounds and predictable behavior on skewed and/or bad input data: Our experiments show that SSSJ performs at least $25\%$ better than PBSM on real-life data sets, and that it robustly handles skewed data on which PBSM suffers a serious performance degeneration.

As part of our experimental work we experimented with a number of different techniques for performing the internal sweepline. By using an efficient partitioning heuristic, we were able to speed up the internal sweeping used by PBSM by a factor of over 4 on the average for real-life data sets. The resulting improved PBSM then performs approximately $10\%$ better than SSSJ on the real-life data we used, and it is thus a good choice of algorithm when the data is known not to be too skewed.


next up previous
Next: Competitive Analysis of Buffer Up: EXTERNAL MEMORY ALGORITHMS, I/O Previous: Theory and Practice of
Jeff Vitter
2008-04-02