ACM Computing Surveys 28A(4), December 1996, http://www.cs.duke.edu/~jsv/SDCR96-CG/VitterGeometry/. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.
Abstract: Large-scale problems involving geometric data arise in numerous settings, and severe communication bottlenecks can arise in solving them. Work is needed in the development of I/O-efficient algorithms, as well as those that effectively utilize hierarchical memory. In order for new algorithms to be implemented efficiently in practice, the machines that they run on must support fundamental external-memory operations. We discuss several advantages offered by TPIE (Transparent Parallel I/O Programming Environment) to enable I/O-efficient implementations. See also the citation page [Vitter 1996] for this position statement.Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management - Secondary Storage; D.4.4 [Operating Systems]: Communications Management - Input/Output; E.2 [Data Storage Representations]: Contiguous representations; F.2.2 [Analysis of Algorithms and Problem Complexity: Nonnumerical Algorithms and Problems - Computations on discrete structures, geometrical problems and computations; B.4.4 [Input/Output and Data Communications: Performance Analysis and Design Aids] - Formal models, Worst-case analysis;
General Terms: Algorithms, Design, Languages, Performance, Theory.
Additional Key Words and Phrases: computational geometry, I/O, external memory, secondary memory, communication, disk drive, parallel disks.
Large-scale problems involving geometric data are ubiquitous in spatial databases, geographic information systems (GIS), constraint logic programming, object-oriented databases, statistics, virtual reality systems, and computer graphics. NASA's soon-to-be petabyte-sized databases (1 million gigabytes) will require a variety of complex geometric queries to be handled.
Of particular interest to the geometry community are geographic information systems, which store, manipulate, and search through enormous amounts of spatial data. GIS are used for scientific applications such as environmental impact, wildlife repopulation, epidemiologic analysis, and earthquake studies and for commercial applications such as market analysis, utility facilities distribution, and mineral exploration. Geometric Datatypes and representations commonly used by GIS systems include points, lines, polygons, raster images, and quadtrees.
GIS are a rich source of important problems that require good use of external-memory techniques. Typical subproblems that need to be solved include range searching, nearest neighbors, generating contours from triangulated elevation data, and producing map overlays. As an illustration, the computation of new scenes or maps from existing information---also called map overlaying---is an important GIS operation. Some existing software packages are completely based on this operation. Given two thematic maps (maps with, e.g., indications of lakes, roads, pollution level), the problem is to compute a new map in which the thematic attributes of each location is a function of the thematic attributes of the corresponding locations in the two input maps. For example, the input maps could be a map of land utilization (farm, forest, house, lake), and a map of pollution levels. The map overlay operation could then be used to produce a new map of agricultural land where the degree of pollution is above a certain level. One of the main problems in map overlaying is ``line-breaking,'' which can be abstracted as the red-blue segment intersection problem: Given a set of red line segments and a set of blue line segments, report all of the intersections of red segments with blue segments. Luckily many problems on geometric objects can be reduced to a small number of base problems, such as computing intersections, convex hulls, or nearest neighbors.
The primary feature of disks that we model is their extremely long access time relative to that of solid state random-access memory. In order to amortize this access time over a large amount of data, typical disks read or write large blocks of contiguous data at once. Our problems are modeled by the following parameters:
where M < N and 0 < B < M / 2. Depending on the size of the data items, typical values for workstations and file servers in production today are on the order of M = 1,000,000 or 10,000,000 and B = 1000. Large-scale problem instances can be in the range N = 10,000,000,000 to N = 1,000,000,000,000.
In order to study the performance of external-memory algorithms, we use the standard notion of I/O complexity. We define an input/output operation (or simply I/O for short) to be the process of reading or writing a block of data to or from the disk. The I/O complexity of an algorithm is simply the number of I/Os it performs. For example, reading all of the input data requires N / B I/Os. We will use the term scanning to describe the fundamental primitive of reading (or writing) all items in a set stored contiguously on external storage by reading (or writing) the blocks of the set in a sequential manner.
For the problems we consider we define two additional parameters:
Since each I/O can transmit B items simultaneously, it is convenient to introduce the following notation: n = N / B, k = K / B, t = T / B, m = M / B. We will say that an algorithm uses a linear number of I/O operations if it uses at most O(n) I/Os to solve a problem of size N.
An increasingly popular approach to further increase the throughput of I/O systems is to use a number of disks in parallel. The number D of disks range up to 100 in current disk arrays. One method of using D disks in parallel is disk striping, in which the heads of the disks are moved synchronously, so that in a single I/O operation each disk reads or writes a block in the same location as each of the others. In terms of performance, disk striping has the effect of using a single large disk with block size B' = DB. Even though disk striping does not in theory achieve asymptotic optimality when D is very large [Vitter and Shriver 1994], it is often the method of choice in practice for using parallel disks, especially when D is moderately sized [Vengroff Vitter 1996]. New methods have made the use of independent parallel disks faster than disk striping when used for external sorting [Barve et al 1996].
We have had initial success in developing a number of important techniques for solving large-scale GIS problems [Arge Vengroff Vitter 1995] [Arge Vitter 1996]. The area of external memory algorithms for large-scale geometric processing is a fascinating area with many interesting theoretical and practical problems that remain to be solved.
Early work on I/O algorithms concentrated on algorithms for sorting and
permutation related problems.
External sorting requires
I/Os, which is the
external-memory equivalent of the well-known
time
bound for sorting in internal memory. Work has also been done on matrix
algebra and related problems arising in scientific
computation [Vengroff Vitter 1996].
More recently, researchers have designed external-memory algorithms for a
number of problems in different areas, such as in computational
geometry and graph theoretic
computation
[Chiang et al 1995]
[Goodrich et al 1995].
We can combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. In particular we use the distribution sweeping and batch filtering paradigms of [Goodrich et al 1995] and the buffer tree data structure [Arge 1995]. when designing algorithms for batch problems, in which all the data must be processsed. In addition we also develop a powerful new technique that can be regarded as a practical external-memory version of batched fractional cascading on an external-memory version of a segment tree. This enables us to improve on existing external-memory algorithms as well as to develop new algorithms that answer some of the open problems in [Goodrich et al 1995]. We also use new methods motivated by B-trees that give optimal query and insertion algorithms in dynamic settings Some of our results are summarized in Table 1. For all but the batched planar point location problem, no algorithms specifically designed for external memory were previously known. The batched planar point location algorithm that was previously known only works when the planar subdivision is monotone, and the problems of triangulating a simple polygon and reporting intersections between other than orthogonal line segments were stated as open problems in [Goodrich et al 1995].
For the sake of contrast, our results are also compared with modified
internal-memory algorithms for the same problems. In most cases, these
modified algorithms are plane-sweep algorithms modified to use
B-tree-based dynamic data structures rather than binary tree-based
dynamic data structures, following the example of a class of algorithms
studied experimentally by Chiang. Such modifications lead to algorithms
using
I/Os. For two of the algorithms the known optimal
internal-memory algorithms are not plane-sweep algorithms and can
therefore not be modified in this manner. It is difficult to analyze
precisely how those algorithms perform in an I/O environment; however it
is easy to realize that they use at least order N I/Os. The I/O
bounds for algorithms based on B-trees have a logarithm of base B in
the denominator rather than a logarithm of base m. But the most
important difference between such algorithms and our results is the fact
that the updates to the dynamic data structures are handled on an
individual basis, which leads to an extra multiplicative factor of B
in the I/O bound, which is very significant in practice.
As mentioned, the red-blue line segment intersection problem is of special interest because it is an abstraction of the important map-overlay problem, which is the core of several vector-based GISs. Although a time-optimal internal-memory algorithm for the general intersection problem exists, a number of simpler solutions have been presented for the red-blue problem. However, they are highly wasteful in terms of I/O, even when some standard optimizations are applied.
In order for large-scale geometric algorithms to be implemented efficiently in practice, the machines that they run on must support fundamental external-memory operations. Unfortunately, existing file systems generally do not support the necessary semantics or provide useful tools. Examples of limitations of existing systems include virtual memory, low limits on the number of simultaneously open file descriptors, and the inability to manage streams of data on several independent disks as a single logical unit. We suggest that useful systems such as TPIE (Transparent Parallel I/O Programming Environment) should be encouraged for development and execution of large-scale geometric code [Vengroff Vitter 1996].
TPIE consists of a C++ object-oriented interface and lower-level support routines to manage main memory and parallel disk I/O{}. TPIE is intended to bridge the gap between the theory and the practice of parallel I/O systems, permitting for the first time the algorithms we develop to be implemented easily, efficiently, and portably. The goal of the TPIE system is to demonstrate that a parallel I/O system can do all of the following simultaneously:
y