ACM Computing Surveys 28A(4), December 1996, http://www.cs.duke.edu/~jsv/SDCR96-CG/VitterGeometry/. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.


Strategic Directions in Computing Research

Working Group on Computational Geometry

Communication Issues in Large-Scale Geometric Computation


Jeffrey Scott Vitter

Department of Computer Science, Duke University
Levine Science Research Center, Durham, NC 27705-0129, USA
jsv @ cs.duke.edu, http://www.cs.duke.edu/~jsv/



Abstract: Large-scale problems involving geometric data arise in numerous settings, and severe communication bottlenecks can arise in solving them. Work is needed in the development of I/O-efficient algorithms, as well as those that effectively utilize hierarchical memory. In order for new algorithms to be implemented efficiently in practice, the machines that they run on must support fundamental external-memory operations. We discuss several advantages offered by TPIE (Transparent Parallel I/O Programming Environment) to enable I/O-efficient implementations. See also the citation page [Vitter 1996] for this position statement.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management - Secondary Storage; D.4.4 [Operating Systems]: Communications Management - Input/Output; E.2 [Data Storage Representations]: Contiguous representations; F.2.2 [Analysis of Algorithms and Problem Complexity: Nonnumerical Algorithms and Problems - Computations on discrete structures, geometrical problems and computations; B.4.4 [Input/Output and Data Communications: Performance Analysis and Design Aids] - Formal models, Worst-case analysis;

General Terms: Algorithms, Design, Languages, Performance, Theory.

Additional Key Words and Phrases: computational geometry, I/O, external memory, secondary memory, communication, disk drive, parallel disks.



1. Issues in Large-Scale Geometric Computing

Large-scale problems involving geometric data are ubiquitous in spatial databases, geographic information systems (GIS), constraint logic programming, object-oriented databases, statistics, virtual reality systems, and computer graphics. NASA's soon-to-be petabyte-sized databases (1 million gigabytes) will require a variety of complex geometric queries to be handled.

Of particular interest to the geometry community are geographic information systems, which store, manipulate, and search through enormous amounts of spatial data. GIS are used for scientific applications such as environmental impact, wildlife repopulation, epidemiologic analysis, and earthquake studies and for commercial applications such as market analysis, utility facilities distribution, and mineral exploration. Geometric Datatypes and representations commonly used by GIS systems include points, lines, polygons, raster images, and quadtrees.

GIS are a rich source of important problems that require good use of external-memory techniques. Typical subproblems that need to be solved include range searching, nearest neighbors, generating contours from triangulated elevation data, and producing map overlays. As an illustration, the computation of new scenes or maps from existing information---also called map overlaying---is an important GIS operation. Some existing software packages are completely based on this operation. Given two thematic maps (maps with, e.g., indications of lakes, roads, pollution level), the problem is to compute a new map in which the thematic attributes of each location is a function of the thematic attributes of the corresponding locations in the two input maps. For example, the input maps could be a map of land utilization (farm, forest, house, lake), and a map of pollution levels. The map overlay operation could then be used to produce a new map of agricultural land where the degree of pollution is above a certain level. One of the main problems in map overlaying is ``line-breaking,'' which can be abstracted as the red-blue segment intersection problem: Given a set of red line segments and a set of blue line segments, report all of the intersections of red segments with blue segments. Luckily many problems on geometric objects can be reduced to a small number of base problems, such as computing intersections, convex hulls, or nearest neighbors.

2. I/O Model and Results

The primary feature of disks that we model is their extremely long access time relative to that of solid state random-access memory. In order to amortize this access time over a large amount of data, typical disks read or write large blocks of contiguous data at once. Our problems are modeled by the following parameters:

eqnarray15

where M < N and 0 < B < M / 2. Depending on the size of the data items, typical values for workstations and file servers in production today are on the order of M = 1,000,000 or 10,000,000 and B = 1000. Large-scale problem instances can be in the range N = 10,000,000,000 to N = 1,000,000,000,000.

In order to study the performance of external-memory algorithms, we use the standard notion of I/O complexity. We define an input/output operation (or simply I/O for short) to be the process of reading or writing a block of data to or from the disk. The I/O complexity of an algorithm is simply the number of I/Os it performs. For example, reading all of the input data requires N / B I/Os. We will use the term scanning to describe the fundamental primitive of reading (or writing) all items in a set stored contiguously on external storage by reading (or writing) the blocks of the set in a sequential manner.

For the problems we consider we define two additional parameters:

eqnarray33

Since each I/O can transmit B items simultaneously, it is convenient to introduce the following notation: n = N / B, k = K / B, t = T / B, m = M / B. We will say that an algorithm uses a linear number of I/O operations if it uses at most O(n) I/Os to solve a problem of size N.

An increasingly popular approach to further increase the throughput of I/O systems is to use a number of disks in parallel. The number D of disks range up to 100 in current disk arrays. One method of using D disks in parallel is disk striping, in which the heads of the disks are moved synchronously, so that in a single I/O operation each disk reads or writes a block in the same location as each of the others. In terms of performance, disk striping has the effect of using a single large disk with block size B' = DB. Even though disk striping does not in theory achieve asymptotic optimality when D is very large [Vitter and Shriver 1994], it is often the method of choice in practice for using parallel disks, especially when D is moderately sized [Vengroff Vitter 1996]. New methods have made the use of independent parallel disks faster than disk striping when used for external sorting [Barve et al 1996].

3. Results for GIS Subroutines

We have had initial success in developing a number of important techniques for solving large-scale GIS problems [Arge Vengroff Vitter 1995] [Arge Vitter 1996]. The area of external memory algorithms for large-scale geometric processing is a fascinating area with many interesting theoretical and practical problems that remain to be solved.

Early work on I/O algorithms concentrated on algorithms for sorting and permutation related problems. External sorting requires tex2html_wrap_inline118 I/Os, which is the external-memory equivalent of the well-known tex2html_wrap_inline122 time bound for sorting in internal memory. Work has also been done on matrix algebra and related problems arising in scientific computation [Vengroff Vitter 1996]. More recently, researchers have designed external-memory algorithms for a number of problems in different areas, such as in computational geometry and graph theoretic computation [Chiang et al 1995] [Goodrich et al 1995].

We can combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. In particular we use the distribution sweeping and batch filtering paradigms of [Goodrich et al 1995] and the buffer tree data structure [Arge 1995]. when designing algorithms for batch problems, in which all the data must be processsed. In addition we also develop a powerful new technique that can be regarded as a practical external-memory version of batched fractional cascading on an external-memory version of a segment tree. This enables us to improve on existing external-memory algorithms as well as to develop new algorithms that answer some of the open problems in [Goodrich et al 1995]. We also use new methods motivated by B-trees that give optimal query and insertion algorithms in dynamic settings Some of our results are summarized in Table 1. For all but the batched planar point location problem, no algorithms specifically designed for external memory were previously known. The batched planar point location algorithm that was previously known only works when the planar subdivision is monotone, and the problems of triangulating a simple polygon and reporting intersections between other than orthogonal line segments were stated as open problems in [Goodrich et al 1995].

For the sake of contrast, our results are also compared with modified internal-memory algorithms for the same problems. In most cases, these modified algorithms are plane-sweep algorithms modified to use B-tree-based dynamic data structures rather than binary tree-based dynamic data structures, following the example of a class of algorithms studied experimentally by Chiang. Such modifications lead to algorithms using tex2html_wrap_inline130 I/Os. For two of the algorithms the known optimal internal-memory algorithms are not plane-sweep algorithms and can therefore not be modified in this manner. It is difficult to analyze precisely how those algorithms perform in an I/O environment; however it is easy to realize that they use at least order N I/Os. The I/O bounds for algorithms based on B-trees have a logarithm of base B in the denominator rather than a logarithm of base m. But the most important difference between such algorithms and our results is the fact that the updates to the dynamic data structures are handled on an individual basis, which leads to an extra multiplicative factor of B in the I/O bound, which is very significant in practice.

   table66
Table: Summary of results.

As mentioned, the red-blue line segment intersection problem is of special interest because it is an abstraction of the important map-overlay problem, which is the core of several vector-based GISs. Although a time-optimal internal-memory algorithm for the general intersection problem exists, a number of simpler solutions have been presented for the red-blue problem. However, they are highly wasteful in terms of I/O, even when some standard optimizations are applied.

4. TPIE: Transparent Parallel I/O Programming Environment

In order for large-scale geometric algorithms to be implemented efficiently in practice, the machines that they run on must support fundamental external-memory operations. Unfortunately, existing file systems generally do not support the necessary semantics or provide useful tools. Examples of limitations of existing systems include virtual memory, low limits on the number of simultaneously open file descriptors, and the inability to manage streams of data on several independent disks as a single logical unit. We suggest that useful systems such as TPIE (Transparent Parallel I/O Programming Environment) should be encouraged for development and execution of large-scale geometric code [Vengroff Vitter 1996].

TPIE consists of a C++ object-oriented interface and lower-level support routines to manage main memory and parallel disk I/O{}. TPIE is intended to bridge the gap between the theory and the practice of parallel I/O systems, permitting for the first time the algorithms we develop to be implemented easily, efficiently, and portably. The goal of the TPIE system is to demonstrate that a parallel I/O system can do all of the following simultaneously:

Further information regarding issues of I/O efficiency can be found in the report of the Working Group on Storage I/O Issues in Large-Scale Computing.

References

[Arge 1995]
Arge, L., 1995. The Buffer Tree: A New Technique for Optimal I/O-Algorithms, Proceedings of Workshop on Algorithms and Data Structures, 1995, http://www.cs.duke.edu/~large/Papers/buffer.ps.

[Arge Vengroff Vitter 1995]
Arge, L., Vengroff, D. E., and Vitter, J. S., 1995. External-Memory Algorithms for Processing Line Segments in Geographic Information Systems, Proceedings of the 3rd Annual European Symposium on Algorithms, September 1995, file://cs.duke.edu/pub/jsv/Papers/AAV95.SegmentGIS.ps.gz.

[Arge Vitter 1996]
Arge, L., and Vitter, J. S., 1996. Optimal Interval Management in External Memory, Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, Burlington, VT, October 1996, file://cs.duke.edu/pub/jsv/Papers/AAV95.SegmentGIS.ps.gz. Also appeared in Abstracts of the First CGC Workshop on Computational Geometry, Center for Geometric Computing, Johns Hopkins University, Baltimore, MD, October 1996, file://cs.duke.edu/pub/jsv/Papers/ArV96.interval_management.ps.gz.

[Barve et al 1996]
Barve, R., Grove, E. F., and Vitter, J. S., 1996. Simple Randomized Mergesorting on Parallel Disks, special issue on parallel I/O in Parallel Computing, to appear, file://cs.duke.edu/pub/jsv/Papers/BGV96.Simple_Mergesort.ps.gz.

y

[Chiang et al 1995]
Chiang, Y.-J., Goodrich, M. T., Grove, E. F., Tamassia, R., Vengroff, D. E., and Vitter, J. S., 1995. External-Memory Graph Algorithms, Proceedings of the 5th Annual SIAM/ACM Symposium on Discrete Algorithms, San Francisco, January 1995, file://cs.duke.edu/pub/jsv/Papers/CGG95.external_graph.ps.gz.

[Goodrich et al 1995]
Goodrich, M. T., Tsay, J.-J., Vengroff, D. E., and Vitter, J. S., 1993. External-Memory Computational Geometry, Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science, Palo Alto, CA, November 1993, file://cs.duke.edu/pub/jsv/Papers/GTV93.ecg.ps.gz.

[Vengroff Vitter 1996]
Vengroff, D. E., and Vitter, J. S., 1996. I/O-Efficient Computation: The TPIE Approach, Proceedings of the Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, September 1996, published in NASA Conference Publication 3340, Volume II, 553-570, file://cs.duke.edu/pub/jsv/Papers/VeV96.TPIE.ps.gz.

[Vitter 1996]
Vitter, J. S., 1996. Communication Issues in Large-Scale Geometric Computation, Strategic Directions in Computing Research: Working Group on Computational Geometry, ACM Computing Surveys, 28A(4), December 1996, http://www.cs.duke.edu/~jsv/SDCR96-CG/VitterGeometry/.

[Vitter and Shriver 1994]
Vitter, J. S., and Shriver, E. A. M. 1994. Optimal Algorithms for Parallel Memory, I: Two-Level Memories, Algorithmica, 12(2-3), 1994, file://cs.duke.edu/pub/jsv/Papers/ViS94.sorting_io.ps.gz.


Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for copmonents of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.


Last modified: Tue Nov 6 09:28:54 EST 2001
Jeffrey S. Vitter <jsv@cs.duke.edu>