ACM Computing Surveys 28A(4), December 1996, http://www.cs.duke.edu/~jsv/SDCR96-IO/VengroffVitterIO/. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.
Darren Erik
Vengroff
Department of Electrical
Engineering, University of Delaware
Evans Hall, Newark, DE 19716, USA
vengroff@ee.udel.edu,
http://www.ee.udel.edu/~vengroff/
Jeffrey Scott
Vitter
Abstract: There has recently been much productive work in the algorithms community on techniques for efficient use of external memory in large-scale applications. In order for I/O-optimal algorithms to be implemented efficiently, the machines that they run on must support fundamental external-memory operations. Unfortunately, existing file systems generally do not support the necessary semantics or provide useful tools. There are three basic approaches to supporting development of I/O-efficient code, which we call array-oriented systems (such as PASSION and ViC*), access-oriented systems (such as the UNIX file system and Panda), and framework-oriented systems (such as TPIE, a Transparent Parallel I/O Programming Environment). In this position statement, we discuss the advantages and potential of the TPIE approach in enabling I/O-efficient computation. See also the citation page [Vengroff Vitter 1996a] of this position statement.Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management - Secondary Storage; D.4.4 [Operating Systems]: Communications Management - Input/Output; E.2 [Data Storage Representations]: Contiguous representations; E.5 [Files]: Sorting/searching; F.2.1 [Analysis of Algorithms and Problem Complexity: Numerical Algorithms and Problems - Computations on matrices; F.2.2 [Analysis of Algorithms and Problem Complexity: Nonnumerical Algorithms and Problems - Computations on discrete structures, geometrical problems and computations, sorting and searching; B.4.4 [Input/Output and Data Communications: Performance Analysis and Design Aids] - Formal models, Worst-case analysis;
General Terms: Algorithms, Design, Languages, Performance, Theory.
Additional Key Words and Phrases: I/O, external memory, secondary memory, communication, disk drive, parallel disks, sorting.
The bottleneck in many large-scale data applications that arise in databases, GIS systems, scientific computations, and multimedia is the time for the input/output (I/O) between internal memory and the disks~ sorting searching, vitter lindstrom.]. This bottleneck is accentuated as processors get faster and parallel computers are used. The primary feature of disks that we model is their extremely long access time relative to that of solid state random-access memory. In order to amortize this access time over a large amount of data, typical disks read or write large blocks of contiguous data at once.
A simple and convenient model for effective algorithm design can be specified by the two parameters of interest: the block size B, which is the number of contiguous elements from disk that are input or output as a unit, and the number D of parallel disks. We typically denote the problem size as N and the size of the internal memory as M. Depending on the size of the data items, typical values for workstations and file servers in production today are on the order of M = 1,000,000 or 10,000,000 and B = 1000. Large-scale problem instances can easily be in the range N = 10,000,000,000 to N = 1,000,000,000,000.
In order to study the performance of external-memory algorithms, we use the standard notion of I/O complexity. We define an input/output operation (or simply I/O for short) to be the process of reading or writing a block of data to or from the disk. The I/O complexity of an algorithm is simply the number of I/Os it performs. For example, reading all of the input data requires N / B I/Os.
An increasingly popular approach to further increase the throughput of I/O systems is to use a number of disks in parallel. The number D of disks range up to 100 in current disk arrays. One method of using D disks in parallel is disk striping, in which the heads of the disks are moved synchronously, so that in a single I/O operation each disk reads or writes a block in the same location as each of the others. In terms of performance, disk striping has the effect of using a single large disk with block size B' = DB. Even though disk striping does not in theory achieve asymptotic optimality when D is very large [Vitter and Shriver 1994], it is often the method of choice in practice for using a small number of parallel disks [Vengroff Vitter 1996b]. However, new methods have made the use of independent parallel disks faster than disk striping when used for external sorting [Barve et al 1996].
We have initial successes in developing a number of important techniques for solving large-scale problems, some of which can be found in the references given in the following references [Arge 1995] [Arge Vengroff Vitter 1995] [Arge Vitter 1996] [Barve et al 1996] [Chiang et al 1995] [Goodrich et al 1995] [Vengroff Vitter 1996b] [Vitter and Shriver 1994]. The area of external memory algorithms for is a fascinating area with many interesting theoretical and practical problems that remain to be solved. The Center for Discrete Mathematics and Theoretical Computer Science is hosting a special year on Massive Data Sets in 1997-1998.
In order for I/O-optimal algorithms to be implemented efficiently, the machines that they run on must support fundamental external-memory operations. Unfortunately, existing file systems generally do not support the necessary semantics or provide useful tools. Examples of limitations of existing systems include low limits on the number of simultaneously open file descriptors and the inability to manage streams of data on several independent disks as a single logical unit.
There are three basic approaches to supporting development of I/O-efficient code, which we call array-oriented systems (such as PASSION and ViC*), access-oriented systems (such as the UNIX file system and Panda), and framework-oriented systems (such as TPIE, a Transparent Parallel I/O Programming Environment). The first two approaches are discussed in the working group report and in other participants' position statements. In this position statement we instead discuss the advantages and potential of the TPIE approach [Vengroff Vitter 1996b].
TPIE consists of a C++ object-oriented interface and lower-level support routines to manage main memory and parallel disk I/O. TPIE is intended to bridge the gap between the theory and the practice of parallel I/O systems, permitting for the first time the algorithms we develop to be implemented easily, efficiently, and portably. The goal of the TPIE system is to demonstrate that a parallel I/O system can do all of the following simultaneously:
A number of important algorithms, such as sorting, matrix multiplication, and list ranking, have already been implemented using TPIE, and preliminary performance results are encouraging. An important line of research is to continue the development of efficient external memory algorithms, especially for pressing problems in geographical information systems (GIS), databases, and scientific processing, and to use systems like TPIE and others in order to bring improved algorithms into practice.
The TPIE system itself consists of three components: a block transfer engine (BTE), a memory manager (MM), and an access method interface (AMI). The BTE handles block transfer for a single processor. The MM performs low-level memory management across all the processors in the system. The AMI works on top of the MM and one or more BTEs, each running on a single processor, to provide a uniform interface for application programs. Applications that use this interface will be portable across hardware platforms, since they never have to deal with the underlying details of how I/O is performed on a particular machine. The high-level paradigms to be supported by TPIE include scanning, searching, selection, distribution, merging, sorting, permuting, distribution sweeping, batch filtering, dynamic block allocation, dynamic memory allocation, and hierarchical control. Using the AMI interface, programmers will be able to write a variety of applications in very little time.
Possible future directions of TPIE include interface layers that will allow existing parallel programming tools to be used in conjunction with TPIE in order to run parallel algorithms in external memory. Examples could include the NESL nested data parallel language and one or more data parallel languages such as C* and high-performance FORTRAN using translation and optimization techniques being developed by Cormen. Another extension is optimization for hierarchical memories, using the UMH model, in which all levels of the hierarchy can be active simultaneously, and programming the communication explicitly can be a rather difficult job of choreography. We have found in practice that level-2 cache/main memory interactions can be just as important as main memory/disk interactions once the latter have been optimized.
The performance of the algorithms we have implemented thus far has been quite good. Comparing I/O-efficient algorithms implemented in TPIE with the standard RAM algorithms running in a virtual memory environment, we have found that TPIE performance can be better by more than an order of magnitude for realistically sized problems. In some cases, for example the standard algorithm for matrix multiplication, TPIE can perform the computation using a small amount of main memory and effectively compete with a completely in-core algorithm (in which the main memory is large enough to store all the data and do the computation). For list ranking, the RAM algorithm linearly traverses a linked list, where the nodes are stored in random locations, producing a page fault at essentially every step. Despite the overhead involved in sorting the data each time we simulate a step used in the PRAM simulation, we can get improvements of up to a factor of 40 in practice.
In the long run, we hope that the development of systems that facilitate the development of efficient external algorithms will influence the design not only of high-performance file systems, but also of the underlying architecture of the I/O systems they are designed to run on.