Next: Overview of Trapeze and
Up: Network I/O with Trapeze
Previous: Abstract
Storage access is a driving application for high-speed LAN interconnects.
Over the next few years, new high-speed network standards -- primarily
Gigabit Ethernet -- will consolidate an order-of-magnitude gain in LAN
performance already achieved with specialized cluster interconnects such as
Myrinet and SCI. Combined with faster I/O bus standards, these networks
greatly expand the capacity of even inexpensive PCs to handle large amounts of
data for scalable computing, network services, multimedia and visualization.
These gains in communication speed
enable a new generation of network storage systems whose
performance tracks the rapid
advances in network technology rather than the slower rate of advances
in disk technology.
With gigabit-per-second networks, a fetch request
for a faulted page or file block can complete up to two orders of
magnitude faster from remote memory than from a local disk (assuming the
disk access incurs a seek). Moreover, a storage system built from disks distributed
through the network (e.g., attached to dedicated
servers [11,12,10], cooperating
peers [3,13], or the network
itself [8]) can be made
incrementally scalable, and can source and sink
data to and from individual clients at network speeds.
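As a rough illustration of the latency claim above, the following sketch compares the time to fetch one block from remote memory and from a local disk. All of the numbers are assumptions chosen for illustration (8 KB block, 1 Gb/s link, 50 microseconds of fixed network and software overhead, 10 ms of disk positioning time, 10 MB/s disk transfer rate); they are not measurements from Trapeze.

```python
def fetch_time_us(block_bytes, bandwidth_bits_per_s, fixed_overhead_us):
    """Time in microseconds to fetch one block: fixed cost plus transfer time."""
    transfer_us = block_bytes * 8 / bandwidth_bits_per_s * 1e6
    return fixed_overhead_us + transfer_us


# Assumed parameters (hypothetical, for illustration only).
BLOCK_BYTES = 8 * 1024            # one 8 KB page or file block
NET_BANDWIDTH = 1e9               # 1 Gb/s network link
NET_OVERHEAD_US = 50.0            # assumed fixed per-request network overhead
DISK_BANDWIDTH = 10e6 * 8         # 10 MB/s disk transfer rate, in bits/s
DISK_SEEK_US = 10_000.0           # assumed seek plus rotational delay (10 ms)

network_us = fetch_time_us(BLOCK_BYTES, NET_BANDWIDTH, NET_OVERHEAD_US)
disk_us = fetch_time_us(BLOCK_BYTES, DISK_BANDWIDTH, DISK_SEEK_US)
speedup = disk_us / network_us

if __name__ == "__main__":
    print(f"remote memory fetch: {network_us:.1f} us")
    print(f"local disk fetch:    {disk_us:.1f} us")
    print(f"speedup:             {speedup:.0f}x")
```

Under these assumed parameters the disk positioning time dominates, so the ratio approaches two orders of magnitude; a sequential disk access with no seek would narrow the gap considerably.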
The Trapeze project is an effort to harness the power of gigabit-per-second
networks to "cheat" the disk I/O bottleneck for I/O-intensive applications.
We use the network as the sole access path to
external storage, pushing all disk storage out into the network. This
network-centric approach to I/O views the client's file system and virtual
memory system as extensions of the network protocol stack.
The key elements of our approach are:
- Emphasis on communication performance. Our system
is based on custom Myrinet firmware and a lightweight kernel-to-kernel
messaging layer optimized for block I/O traffic. The firmware includes
features for zero-copy block movement, and uses
an adaptive message pipelining strategy to reduce block fetch latency
while delivering high bandwidth under load.
- Integration of network memory as an intermediate layer of the
storage hierarchy. The Trapeze project originated with communication
support for
a network memory service [6], which stresses
network performance by removing disks from the critical path of I/O.
We are investigating techniques to manage
network memory as a distributed, low-overhead, "smart" file buffer cache
between local memory and disks, to exploit its potential to
mask disk access latencies.
- Parallel block-oriented I/O storage.
We are developing a new scalable storage layer, called Slice, that
partitions file data and metadata across a collection of I/O
servers. While
the I/O nodes in our design
could be network storage appliances, we have chosen to
use generic PCs because they are cheap, fast, and programmable.
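To make the idea of partitioning file data across I/O servers concrete, here is a minimal sketch of block placement. It assumes simple round-robin striping of fixed-size blocks; this is an illustrative policy only and is not claimed to be the placement scheme Slice uses. The names `locate_block` and `split_read` are hypothetical.

```python
def locate_block(offset, block_size, num_servers):
    """Map a file byte offset to (server index, block index on that server).

    Assumes round-robin striping of fixed-size blocks (an illustrative policy).
    """
    block = offset // block_size
    return block % num_servers, block // num_servers


def split_read(offset, length, block_size, num_servers):
    """Group the blocks covered by a read into per-server lists of local blocks."""
    requests = {}
    if length <= 0:
        return requests
    first = offset // block_size
    last = (offset + length - 1) // block_size
    for block in range(first, last + 1):
        server, local = locate_block(block * block_size, block_size, num_servers)
        requests.setdefault(server, []).append(local)
    return requests


if __name__ == "__main__":
    # A 40 KB read starting at offset 0, with 8 KB blocks over 4 servers.
    print(split_read(0, 40 * 1024, 8 * 1024, 4))
```

With a placement of this kind, a large read touches several servers, so their transfers can proceed in parallel and the client can be fed at close to network speed.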
This paper is organized as follows. Section 2
gives a broad overview of the Trapeze project elements,
with a focus on the features relevant to high-speed block I/O.
Section 3 presents more detail on the adaptive
message pipelining scheme implemented in the Trapeze/Myrinet firmware.
Section 4 presents
some experimental results showing the network storage access
performance currently achievable with Slice and Trapeze. We conclude in
Section 5.
Jeff Chase
8/4/1999