INTERNET-DRAFT                                          C. Sapuntzakis
                                                         Cisco Systems
                                                            A. Romanow
                                                         Cisco Systems

                                                              J. Chase
                                                       Duke University

draft-csapuntz-caserdma-00.txt                            December 2000


                            The Case for RDMA 


Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright Notice
Copyright (C) Cisco Systems (2000). All Rights Reserved.


Abstract

The end-to-end performance of IP networks for bulk data transfer is
often limited by data copying overhead in the end systems.  Even when
end systems can sustain the bandwidth of high-speed networks, copying
overheads often limit their ability to carry out other processing
tasks.

Remote Direct Memory Access (RDMA) is a facility for avoiding copying
for network communication in a general and comprehensive way.  RDMA is
particularly useful for protocols that transmit bulk data mixed with
control information, such as NFS, CIFS, HTTP, or encapsulated device
protocols such as iSCSI.  While networking architectures such as the
Virtual Interface (VI) architecture support RDMA, there is no standard
for RDMA over IP networks.  Such a standard would allow vendors of
IP-aware network hardware (such as TCP-capable network adapters) to
incorporate support for RDMA into their products.

This document reviews the I/O performance issues addressed by RDMA,
and considers issues for supporting the key elements of RDMA in an IP
networking context.

Glossary

header/payload splitting - any technique that enables a NIC to deposit
    incoming protocol headers and payloads into separate host buffers

headers - control information used by the protocol

HBA - host bus adapter, a network adapter (see NIC)

I/O operation - a request to a device, then a transfer to/from that device,
    and a status response

MTU - maximum transmission unit, the largest packet size that a given
    network device or path can carry

NIC - network interface card/controller (see HBA)

payload - in general, uninterpreted data transported by a protocol

payload steering - any technique that enables a NIC to deposit an
    incoming protocol payload into a buffer designated for that
    specific payload

protocol stack - the layers of software, firmware, or hardware
   that implement communication between applications across a network

region and region identifier (RID) - a memory buffer region reserved and
   registered for use with RDMA requests, and its unique identifier

solicited data - data that was sent in response to some control
   message

unsolicited data - data that was sent without being requested

upper-layer protocol (ULP) - an application-layer protocol like
  NFS, CIFS, HTTP, or iSCSI

1. Introduction

The principal use of the Internet and IP networks today is for
buffer-to-buffer transfers, often in the form of file or block
transfers. Today, this is done using a variety of protocols: HTTP,
FTP, NFS, and CIFS. Soon, iSCSI will be added to this list.

These upper-layer protocols (ULPs) all have one thing in common: the
majority of the bytes they send on the network are data "payloads"
that are uninterpreted by the protocol or the network. 

Each ULP has different ways of requesting and initiating data
transfers. They differ in the kinds of control information or
meta-data (e.g. cache coherence info) they specify and send across the
wire. However, all these protocols eventually come down to
transporting large blocks of uninterpreted data from a local buffer to
a remote buffer.  Transferring a payload from one host to another is
similar to a buffer-to-buffer data transfer (like the C memcpy
function) over the network. For example, one use of HTTP is to
transfer JPEG format graphic images from a web server to a web
browser's address space.

Today, gigabit speed buffer-to-buffer network transfers are chewing up
significant memory bandwidth and CPU time on the receivers.  With the
advent of IP checksum hardware, the end-system overhead for network
transfers is dominated by costs of copying in order to place incoming
data correctly in the receiver's memory buffer.  Although CPUs are
rapidly becoming more powerful, advances in network bandwidths have
also kept pace with and even exceeded Moore's Law in recent years.
Moreover, copying is limited by memory system performance, which is
not improving as fast as CPU speeds.

One solution to this problem is to place the data in the correct
memory buffer directly as it arrives from the network, avoiding the
need to copy it into the correct buffer after it has arrived. If the
network interface (NIC) could place data correctly in memory, this
would free up the memory bandwidth and CPU cycles consumed by copying.

A number of mechanisms already exist to reduce copying overhead in the
IP stack.  Some of these mechanisms depend on fragile assumptions about
the hardware and application buffers, others involve ad hoc support
for specific protocols and communication scenarios, and all of them
impose other costs that may be prohibitive in some scenarios.

However, a mechanism called Remote Direct Memory Access (RDMA) offers
a solution that is simple, general, complete, and robust.  RDMA
introduces new control information into the communication stream that
directs data movement for buffer-to-buffer transfers.  Incorporating
support for RDMA into network protocols can significantly reduce the
cost of network buffer-to-buffer transfers.

RDMA accomplishes exact data placement via a generalized abstraction
at the boundary between the ULP and its transport (e.g., TCP),
allowing an RDMA-capable NIC to recognize and steer payloads
independently of the specific ULP.  Using RDMA, ULPs
gain efficient data placement without the need to program ULP-specific
details into the NIC.  Thus RDMA speeds deployment of new protocols by
not requiring the firmware or hardware on the NIC to be rewritten to
accelerate each new protocol.

To be effective, the receiving NIC must recognize the RDMA control
information, and ULP implementations or applications must be modified
to generate the RDMA control information.  In addition, support for
framing in the transport protocols would allow an RDMA-capable NIC to
locate RDMA control information in the stream in the case where
packets arrive out of order.
 
Historically, network protocols and implementations have addressed the
issue of demultiplexing multiple streams arriving at an interface.
However, there are still no accepted solutions to demultiplex control
and data arriving on a single stream.  Much current network traffic is
characterized by a small amount of control with a large amount of
data.  RDMA enables efficient data payload steering for this common
case, which is especially important as data rates increase.

This document is somewhat tutorial in seeking to set out clearly the
I/O performance issues addressed by RDMA, and the design alternatives
for an RDMA facility. It considers proposed approaches for solving the
problems, clarifying the benefits and costs of deploying and using an
RDMA approach.

The document is organized as follows.  Section 2 describes the copy
overhead problem in detail.  Section 3 discusses various alternatives
to a general RDMA facility.  Section 4 describes the RDMA approach in
detail, including unsolicited data (Section 4.2).  Section 5 discusses
RDMA APIs, and Section 6 considers RDMA implementation issues.


2. The I/O Performance Problem

Figure 1 shows a block diagram illustrating the layers involved in
transferring data in and out of a host system. We will call these
layers the network I/O stack. Each boundary in the diagram corresponds
to an I/O interface.  In general, we assume that all the modules
represented in Figure 1 (except for the NIC) run on the host CPU,
although RDMA is equally useful if portions of the I/O stack run on
the NIC.


	|-----------------------|

	  Application

	|-----------+-----------| 
          File      |
          System    | Block 
          Interface | Interface
	|-----------+-----------|
	Upper-Layer Protocol Stack                  
         (NFS, CIFS, SCSI/iSCSI, 
		HTTP)
	|-----------------------|  

	  Network Stack (IP, TCP)

	|-----------------------|

	  NIC

	|-----------------------|

	  Figure 1: The network I/O stack


In IP networks, end system CPUs may incur substantial overhead from
copying data in memory as part of I/O operations. Copying is
necessary in order to align data, place data contiguously in memory, or
place data in specific buffers supplied by the application or ULP module.
These may be important to applications for several reasons.

Alignment is important because most CPU architectures impose alignment
constraints on data accessed in units larger than a byte, e.g., for
incoming data interpreted as integers.

Contiguity of data in memory simplifies the book-keeping data
structures that describe the data and improves memory utilization by
reducing fragmentation of free space. Data contiguity may simplify
algorithms that traverse the data, reducing execution time. For
example, data contiguity enables sequential memory access.

Common network APIs such as sockets [Stevens] allow applications to
designate specific buffers for incoming data, requiring a copy to
place the incoming data correctly.  It may be possible to avoid the
copy by page remapping (see Section 3.2), but only if the data is
contiguous, occupies complete memory pages, and is page-aligned
relative to the application's buffer.  Similarly, storage protocols
such as NFS and iSCSI may require contiguous, page-aligned data for
buffering in the system I/O cache.
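
The page-remapping conditions above can be sketched as a simple
predicate.  This is a minimal illustration, not a real kernel
interface: the 4096-byte page size, the function name can_page_remap,
and its parameters are all assumptions made for the example, and the
alignment test is simplified to require that both addresses start on
a page boundary.

```python
PAGE_SIZE = 4096  # assumed page size; real systems vary


def can_page_remap(payload_addr: int, app_buf_addr: int, length: int) -> bool:
    """Return True if a received payload could be remapped instead of copied.

    Hypothetical check: the payload and the application buffer must both
    start on a page boundary, and the payload must fill whole pages.
    """
    if length <= 0:
        return False
    return (
        payload_addr % PAGE_SIZE == 0
        and app_buf_addr % PAGE_SIZE == 0
        and length % PAGE_SIZE == 0
    )
```

A payload that follows a packet header in the same buffer, or that is
only one Ethernet frame long, fails this test and must be copied.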

This document concentrates on how to eliminate unnecessary data copies
used to assure correct placement of incoming data.

Some have argued that the expense of these data copies can be partly
masked if some other data scanning operation, such as checksumming or
decryption, runs over the data simultaneously (see [ALF]). However,
such optimizations are highly processor-dependent and may not yield
the expected benefits [Chase]. Moreover, this approach is not useful
unless other data scanning operations are handled in software;
hardware support for checksumming and decryption is increasingly
common.

In recent years, valuable progress has been made in minimizing
other sources of networking overhead.  Examples include checksum
offloading, extended Ethernet frames, and interrupt suppression.  For a
review and evaluation of various solutions see [Chase]. These issues
are not discussed in this document.

2.1 Copy on receive

The primary issue addressed here is how application data is received
from the network.  In many I/O interfaces, when an application reads
data, the application specifies the buffer into which it will receive
data.  But, today's generic NICs are incapable of placing data
directly into the supplied buffer. This limitation is largely because
such direct placement of data requires more complexity and intelligence
than provided in generic NICs.  For example, to accomplish this task,
NICs would need to separate payloads from ULP and transport headers, parse
headers, and demultiplex multiple incoming packet streams.

Most NICs today are not this sophisticated in their handling of
incoming data streams.  Instead, they deposit incoming packets into
generic host buffers supplied by the network stack software.  Both the
network and ULP stacks sift through the packets, looking successively
at headers from the link layer (e.g., Ethernet), IP, transport, and
ULP. Eventually, the data payload is recognized and copied from the
network buffers to the correct application buffer.
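
The conventional receive path described above can be sketched as
follows.  This is a simplified model, assuming that every packet
carries headers of one fixed, known length (real headers vary); the
function name receive_with_copy is hypothetical.  The point is that
every payload byte is touched once more by the CPU after the NIC has
already written it to memory.

```python
def receive_with_copy(packets, header_len, app_buffer):
    """Copy each packet's payload into app_buffer; return bytes copied.

    packets:    iterable of bytes objects as deposited by the NIC
    header_len: assumed fixed size of all headers preceding the payload
    app_buffer: bytearray supplied by the application
    """
    offset = 0
    for pkt in packets:
        payload = memoryview(pkt)[header_len:]
        if offset + len(payload) > len(app_buffer):
            raise ValueError("application buffer too small")
        # This slice assignment is the copy that RDMA aims to eliminate.
        app_buffer[offset:offset + len(payload)] = payload
        offset += len(payload)
    return offset
```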


2.2 Copy on transmit

For the most part, sending data from applications to the network
should not require copies in the I/O stack. Today's network adapters
can gather data from anywhere in memory to form a packet, so no copy
is necessary to align outgoing packet data for the NIC.

Copying can be used as a technique to ensure that the data is not
modified between the time it is passed from the application to the I/O
interface, and the time that the data transfer completes. Other
well-known solutions exist that do not involve copying [Brustoloni].

Copy on transmit will not be discussed further.


3. Non-RDMA solutions
 
There is a range of solutions, many of them ad hoc, that avoid copying
incoming data without requiring RDMA.  These include:

	- scatter-gather buffers
	- ad hoc header/payload separation
	- explicit header/payload separation
	- terminating the ULP in the NIC

3.1 Scatter-gather buffers

Once the NIC has written the application data to memory, a copy can be
avoided if we tell the application where to find its data in memory.
The application data may be scattered in memory as it may have arrived
in multiple packets.  A data structure called a scatter-gather buffer
is used to tell the application the location of the data.
Scatter-gather buffering is the only known copy avoidance technique
that does not require direct support on the NIC.
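
A scatter-gather buffer can be pictured as an ordered list of
(buffer, offset, length) descriptors.  The sketch below is only an
illustration of the data structure; the class and method names are
hypothetical and are not taken from any existing I/O interface.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Segment:
    buffer: bytes  # network buffer holding one received packet
    offset: int    # where the payload starts inside that buffer
    length: int    # payload length in bytes


class ScatterGatherBuffer:
    """Ordered list of payload locations; the payload is never moved."""

    def __init__(self) -> None:
        self.segments: List[Segment] = []

    def add(self, buffer: bytes, offset: int, length: int) -> None:
        if offset < 0 or length < 0 or offset + length > len(buffer):
            raise ValueError("segment lies outside its buffer")
        self.segments.append(Segment(buffer, offset, length))

    def total_length(self) -> int:
        return sum(seg.length for seg in self.segments)

    def views(self) -> Iterator[memoryview]:
        # Each view refers to the original packet buffer without copying.
        for seg in self.segments:
            yield memoryview(seg.buffer)[seg.offset:seg.offset + seg.length]
```

An application that consumes the data through views() sees the
payload in order, but as separate pieces rather than as one
contiguous, aligned buffer.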

This solution is not compatible with existing I/O interfaces, such as
the sockets interface.  Also, in this approach, data is not necessarily
contiguous in memory or page-aligned.  For example, it cannot in
general be delivered securely to a user-level process without copying
it, since mapping the pages containing the received data into a user
process address space exposes the containing pages in their entirety,
not just the portions occupied by the received data.

However, scatter-gather buffering is a viable copy avoidance technique
for kernel-based applications where few data transformations are
needed.  For file system protocols, effective use of scatter-gather
buffering may require a redesign of the file buffer cache and/or
virtual memory page cache.

3.2 Ad hoc header/payload separation

A more sophisticated NIC might recognize transport and/or ULP headers
in order to separate the headers from the payloads.  The NIC "splits"
each payload from its header and places it in a separate buffer.
Header/payload splitting is useful for copy avoidance because a
virtual memory system may then map the payload to an application
buffer by manipulating virtual memory translations to point to the
payload.  This approach, called "page flipping" or "page remapping",
is an alternative to copying for delivering the data into the
application buffers.  A prerequisite for page flipping is that the
application buffer must be page-aligned and contiguous in virtual
memory.

Header/payload splitting adds significant complexity to the NIC.  If
the network MTU is smaller than the hardware page size, then the
transfer of a page of data is spread across multiple packets. These
packets can arrive at the receiver out-of-order and/or interspersed
with packets from other flows. In order to pack the data contiguously
into pages, the NIC must do intelligent processing of the transport
and ULP.  This approach is "ad hoc" because the NIC must include
support for each transport and ULP that benefits from page flipping.
The NIC processing may be unnecessarily complex for ULPs such as NFS
that use variable-length headers or that require ULP-level state to
decode the incoming headers.

A key disadvantage is that page flipping requires TLB invalidations,
which can be prohibitively expensive on shared memory multiprocessors.


3.3 Explicit header/payload separation

The previous section discussed header/payload separation implemented
in an ad hoc fashion. It is also possible to implement a more
generalized method of header/payload splitting that does not require
the NIC to decode ULP headers.  A generic framing mechanism
implemented at the transport layer or just above it could include
frame header fields that distinguish the ULP payload from the ULP
header.  This would enable a receiving NIC to separate received data
payloads from control information and deposit the received payload
data in contiguous page-aligned target buffer locations. Under most
conditions this is sufficient to allow low-copy implementations of
ULPs such as NFS.
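
As an illustration of such a generic framing mechanism, the sketch
below uses a frame header that records the length of the ULP header
and the length of the ULP payload.  The six-byte layout and the
function names are assumptions made for this example; this document
does not define a wire format.

```python
import struct

# Hypothetical frame header: 2-byte ULP header length, 4-byte payload
# length, both in network byte order.
FRAME_HDR = struct.Struct("!HI")


def build_frame(ulp_header: bytes, payload: bytes) -> bytes:
    """Prefix a ULP message with the hypothetical frame header."""
    return FRAME_HDR.pack(len(ulp_header), len(payload)) + ulp_header + payload


def split_frame(frame: bytes):
    """Separate ULP header and payload without interpreting either one."""
    hdr_len, payload_len = FRAME_HDR.unpack_from(frame, 0)
    start = FRAME_HDR.size
    if len(frame) < start + hdr_len + payload_len:
        raise ValueError("truncated frame")
    ulp_header = frame[start:start + hdr_len]
    payload = frame[start + hdr_len:start + hdr_len + payload_len]
    return ulp_header, payload
```

The receiver needs only the two length fields to find the payload; it
does not need to know whether the ULP is NFS, CIFS, or something else.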

The RDMA approach explored in this document is a more general
extension of this approach. 

3.4 Terminate the ULP in the NIC

If the NIC terminates the ULP, the memory copy is eliminated because
the application communicates I/O requests directly to the NIC.  The
NIC uses the information in the ULP headers to steer ULP payloads to
the correct application buffers.  This is commonly done in the
FibreChannel arena, where FibreChannel NICs (or Host Bus Adapters)
implement an I/O block (e.g., SCSI) transport on the NIC. This
approach effectively migrates all modules of the network stack from
Figure 1 onto the NIC.  FibreChannel implementations use this technique
to deliver high performance with low host overhead.

In such a scheme, the NIC needs to be informed of specific application
buffers. The NIC also needs to be capable of header/payload splitting.

While this approach may be useful for single-function devices, it is
inappropriate for general-purpose NICs.  The NIC must be reprogrammed
or extended to accelerate each ULP.  RDMA offers a general mechanism
that allows RDMA-capable NICs to avoid copies for any ULP that uses
RDMA.

4. Remote Direct Memory Access (RDMA)

This section outlines how RDMA works.

Direct memory access (DMA) is a fundamental technique that is widely
used in high-performance I/O systems.  DMA allows a device to directly
read or write host memory across an I/O interconnect (such as PCI) by
sending DMA commands to the memory controller. No CPU intervention or
copying is required.  For example, when a host requests an I/O read
operation from a DMA-capable storage device, the device uses a DMA
write to place the incoming data directly to memory buffers that the
host provides for that specific operation.  Similarly, when the host
requests an I/O write operation, the device uses a DMA read to fetch
outgoing data from host memory buffers specified by the host for that
operation.

Remote DMA can provide similar functionality in IP networks.  It is
particularly useful when an IP network is used as an I/O interconnect
for IP-capable devices, such as storage devices and their servers.
Conceptually, RDMA allows a network-attached device to read or write
remote memory, e.g., by adding control information that specifies the
buffers to receive transmitted payloads.  The remote NIC decodes this
control information and uses DMA to read/write memory, effectively
translating between the RDMA protocol and the local memory access
protocol.  In an IP network, the RDMA protocol appears at the
transport layer (e.g., as a "shim" above an existing transport
protocol such as TCP) so that a wide variety of upper-layer protocols
can make use of it with minimal changes.

The idea of RDMA has been around under various names for many years.
RDMA is an important component of the VI architecture for user-level
networking, and is also a key element of the Infiniband effort.  VI
illustrates one alternative for a networking API that accommodates
RDMA (see Section 5.1).  However, RDMA generalizes to other network
architectures.  This document addresses issues for incorporating RDMA
into conventional IP protocol stacks.  Note that VI can run over an IP
transport such as TCP, but only if the NIC implements the full
transport.

Since TCP is the most widely used transport for upper-layer protocols,
using RDMA with TCP is the first case to consider. However, RDMA can
be used with other transport protocols, specifically SCTP.

4.1 How RDMA works

An RDMA facility embeds new RDMA control commands into the byte stream
or packet stream.  A full RDMA protocol includes two key commands:
RDMA READ and RDMA WRITE.  The receiving NIC translates these commands
into local memory reads and writes.

For security reasons, it is undesirable to allow transmitters to read
or write arbitrary memory on the receiver.  Any RDMA scheme must
prevent any unauthorized memory accesses.  Most RDMA schemes protect
memory by allowing RDMA reads/writes only to buffers that the receiver
has explicitly identified to the NIC as valid RDMA targets.  The
process of informing the NIC about a buffer is called "registration".

The following steps illustrate the common case of a data transfer
using RDMA WRITE in the context of a request/response storage protocol
such as NFS or iSCSI:

	1. Client application calls an I/O interface, requesting that
	the result of the I/O be put into a buffer B.

	2. Client implementation registers buffer B with the NIC. 

	3. Client sends the I/O READ request to server.

	4. Server issues one or more RDMA WRITE(s) to write I/O data
	into client's buffer B.

	5. Server sends the file system READ response for the I/O.
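
The five steps above can be simulated in a few lines.  This is a toy
model only: SimulatedNic, server_handle_read, and client_read are
hypothetical names, the "network" is a direct function call, and the
region identifier is a small integer chosen by the simulated NIC.

```python
class SimulatedNic:
    """Toy stand-in for an RDMA-capable NIC holding registered buffers."""

    def __init__(self):
        self._regions = {}
        self._next_rid = 1

    def register(self, buffer):
        rid = self._next_rid
        self._next_rid += 1
        self._regions[rid] = buffer
        return rid

    def deregister(self, rid):
        self._regions.pop(rid, None)

    def is_registered(self, rid):
        return rid in self._regions

    def rdma_write(self, rid, offset, data):
        region = self._regions.get(rid)
        if region is None:
            raise PermissionError("region is not registered")
        if offset < 0 or offset + len(data) > len(region):
            raise ValueError("write falls outside the region")
        region[offset:offset + len(data)] = data


def server_handle_read(client_nic, rid, file_data, chunk=1024):
    # Step 4: one or more RDMA WRITEs into the client's buffer.
    for off in range(0, len(file_data), chunk):
        client_nic.rdma_write(rid, off, file_data[off:off + chunk])
    # Step 5: the READ response carries status only, not the data.
    return {"status": "OK", "length": len(file_data)}


def client_read(nic, server_file, length):
    buffer_b = bytearray(length)      # step 1: application buffer B
    rid = nic.register(buffer_b)      # step 2: register B with the NIC
    try:
        # Step 3: the READ request, modeled here as a direct call.
        response = server_handle_read(nic, rid, server_file[:length])
    finally:
        nic.deregister(rid)
    return buffer_b, response
```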

Of course, on each I/O operation the server must know to which client
addresses to write.  One alternative is for the client to pass a
token identifying the target buffer in the request; the server
returns the token in its response.  This is the approach used in
VI implementations.  Another alternative is for the client and server
to each synthesize the token from other unique identifiers present in
the request [TCPRDMA].

Most RDMA schemes use a region identifier (RID) and an offset to
identify the target buffer in a token. The (RID, offset) pair amounts
to a form of virtual address; the receiving NIC translates the virtual
addresses to physical addresses using table lookup.  As such, if a
mapping to a physical page does not appear in the table, there is no
way a transmitter can refer to it.
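
The table lookup can be sketched as follows, assuming 4096-byte pages
and a table that maps each RID to the list of physical pages backing
the region.  The class name RegionTable and its methods are
hypothetical; a real NIC would hold such a table in hardware or
firmware.

```python
PAGE_SIZE = 4096  # assumed page size


class RegionTable:
    """Maps (RID, offset) pairs to physical addresses."""

    def __init__(self):
        self._pages = {}  # RID -> list of physical page base addresses

    def add_region(self, rid, physical_pages):
        self._pages[rid] = list(physical_pages)

    def remove_region(self, rid):
        self._pages.pop(rid, None)

    def translate(self, rid, offset):
        """Return the physical address, or None if no mapping exists."""
        pages = self._pages.get(rid)
        if pages is None or offset < 0:
            return None
        index, within = divmod(offset, PAGE_SIZE)
        if index >= len(pages):
            return None
        return pages[index] + within
```

Because translate() consults only the table, memory that was never
registered has no (RID, offset) name and cannot be reached.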

Once an entry is in the table, the NIC can potentially access the
physical memory of a buffer at any time. As such, the buffer must not
be re-used for other purposes. One alternative is for the OS to "pin"
the buffer in physical memory, allowing the NIC to safely hold the
physical addresses corresponding to the buffer.  Once the region
mapping is removed, the OS can "unpin" the physical memory.

4.2 Unsolicited payloads

NFS, CIFS, and HTTP all support sending data in a WRITE (or POST)
request along with the request. This is optimistic; it assumes the
receiving application has space (other than TCP window) to buffer the
WRITE payload. The payload and transfer are called "unsolicited" in that
they were not requested by the receiver.  RDMA WRITE is
straightforward for solicited data, since the sender can receive the
RID and buffer address in the message that solicits the data, as in
the preceding example.  In the case of unsolicited data, it is not
clear how the sender obtains the RID necessary for an RDMA WRITE.

RDMA may be used for unsolicited data in the following way.  The
receiver may expose a memory region for unsolicited data from each
sender.  The sender, when it wishes to do an unsolicited WRITE, can
RDMA its data into that region. Then, along with the WRITE request,
the sender may pass a pointer (e.g., region offset) to the data it
wrote.  This requires that the receiver (server) pass an RID for
unsolicited data at connection open and supply a new region if the
unsolicited region fills.  Alternatively, the receiver may handle
unsolicited data by responding to the WRITE request with an RDMA
READ (if supported) to fetch the data, as described in Section 4.3.
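
The first scheme requires the sender to keep track of how much of the
exposed region it has used.  A minimal sender-side sketch follows;
the class name UnsolicitedRegion and its methods are hypothetical,
and the sketch assumes the simplest possible policy, in which space
is handed out sequentially and never reused until the receiver
supplies a new region.

```python
class UnsolicitedRegion:
    """Sender-side view of a region exposed by the receiver."""

    def __init__(self, rid, size):
        self.rid = rid
        self.size = size
        self.next_offset = 0

    def reserve(self, length):
        """Return the offset for the next payload, or None if it will not fit."""
        if length <= 0 or self.next_offset + length > self.size:
            return None
        offset = self.next_offset
        self.next_offset += length
        return offset

    def replace(self, new_rid, new_size):
        """Switch to a fresh region supplied by the receiver."""
        self.rid = new_rid
        self.size = new_size
        self.next_offset = 0
```

The offset returned by reserve() is the pointer that the sender would
pass along with its WRITE request after the RDMA WRITE of the data.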

4.3 Reading remote memory

Some RDMA protocols allow one party to read another's memory with an
RDMA READ operation. As with the RDMA WRITE, the NICs and not the CPUs
process the RDMA READs.

The receiving NIC may complete the RDMA READ from the receiver's
memory without interrupting the CPU. The operation is potentially
useful because CPU interrupts are expensive in general-purpose
systems. Switching between the currently executing task and the
interrupt handler involves flushing pipelines, saving and restoring
context, and other overheads.

Although any RDMA READ may be emulated using an RDMA WRITE in the
opposite direction, use of RDMA READ as an alternative has potential
advantages.  First, an RDMA READ requester does not need to export a
region RID to receive the incoming data as an RDMA WRITE.  This is
useful because it allows servers to avoid reserving and exposing
memory regions for large numbers of clients.  Second, RDMA READ allows
the requester to control the order and rate of data transmitted by the
sender or RDMA READ target.

For example, a network storage device or server may implement write
operations by issuing RDMA READs to its client, rather than allowing
the client to use RDMA WRITE to transfer the data to the server.  This
allows the server to control use of the buffer space it allocates for
the transfers, and to pull the data from the client in an order that is
convenient for the server, e.g., to optimize disk performance.
The emerging VI-based Direct Access File System uses RDMA READ for
file write operations, in part for these reasons.
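
One way a server might exploit this control is sketched below.  The
server sorts pending client writes by disk block and issues RDMA
READs only for as many as fit in the buffer space it is willing to
commit.  The PendingWrite fields and the function plan_rdma_reads are
hypothetical and serve only to illustrate the idea.

```python
from typing import List, NamedTuple


class PendingWrite(NamedTuple):
    disk_block: int  # where the data will land on disk
    rid: int         # client region holding the data
    offset: int      # offset of the data within that region
    length: int      # number of bytes to pull


def plan_rdma_reads(pending: List[PendingWrite],
                    buffer_budget: int) -> List[PendingWrite]:
    """Choose which RDMA READs to issue now, in ascending disk order."""
    batch = []
    remaining = buffer_budget
    for write in sorted(pending, key=lambda w: w.disk_block):
        if write.length > remaining:
            break  # stop here so that disk order is preserved
        batch.append(write)
        remaining -= write.length
    return batch
```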

RDMA READ is more complex than RDMA WRITE because it implies that the
target NIC autonomously transmits data back to the requester, e.g.,
without involving a host CPU.  This implies that the NIC implements the
complete transport protocol necessary to send such data without involving
or interfering with the protocol stack in host software.

Use of RDMA READ requires ULPs designed to take advantage of it, as
well as more powerful NICs.  While it offers several benefits, there
may be alternative means to achieve many of the same benefits, such as
simple interrupt-suppressing NICs and ULP features to control
the rate and order of data flow, as provided in the iSCSI draft
specification.

In contrast to RDMA READ, RDMA WRITE is simple and general, does not
require full implementation of the transport on the NIC, and is easily
incorporated into existing request/response protocols with minimal
impact.  The remainder of this document focuses on RDMA WRITE.

4.4 Security

The principal mechanism for RDMA security is region addressing using
RID-based virtual addresses as described above in Section 4.1.  Under
no circumstances may a transmitter access memory that has not been
explicitly registered for RDMA use by the receiver.  Thus RDMA does
not introduce fundamental new security issues beyond the standard
concerns of interception and corruption of data and commands on an
insecure connection.  In this case, the concern is whether RIDs for
registered RDMA regions may be misused.

To further improve safety, each RID may include a sparse (hard to
guess) key value; only transmitters who know the key can read or write
to the memory region.  RIDs protected in this way are essentially weak
capabilities.  NICs may also place access-control lists or permissions
on pages, or limit region access to specific connections.
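
A sparse key can be sketched as a random value that is stored with
the region and compared on every access.  The example below assumes a
16-byte key and uses a constant-time comparison; the class name
ProtectedRegions, the key length, and the method names are
assumptions made for illustration.

```python
import hmac
import secrets
from typing import Tuple

KEY_BYTES = 16  # assumed key length


class ProtectedRegions:
    """Regions that can be accessed only by presenting the right key."""

    def __init__(self) -> None:
        self._keys = {}
        self._next_index = 1

    def register(self) -> Tuple[int, bytes]:
        """Create a region and return its (index, key) pair, i.e., its RID."""
        index = self._next_index
        self._next_index += 1
        key = secrets.token_bytes(KEY_BYTES)  # hard to guess
        self._keys[index] = key
        return index, key

    def check(self, index: int, key: bytes) -> bool:
        """Return True only if the key matches the registered region."""
        expected = self._keys.get(index)
        if expected is None:
            return False
        return hmac.compare_digest(expected, key)
```

Such a key deters guessing but offers no protection against an
eavesdropper who observes a valid RID on the wire.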

For real security on untrusted networks, the RDMA protocol may be
protected in-transit using security and endpoint authentication
features at the transport layer or below, such as TLS or IPsec.

5. RDMA APIs

Direct I/O to application buffers requires an interface for
registering buffers with the NIC and receiving notification that RDMA
transfers have completed.  It is straightforward to devise internal
kernel interfaces to enable use of RDMA for kernel-based ULPs.
However, use of RDMA by user-space applications may require extensions
to existing kernel networking APIs.  For example, the Berkeley Unix
sockets [Stevens] interface, as currently specified, does not directly
support RDMA.

5.1 The VI interface

The VI programming interface [VI] supports both message passing and
RDMA. The VI interface has calls for registering and pinning buffers.
The interface supports both polling and asynchronous notification of
events, e.g., RDMA completions.  The VI interface does not specify the
wire protocol and allows a variety of protocols, including IP
protocols.

The VI interface assumes that user-space programs may directly access
the NIC without transitioning to kernel mode. This precludes use of
the full VI API in conjunction with conventional TCP/IP protocol
stacks.  However, one option is to supplement the socket interface
with RDMA-related elements of the VI interface.

5.2 Winsock Direct

The Winsock Direct API, available on Windows 2000, is an extension of
the sockets interface that supports reliable messages and RDMA.

6. Implementing RDMA

Conceptually, the RDMA abstraction belongs at the transport layer so
that it generalizes to multiple ULPs.  The sending side of the RDMA
protocol is straightforward to implement at the boundary between the
ULP and the underlying transport, i.e., as a "shim" to TCP.  However,
the key aspects of the receiving side of an RDMA protocol are
implemented within the NIC, a link-level device that is logically
below the transport layer.  This is the crux of the problem for
implementing RDMA.

Transport-level support for enhanced framing (e.g., in TCP) would be
useful for implementing RDMA.  For RDMA to be effective, the receiving
NIC must be able to read and decode the control information necessary
for it to implement RDMA.  At minimum, this requires it to recognize
transport-layer headers and identify RDMA control headers embedded in
the incoming data.  It is trivial to locate these headers within an
ordered byte stream using a simple byte counting method (length field)
for framing.  The difficulty is that packets may arrive at the RDMA
receiver (NIC) out of order, and some or all of the transport-layer
facility to reorder data may be implemented above the NIC, e.g., in
host software, as shown in Figure 1.  Thus there must be some
mechanism that enables the receiving NIC to retain or recover its
ability to locate RDMA headers in the presence of sequence holes,
i.e., when packets arrive out of order.
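
The byte counting method can be sketched as follows, assuming a
hypothetical framing in which each RDMA header begins with a 4-byte
length field giving the number of bytes that follow it.  The function
name frame_offsets is likewise an assumption.

```python
import struct

LEN_FIELD = struct.Struct("!I")  # hypothetical 4-byte length field


def frame_offsets(stream: bytes):
    """Return the stream offset of every complete frame, in order."""
    offsets = []
    pos = 0
    while pos + LEN_FIELD.size <= len(stream):
        (body_len,) = LEN_FIELD.unpack_from(stream, pos)
        if pos + LEN_FIELD.size + body_len > len(stream):
            break  # the rest of this frame has not arrived yet
        offsets.append(pos)
        pos += LEN_FIELD.size + body_len
    return offsets
```

Each header is found only by adding up the lengths of all frames
before it.  If any earlier bytes are missing, the walk cannot
continue, which is exactly the sequence hole problem described above.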

One option is for the NIC to buffer out-of-order data until any late
packets arrive, allowing the NIC to recover any lost framing
information.  Note that this does not preclude delivering the
out-of-order data to the host along a slow path that does not benefit
from RDMA.  Keeping a copy of the data until all sequence holes are
filled allows the NIC to traverse the RDMA headers in the data stream,
positioning it to locate subsequent RDMA headers and re-establish the
RDMA fast path.  If the NIC does not have sufficient memory to buffer
the data, it may discard it, forcing the sender to retransmit more of
the data after a sequence hole.
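
The buffering option can be sketched as a small reorder buffer with a
fixed capacity.  This is a simplified model: the class name
ReorderBuffer is hypothetical, sequence numbers are plain integers
that never wrap, and segments are assumed never to overlap partially.

```python
class ReorderBuffer:
    """Holds out-of-order segments until the sequence hole is filled."""

    def __init__(self, next_seq=0, capacity=65536):
        self.next_seq = next_seq   # next in-order byte expected
        self.capacity = capacity   # bytes the NIC is willing to hold
        self._held = {}            # starting sequence number -> data
        self._held_bytes = 0

    def receive(self, seq, data):
        """Accept one segment; return the bytes that are now in order."""
        if seq < self.next_seq or seq in self._held:
            return b""  # duplicate of data already seen
        if seq != self.next_seq:
            if self._held_bytes + len(data) > self.capacity:
                return b""  # no room: discard, the sender will retransmit
            self._held[seq] = data
            self._held_bytes += len(data)
            return b""
        ready = [data]
        self.next_seq += len(data)
        # Release any held segments that are now contiguous.
        while self.next_seq in self._held:
            chunk = self._held.pop(self.next_seq)
            self._held_bytes -= len(chunk)
            ready.append(chunk)
            self.next_seq += len(chunk)
        return b"".join(ready)
```

Once receive() returns in-order bytes, the NIC can resume walking the
RDMA headers from a known frame boundary.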

A second option is to integrate framing support into the transport,
allowing the receiver to locate RDMA headers even when packets arrive
out of order.  Note that every packet must contain an RDMA header for
this approach to be fully general.  For example, consider a packet
carrying an RDMA header that applies to data in subsequent packets.
Even with enhanced framing, if the packet containing the RDMA header
is lost, the NIC cannot correctly apply the RDMA operation to the
arriving data until it receives the RDMA header.
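
When every packet carries its own RDMA header, each packet can be
placed independently of the others.  The sketch below assumes a
hypothetical 14-byte header holding the RID, the region offset, and
the payload length, and assumes that the header always sits at the
start of the packet; neither the layout nor the function name
place_packet comes from any proposal cited here.

```python
import struct

# Hypothetical per-packet RDMA header: 4-byte RID, 8-byte region
# offset, 2-byte payload length, in network byte order.
RDMA_HDR = struct.Struct("!IQH")


def place_packet(regions, packet):
    """Write one packet's payload into its region; return True on success.

    regions maps each registered RID to a bytearray.
    """
    if len(packet) < RDMA_HDR.size:
        return False
    rid, offset, length = RDMA_HDR.unpack_from(packet, 0)
    payload = packet[RDMA_HDR.size:RDMA_HDR.size + length]
    region = regions.get(rid)
    if region is None or len(payload) != length:
        return False
    if offset + length > len(region):
        return False
    region[offset:offset + length] = payload
    return True
```

Because each packet names its own destination, the second half of a
transfer can be placed correctly before the first half arrives.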

Several alternatives have been proposed for integrating framing into
TCP.  These include introducing a new TCP option [TCPRDMA] or
constraining the TCP sender's selection of segment boundaries to
correspond with framing boundaries [VITCP].  Each of these
approaches would have some impact on TCP implementations and
APIs, and some of them also extend the wire protocol.

The TCP options approach requires a minor extension of the TCP wire
protocol, and modification to both the sender and the receiver, which
is especially painful considering today's inflexible in-kernel TCP
implementations.  The TCP options approach does not break backward
compatibility since unmodified endpoints will not negotiate the
option. Also, the options information is regarded only as an
optimization; it is not required for the application to parse the TCP
stream.

7. Conclusion

Remote DMA provides for efficient placement of data in memory.  The
NIC writes data into memory with the proper alignment.  Furthermore,
the NIC can often place data directly into application buffers.

The Remote DMA abstraction provides a generalized mechanism useful with
many higher level protocols such as NFS, without the need for ULP
support in the NIC, and with only minor extensions to the ULP
implementations.


Authors' Addresses

Constantine Sapuntzakis
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA

Phone: +1 408 525 5497
Email: csapuntz@cisco.com

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Jeff Chase
Department of Computer Science
Duke University
Durham, NC 27708-0129
USA

Phone: +1 919 660 6559
Email: chase@cs.duke.edu

References

[ALF] D. D. Clark and D. L. Tennenhouse, "Architectural considerations
for a new generation of protocols," in SIGCOMM Symposium on
Communications Architectures and Protocols, (Philadelphia,
Pennsylvania), pp. 200--208, IEEE, Sept. 1990.  Computer
Communications Review, Vol. 20(4), Sept. 1990.

[Brustoloni] J. Brustoloni and P. Steenkiste. "Effects of buffering
semantics on I/O performance," in Operating System Design and
Implementation (OSDI), Seattle, WA, Oct 1996.

[Chase] J. Chase, A. Gallatin, and Ken Yocum, "End-system
Optimizations for High-Speed TCP", IEEE Communications special
issue on high-speed TCP, 2001.
http://www.cs.duke.edu/ari/publications/end-system.ps (or .pdf).

[CIFS] Paul Leach, "A Common Internet File System (CIFS/1.0) Protocol
Preliminary Draft",
http://www.cifs.com/specs/draft-leach-cifs-v1-spec-01.txt, December
1997

[HTTP] J. Gettys et al., "Hypertext Transfer Protocol - HTTP/1.1",
RFC 2616, June 1999

[NFSv3] B. Callaghan, "NFS Version 3 Protocol Specification",
RFC 1813, June 1995

[RPC] R. Srinivasan, "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 1831, August 1995

[iSCSI] J. Satran, et al., "iSCSI", draft-ietf-ips-iscsi-01.txt

[Stevens] W. Richard Stevens, "Unix Network Programming Volume 1,"
Prentice Hall, 1998, ISBN 0-13-490012-X

[TCP] J. Postel, "Transmission Control Protocol - DARPA
Internet Program Protocol Specification", RFC 793, September 1981

[TCPRDMA] C. Sapuntzakis and D. Cheriton, "TCP RDMA option",
http://www.ietf.org/internet-drafts/draft-csapuntz-tcprdma-00.txt

[Winsock Direct] "Winsock Direct Specification", Windows 2000 DDK,
http://www.microsoft.com/ddk/ddkdocs/win2K/wsdspec_1h66.htm

[VI] Virtual Interface Architecture Specification version 1.0,
http://www.viarch.org/


[VITCP] DiCecco, S., et al., "VI/TCP (Internet VI)", 
draft-dicecco-vitcp-01.txt, November 2000