Over the next few years, new high-speed network standards -- primarily Gigabit Ethernet -- will consolidate an order-of-magnitude gain in network performance already achieved with specialized cluster interconnects such as Myrinet and SCI. As these technologies gain acceptance in LANs and server farms, they will place new performance pressure on network software. Although the latest desktop-class computers are capable of outstanding I/O performance, there is little quantitative basis to (1) predict the performance they will actually deliver using standard TCP/IP networking on the new generation of networks, (2) quantify the importance of proposed optimizations (e.g., Jumbo Frames, zero-copy buffering, checksum offloading) to achieving the potential hardware performance, or (3) judge when alternatives such as user-level networking (e.g., VIA) are justified. In most cases, published performance results are based on research prototypes using previous-generation technology.
This paper presents experiences with high-speed TCP/IP networking on a gigabit-per-second Myrinet network. Our work is based on the Trapeze messaging system [10,5,1,9], which consists of a messaging library and custom firmware for Myrinet. Using Trapeze firmware, Myrinet delivers communication performance at the limit of I/O bus speeds on many platforms, closely approaching the full gigabit-per-second wire speed on the most powerful hosts. This makes Trapeze/Myrinet a good vehicle for probing the limits of both the hardware and the networking software. In the experiments presented here, we exercised a Trapeze/Myrinet network with a network device driver supporting a standard kernel-based TCP/IP protocol stack on a range of DEC Alpha and Intel-based platforms. Our purpose is to provide a quantitative snapshot of the current state of the art for point-to-point TCP/IP communication on short-haul networks with low error rates, low latency, and gigabit-per-second bandwidth.
The kernel used in our experiments is FreeBSD 4.0, a descendant of the Berkeley 4.4 BSD code base, which incorporates several years' worth of TCP/IP refinements. It is now widely accepted that current TCP implementations are capable of delivering a high percentage of available link speeds with large transfers, reflecting the success of these earlier efforts. However, on gigabit-per-second networks the performance of even the best TCP/IP implementations is dependent on key optimizations for low-overhead data movement, both above and below the protocol stack. One goal of this paper is to provide quantitative data to support insights into the effects and importance of these optimizations on current workstation/PC technology.
This paper outlines the key optimizations needed to realize the hardware's potential with TCP/IP, their implementation in the network interface, network driver, and kernel socket code, and their effect on delivered TCP bandwidth, UDP latency, and CPU utilization at the sender and receiver. Below the TCP/IP stack, the Trapeze NIC firmware and network driver support page-aligned payload reception, interrupt suppression, large frames (MTUs), TCP checksum offloading, and an adaptive message pipelining scheme that balances low latency and high bandwidth. Above the protocol stack, at the socket layer, we have implemented new kernel support for zero-copy data movement in TCP as an extension to a zero-copy stream interface implemented by John Dyson. We show the effect of each of these factors on TCP/IP networking performance. We also report some results using similar features with Gigabit Ethernet adapters and switches from Alteon Networks.
Using Trapeze/Myrinet with zero-copy sockets, netperf attained a peak point-to-point bandwidth of 956 Mb/s, close to link speed, on a 500 MHz Alpha 21264 PC platform equipped with prototype LANai-5 adapters from Myricom. At this speed, bandwidth is limited by the LANai-5 CPU. Newer controllers with upgraded CPUs promise still higher bandwidths. In fact, we measured a bandwidth of 988 Mb/s on the same platform over the Alteon network, which uses a faster CPU on the adapters. The previous point-to-point record reported at netperf.org was 750 Mb/s, measured on a pair of mainframe-class SMP servers interconnected by HiPPI. We are not aware of any better result on public record.
This paper is organized as follows. Section 2 gives an overview of the Trapeze network interface, and Section 2.2 outlines the various optimizations for low-overhead TCP/IP communication. Section 3 presents performance results. We conclude in Section 4.