
TCP Overhead

To better understand the costs behind the CPU utilizations presented in Figures 4 and 3, we used iprobe to derive a breakdown of receiver overheads on the Miata for selected MTU sizes, with bandwidth held in the 300-400 Mb/s range by a slow sender. Iprobe (Instruction Probe) is an on-line profiling tool developed by the performance group (High Performance Servers/Benchmark Performance Engineering) at Digital/Compaq. It uses the Digital Alpha on-chip performance counters to report detailed execution breakdowns with low overhead (3%-5%), using techniques similar to those reported in [2]. We gathered our data using a local port of iprobe_suite-4.0 to FreeBSD; this port will be integrated into the next release of iprobe.
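As a concrete illustration of this style of counter-based profiling, the sketch below counts CPU cycles spent in a measured region. It is only a sketch: it uses the Linux perf_event_open(2) interface as a stand-in, not the Alpha/FreeBSD counter interface that iprobe actually uses, and the choice of event and flags is an assumption for illustration.

/* Sketch of counter-based profiling in the spirit of iprobe.
 * Assumes Linux perf_event_open(2); iprobe itself reads the Alpha
 * on-chip counters through a different, platform-specific interface. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long cycles;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 0;     /* include kernel time; most TCP cost is there */

    fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region to measure, e.g. a recv() loop ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &cycles, sizeof(cycles)) == sizeof(cycles))
        printf("cycles in measured region: %lld\n", cycles);
    close(fd);
    return 0;
}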


  
Figure 5: TCP Receiver CPU Utilization Breakdown

Figure 5 shows the breakdown of receiver overhead into five categories: data movement overheads for copying and checksumming, interrupt handling, virtual memory costs (buffer page allocation and/or page remapping), Trapeze driver overheads, and TCP/IP protocol stack overheads. With a 1500-byte MTU, the Miata is near 80% saturation at a bandwidth of 300 Mb/s. While the overhead can be reduced somewhat by checksum offloading and interrupt suppression, about 55% of CPU time goes to unavoidable packet-handling overheads in the driver and TCP/IP stack and to data movement costs at the socket layer. With an 8KB payload, the bandwidth level increases to 360 Mb/s, while CPU time spent in packet handling drops from 55% to 24%. Data movement overheads grow slightly due to the higher bandwidth, but the larger MTU opens the opportunity to eliminate them almost entirely by enabling zero-copy optimizations. While the zero-copy optimization has some cost in VM page remapping, the reduced memory system contention causes other overheads to drop slightly, leaving utilization in the 24% range if checksums are disabled (checksum offloading is not supported on the LANai-4 NICs used in this experiment). Again, these measurements reinforce the inadequacy of the standard 1500-byte Ethernet MTU for high-speed networking, and the importance of the Jumbo Frames standard.
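To make the checksumming portion of the data movement cost concrete, the sketch below is the textbook 16-bit one's-complement Internet checksum in the style of RFC 1071. It is not the FreeBSD kernel routine measured in these experiments, but it shows why a software checksum must touch every payload byte, and hence why offloading it to a NIC that supports it removes a per-byte cost from the host CPU.

/* Textbook 16-bit one's-complement Internet checksum (RFC 1071 style).
 * Every payload byte is read, so the pass costs memory bandwidth and
 * CPU time proportional to the data transferred. */
#include <stddef.h>
#include <stdint.h>

uint16_t inet_checksum(const void *buf, size_t len)
{
    const uint16_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {            /* sum 16-bit words */
        sum += *p++;
        len -= 2;
    }
    if (len == 1)                /* pad an odd trailing byte */
        sum += *(const uint8_t *)p;

    sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}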

The rightmost set of bars in Figure 5 shows the overhead breakdown for 57K MTUs at a bandwidth of 390 Mb/s. While data movement overheads increase slightly due to the higher bandwidth, these costs can be eliminated with page remapping, which increases VM overheads but again causes the other non-VM overheads to drop slightly due to reduced memory system contention. In the zero-copy experiment, the larger MTU does not change VM page remapping costs relative to the 8KB MTU, since these costs are proportional to the number of pages of data transferred rather than the number of packets. However, per-packet TCP/IP and driver overheads drop from 8% to just 3% of CPU, even as bandwidth increases by about 10%. The Miata can handle the full 390 Mb/s with a comfortable 10% CPU utilization.
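The drop in per-packet protocol and driver costs tracks the packet rate directly. The back-of-the-envelope sketch below (payload sizes and bandwidths approximated from the figures cited above; headers, ACKs, and retransmissions ignored) computes how many packets per second the receiver must process at each MTU.

/* Rough packet rates at the measured bandwidths, illustrating why
 * per-packet TCP/IP and driver costs shrink as the MTU grows.
 * Payload sizes are approximations, not exact frame formats. */
#include <stdio.h>

int main(void)
{
    const double mbps[]    = { 300.0, 360.0, 390.0 };     /* delivered bandwidth */
    const double payload[] = { 1460.0, 8192.0, 57344.0 }; /* approx. bytes per packet */
    int i;

    for (i = 0; i < 3; i++) {
        double pkts_per_sec = (mbps[i] * 1e6 / 8.0) / payload[i];
        printf("%5.0f Mb/s with %6.0f-byte payloads: ~%8.0f packets/s\n",
               mbps[i], payload[i], pkts_per_sec);
    }
    return 0;
}

At 300 Mb/s a 1500-byte MTU forces the receiver through roughly 26,000 packets per second, while the 57K MTU at 390 Mb/s needs fewer than 1,000, which is consistent with the per-packet overheads falling from 8% to 3% of CPU.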

