
Zero-Copy Sockets

Conventional TCP/IP communication incurs a high cost copying data between kernel buffers and user process virtual memory at the socket layer. This cost has motivated techniques that reduce or eliminate copying by remapping pages between the user process and the kernel when size and alignment properties allow [6,4,7]. A page remapping scheme should preserve the copy semantics of the existing socket interface.

In general, zero-copy optimizations assume MTUs matched to the page size of the endstation hardware and operating system. Ideally, each packet payload is an exact multiple of the page size and is stored in buffers that naturally align on page boundaries. On the receive side, the NIC must deposit headers and payload into separate buffers, leaving the payload page-aligned. This can be done with special support on the NIC to recognize TCP/IP packets, or by constructing receive mbuf chains that optimistically assume that received packets are TCP packets. In Trapeze, the sending host explicitly separates the header and payload portions of each packet: the Trapeze driver optimistically assumes that the first mbuf of an outgoing chain holds header data and places its contents in the control message. The link layer preserves this separation on the receiving side.
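The following sketch illustrates this header/payload split on the send side. It is illustrative only: the struct layouts and the tpz_* names are simplified stand-ins rather than the actual Trapeze or BSD mbuf interfaces. The idea is that the driver copies the first mbuf (assumed to hold headers) into the small control message and attaches the remaining mbufs (assumed to reference page-aligned payload) by reference.

    #include <stddef.h>
    #include <string.h>

    #define TPZ_CTRL_MAX   128   /* assumed control-message capacity        */
    #define TPZ_MAX_PAGES    4   /* assumed payload attachments per message */

    struct mbuf {                /* simplified stand-in for the BSD mbuf    */
        struct mbuf *m_next;
        char        *m_data;
        int          m_len;
    };

    struct tpz_msg {             /* hypothetical outgoing link-level message */
        char   ctrl[TPZ_CTRL_MAX];        /* header bytes, carried in the control message */
        int    ctrl_len;
        void  *payload[TPZ_MAX_PAGES];    /* page-aligned payload buffers, by reference   */
        int    payload_len[TPZ_MAX_PAGES];
        int    npages;
    };

    /* Split an outgoing chain: first mbuf -> control message, rest -> payload.
     * Returns 0 on success, -1 if the chain does not fit the message. */
    static int
    tpz_attach_chain(struct tpz_msg *msg, struct mbuf *m)
    {
        if (m == NULL || m->m_len < 0 || m->m_len > TPZ_CTRL_MAX)
            return -1;

        /* Optimistically treat the first mbuf as protocol headers. */
        memcpy(msg->ctrl, m->m_data, (size_t)m->m_len);
        msg->ctrl_len = m->m_len;

        /* Attach the remaining mbufs as payload; the link layer keeps
         * this header/payload separation visible to the receiver. */
        msg->npages = 0;
        for (m = m->m_next; m != NULL; m = m->m_next) {
            if (msg->npages == TPZ_MAX_PAGES)
                return -1;
            msg->payload[msg->npages]     = m->m_data;
            msg->payload_len[msg->npages] = m->m_len;
            msg->npages++;
        }
        return 0;
    }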

We implemented zero-copy TCP/IP extensions at the socket layer in the FreeBSD 4.0 kernel, using code developed by John Dyson for zero-copy I/O through the read/write system call interface. The zero-copy extensions require some buffering support in the network driver, but are otherwise independent of the underlying network, assuming that it supports sufficiently large MTUs and page-aligned sends and receives. Section 3 reports results from zero-copy TCP experiments on both Trapeze/Myrinet and Alteon Gigabit Ethernet hardware.

The page remapping occurs in a variant of the uiomove kernel routine, which directs the movement of data to and from process virtual memory for all forms of the read and write system calls. Our zero-copy socket code is implemented as a new case alongside Dyson's code in uiomoveco, which the socket-layer sosend and soreceive routines invoke when a process requests that the kernel transfer a page or more of data to or from a page-aligned user buffer.
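A minimal sketch of the eligibility test, assuming a 4 KB page size; the helper name is hypothetical, but the check mirrors the conditions above: the user buffer must start on a page boundary and the transfer must cover at least one full page, otherwise the socket layer falls back to ordinary copying with uiomove.

    #include <stddef.h>
    #include <stdint.h>

    #define ZC_PAGE_SIZE 4096u                 /* assumed page size */
    #define ZC_PAGE_MASK (ZC_PAGE_SIZE - 1)

    /* Hypothetical predicate: remap only page-aligned, page-or-larger
     * transfers; anything else takes the ordinary copy path. */
    static int
    zcopy_eligible(const void *uaddr, size_t len)
    {
        return ((uintptr_t)uaddr & ZC_PAGE_MASK) == 0 && len >= ZC_PAGE_SIZE;
    }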

For a zero-copy read, uiomoveco maps kernel buffer pages directly into the process address space. If the read is from a file, it creates a copy-on-write mapping to a page in FreeBSD's unified buffer cache; the copy-on-write preserves the file data in case the user process stores to the remapped page. For a read from a socket, copy-on-write is unnecessary because there is no need to retain the kernel buffer after the read; ordinarily soreceive simply frees the kernel buffers once the data has been delivered to the user process. The remapping case instead frees just the mbuf headers and any physical page frames that previously backed remapped virtual pages in the user buffer. Thus most receive-side page remappings actually trade page frames between the process and the kernel buffer pool, preserving equilibrium.
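The receive-side bookkeeping can be summarized as follows. This is a sketch under simplifying assumptions: the user_page_*, mbuf_*, and kernel_pool_free primitives are hypothetical stand-ins for the FreeBSD VM and mbuf operations, shown only to make the frame exchange explicit.

    struct proc;      /* the receiving process               */
    struct vm_page;   /* a physical page frame               */
    struct mbuf;      /* an mbuf referencing a payload page  */

    /* Hypothetical primitives; names and signatures are illustrative. */
    struct vm_page *user_page_unmap(struct proc *p, void *uva);  /* detach old frame    */
    void            user_page_map(struct proc *p, void *uva,
                                  struct vm_page *frame);        /* install new frame   */
    struct vm_page *mbuf_payload_frame(struct mbuf *m);
    void            mbuf_free_header_only(struct mbuf *m);
    void            kernel_pool_free(struct vm_page *frame);     /* back to buffer pool */

    /* Flip one page-sized payload into the user buffer: the kernel buffer
     * frame replaces the frame that previously backed the user page, and
     * the displaced frame is returned to the pool, preserving equilibrium. */
    static void
    zcopy_receive_page(struct proc *p, void *uva, struct mbuf *m)
    {
        struct vm_page *old_frame = user_page_unmap(p, uva);
        struct vm_page *payload   = mbuf_payload_frame(m);

        user_page_map(p, uva, payload);   /* no copy-on-write needed here */
        kernel_pool_free(old_frame);      /* trade frames with the pool   */
        mbuf_free_header_only(m);         /* keep the payload page mapped */
    }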

On the send side, copy-on-write is needed because copy semantics allow the sending process to overwrite its buffer as soon as the send call returns, even though the network stack may still reference the pages. The send-side code maps each whole page from the user buffer into the kernel address space, references it with an external mbuf, and marks the page copy-on-write. The mbuf chains and their pages are then passed through the TCP/IP stack to the network driver, which attaches them to outgoing messages as payloads. When each mbuf is freed on transmit completion, the external free routine releases the page's copy-on-write mapping. The new socket-layer code handles only anonymous virtual memory pages; we do not support zero-copy transmission of memory backed by mapped files, since that would duplicate the functionality of the sendfile routine already implemented by David Greenman.
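A corresponding sketch of the send-side path appears below. Again the helpers (cow_wire_user_page, cow_release_user_page, ext_mbuf_attach) are hypothetical stand-ins rather than the FreeBSD 4.0 interfaces; the point is that the external mbuf's free routine is what eventually drops the copy-on-write protection once the driver has finished with the page.

    #include <stddef.h>

    #define ZC_PAGE_SIZE 4096u   /* assumed page size */

    struct proc;
    struct mbuf;

    /* Hypothetical primitives; names and signatures are illustrative. */
    void        *cow_wire_user_page(struct proc *p, void *uva);   /* map into kernel, mark COW */
    void         cow_release_user_page(void *kva);                /* undo the COW protection   */
    struct mbuf *ext_mbuf_attach(void *kva, size_t len,
                                 void (*extfree)(void *));        /* external-storage mbuf     */

    /* Wrap one page of the user's send buffer in an external mbuf. The
     * page is write-protected so a later store by the process triggers a
     * copy rather than corrupting the in-flight payload; when the driver
     * frees the mbuf on transmit completion, cow_release_user_page runs
     * as the external free routine and releases the protection. */
    static struct mbuf *
    zcopy_send_page(struct proc *p, void *uva)
    {
        void *kva = cow_wire_user_page(p, uva);
        return ext_mbuf_attach(kva, ZC_PAGE_SIZE, cow_release_user_page);
    }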

