mirror of https://github.com/torvalds/linux.git, synced 2025-11-04 02:30:34 +02:00
			
		
		
		
Documentation for this feature was missing from the patchset.
Copied a lot from the netdev 2.1 paper, addressing some small
interface changes since then.

Changes
  v1 -> v2
    - change email discussion URL format
    - clarify that u32 counter is per-syscall, unsigned and
      wraps after UINT_MAX calls
    - describe errno on send failure specific to MSG_ZEROCOPY
    - a few very minor rewordings
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
		
	
			
		
			
				
	
	
		
257 lines, 8.6 KiB, ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");
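
As a slightly fuller sketch of the same step, the helper below wraps the
call and leaves error handling to the caller. The name enable_zerocopy()
is illustrative only. The fallback value 60 for SO_ZEROCOPY is an
assumption that holds for architectures using the asm-generic socket
headers; prefer the definition from the system headers where available.

```c
#include <errno.h>
#include <sys/socket.h>

/* Assumption: older libc headers may lack SO_ZEROCOPY. 60 is the
 * asm-generic value; some architectures use a different number. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif

/* Hypothetical helper: returns 0 on success, or -1 with errno set if
 * the kernel or socket type does not accept the option. */
static int enable_zerocopy(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}
```

A caller that gets -1 back can simply keep using regular copying sends
on that socket.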


Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds their ulimit on locked pages.
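
One way an application might react to ENOBUFS is to retry the same data
as a regular copying send. The sketch below shows that idea. The helper
name send_zc_or_copy() and the retry policy are assumptions made for
illustration, not something the kernel or this interface prescribes.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Assumption: fallback for libc headers that predate MSG_ZEROCOPY. */
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical helper: try a zerocopy send and, if it fails with
 * ENOBUFS, retry once as a regular copying send. *zc_ok is set to 1
 * if the MSG_ZEROCOPY attempt itself sent data, else to 0. */
static ssize_t send_zc_or_copy(int fd, const void *buf, size_t len,
			       int *zc_ok)
{
	ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

	*zc_ok = ret > 0;
	if (ret == -1 && errno == ENOBUFS)
		ret = send(fd, buf, len, 0);	/* plain copy */
	return ret;
}
```

The caller must keep the buffer untouched until the completion arrives
only when the zerocopy attempt succeeded; after the copying retry the
buffer can be reused at once.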


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.
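
A minimal sketch of such mixing is to pick the flag per call based on
buffer size. The threshold below reuses the rough 10 KB figure quoted
earlier in this document; both the constant ZC_MIN_BYTES and the helper
zc_send_flags() are hypothetical names, and the real break-even point
should be measured on the target system.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Assumption: fallback for libc headers that predate MSG_ZEROCOPY. */
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Assumption: approximate break-even size; tune per workload. */
#define ZC_MIN_BYTES (10 * 1024)

/* Hypothetical helper: flags to pass to send() for len bytes. */
static int zc_send_flags(size_t len)
{
	return len >= ZC_MIN_BYTES ? MSG_ZEROCOPY : 0;
}
```

A call site would then read send(fd, buf, len, zc_send_flags(len)).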


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
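
Because the counter identifies send calls, a process can mirror it in
user space to remember which buffer belongs to which call. The sketch
below is one possible bookkeeping scheme, not part of the kernel
interface: struct zc_tracker, ZC_RING and both helpers are hypothetical,
and it assumes the kernel counter starts at zero on a new socket and
that no more than ZC_RING zerocopy sends are in flight at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on in-flight zerocopy sends. A power of two keeps
 * id % ZC_RING continuous across the 32-bit wrap. */
#define ZC_RING 64

struct zc_tracker {
	uint32_t next_id;		/* mirrors the per-socket counter */
	void *pending[ZC_RING];		/* buffer lent to the kernel, by id */
};

/* Call after each MSG_ZEROCOPY send that returned > 0. Returns the id
 * that the completion for this call is expected to carry. */
static uint32_t zc_track_send(struct zc_tracker *t, void *buf)
{
	uint32_t id = t->next_id++;	/* unsigned: wraps to 0 */

	t->pending[id % ZC_RING] = buf;
	return id;
}

/* Mark every buffer in the inclusive range [lo, hi] as reusable. The
 * count is computed in uint32_t arithmetic, so a range that straddles
 * the wrap (hi numerically below lo) is handled. Returns the count. */
static uint32_t zc_track_complete(struct zc_tracker *t, uint32_t lo,
				  uint32_t hi)
{
	uint32_t n = hi - lo + 1, i;

	for (i = 0; i < n; i++)
		t->pending[(uint32_t)(lo + i) % ZC_RING] = NULL;
	return n;
}
```

In a real program the NULL assignment would be replaced by whatever
releases or recycles the buffer.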


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.
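
The non-blocking variant mentioned above can be sketched as a loop that
reads the error queue until it is empty. The helper name zc_drain(), the
callback parameter and the 128-byte control buffer are assumptions made
for this example; the loop relies on recvmsg with MSG_ERRQUEUE failing
with EAGAIN once nothing is queued.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: read all queued notifications without blocking.
 * Calls on_msg() (if not NULL) for each message read. Returns the
 * number of messages read, or -1 on an unexpected error. */
static int zc_drain(int fd, void (*on_msg)(struct msghdr *msg))
{
	/* union guarantees cmsghdr alignment for the control buffer */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	struct msghdr msg;
	int count = 0;

	for (;;) {
		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return count;	/* error queue is empty */
			return -1;
		}
		if (on_msg)
			on_msg(&msg);
		count++;
	}
}
```

A sender could call this every few send calls, passing a callback that
parses the control message.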


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks whether
the new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.
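
The coalescing rule can be illustrated with a small user-space model.
This is a simplified sketch of the behavior described above, not the
kernel code: struct zc_range and zc_try_coalesce() are hypothetical
names, and any additional limits the kernel applies are left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct zc_range {
	uint32_t lo;	/* first completed id in the notification */
	uint32_t hi;	/* last completed id in the notification */
};

/* Simplified model: if [lo, hi] starts right after the range at the
 * tail of the queue, grow the tail and report that no new entry is
 * needed. The comparison is done in uint32_t so it works across the
 * counter wrap. */
static bool zc_try_coalesce(struct zc_range *tail, uint32_t lo,
			    uint32_t hi)
{
	if (lo != (uint32_t)(tail->hi + 1))
		return false;	/* gap or out of order: queue separately */
	tail->hi = hi;
	return true;
}
```

With a tail of 3..5, a new completion for 6 turns the tail into 3..6,
while a completion for 9 would have to be queued on its own.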

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR for IPv4, or SOL_IPV6 with
IPV6_RECVERR for IPv6.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, bar for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* msg is a pointer to the msghdr filled in by recvmsg */
	cm = CMSG_FIRSTHDR(msg);
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is not a transmit completion notification, therefore.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.