mirror of
				https://github.com/torvalds/linux.git
				synced 2025-11-04 02:30:34 +02:00 
			
		
		
		
The documentation explains the need to create internal helpers for syscalls,
and says that they should be called `kern_xyzzy()`. However, the comment in
include/linux/syscalls.h says that they should be named
`ksys_xyzzy()`, and so are all the helpers declared below it. Change the
documentation to reflect this.
Fixes: 819671ff84 ("syscalls: define and explain goal to not call syscalls in the kernel")
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lore.kernel.org/r/20210130014547.123006-1-andrealmeid@collabora.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
		
	
			
		
			
				
	
	
		
.. _addsyscalls:

Adding a New System Call
========================

This document describes what's involved in adding a new system call to the
Linux kernel, over and above the normal submission advice in
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.


System Call Alternatives
------------------------

The first thing to consider when adding a new system call is whether one of
the alternatives might be suitable instead.  Although system calls are the
most traditional and most obvious interaction points between userspace and the
kernel, there are other possibilities -- choose what fits best for your
interface.

 - If the operations involved can be made to look like a filesystem-like
   object, it may make more sense to create a new filesystem or device.  This
   also makes it easier to encapsulate the new functionality in a kernel module
   rather than requiring it to be built into the main kernel.

     - If the new functionality involves operations where the kernel notifies
       userspace that something has happened, then returning a new file
       descriptor for the relevant object allows userspace to use
       ``poll``/``select``/``epoll`` to receive that notification.
     - However, operations that don't map to
       :manpage:`read(2)`/:manpage:`write(2)`-like operations
       have to be implemented as :manpage:`ioctl(2)` requests, which can lead
       to a somewhat opaque API.

 - If you're just exposing runtime system information, a new node in sysfs
   (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
   be more appropriate.  However, access to these mechanisms requires that the
   relevant filesystem is mounted, which might not always be the case (e.g.
   in a namespaced/sandboxed/chrooted environment).  Avoid adding any API to
   debugfs, as this is not considered a 'production' interface to userspace.
 - If the operation is specific to a particular file or file descriptor, then
   an additional :manpage:`fcntl(2)` command option may be more appropriate.  However,
   :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
   this option is best for when the new function is closely analogous to
   existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
   (for example, getting/setting a simple flag related to a file descriptor).
 - If the operation is specific to a particular task or process, then an
   additional :manpage:`prctl(2)` command option may be more appropriate.  As
   with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
   is best reserved for near-analogs of existing ``prctl()`` commands or
   getting/setting a simple flag related to a process.


Designing the API: Planning for Extension
-----------------------------------------

A new system call forms part of the API of the kernel, and has to be supported
indefinitely.  As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls --
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred
way to allow for future extensibility is to include a flags argument to the
system call.  To make sure that userspace programs can safely use flags
between kernel versions, check whether the flags value holds any unknown
flags, and reject the system call (with ``EINVAL``) if it does::

    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
        return -EINVAL;

(If no flags values are used yet, check that the flags argument is zero.)

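The unknown-flags check can be tried out in isolation.  The following is a
minimal userspace sketch rather than kernel code; the flag values, the
``THING_VALID_FLAGS`` mask and the ``xyzzy_check_flags()`` helper are
hypothetical names chosen only for illustration::

    #include <errno.h>

    /* Hypothetical flag bits for an imaginary xyzzy() system call. */
    #define THING_FLAG1 0x1u
    #define THING_FLAG2 0x2u
    #define THING_FLAG3 0x4u
    #define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

    /* Return 0 if every set bit is a known flag, -EINVAL otherwise. */
    static int xyzzy_check_flags(unsigned int flags)
    {
        if (flags & ~THING_VALID_FLAGS)
            return -EINVAL;
        return 0;
    }

Because a bit such as ``0x8`` is rejected today, a later kernel could give it
a meaning without risking that existing programs already pass it by accident.
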
For more sophisticated system calls that involve a larger number of arguments,
it's preferred to encapsulate the majority of the arguments into a structure
that is passed in by pointer.  Such a structure can cope with future extension
by including a size argument in the structure::

    struct xyzzy_params {
        u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
        u32 param_1;
        u64 param_2;
        u64 param_3;
    };

As long as any subsequently added field, say ``param_4``, is designed so that a
zero value gives the previous behaviour, then this allows both directions of
version mismatch:

 - To cope with a later userspace program calling an older kernel, the kernel
   code should check that any memory beyond the size of the structure that it
   expects is zero (effectively checking that ``param_4 == 0``).
 - To cope with an older userspace program calling a newer kernel, the kernel
   code can zero-extend a smaller instance of the structure (effectively
   setting ``param_4 = 0``).

See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
``kernel/events/core.c``) for an example of this approach.


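The two version-mismatch rules can be sketched in plain C.  This is an
illustrative userspace model, not the kernel's implementation:
``xyzzy_copy_params()`` is a hypothetical helper, ``memcpy()`` stands in for
the kernel's user-memory accessors, and the choice of ``-E2BIG`` and
``-EINVAL`` as error codes is an assumption made for the example::

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* The structure as understood by this (hypothetical) kernel version. */
    struct xyzzy_params {
        uint32_t size;
        uint32_t param_1;
        uint64_t param_2;
        uint64_t param_3;
    };

    /* Copy a caller-supplied structure of 'usize' bytes into 'dst'. */
    static int xyzzy_copy_params(struct xyzzy_params *dst,
                                 const void *src, size_t usize)
    {
        const unsigned char *bytes = src;
        size_t ksize = sizeof(*dst);
        size_t i;

        /* The caller must supply at least the size field itself. */
        if (usize < sizeof(dst->size))
            return -EINVAL;

        /* Newer userspace: fields this version does not know must be zero. */
        for (i = ksize; i < usize; i++)
            if (bytes[i] != 0)
                return -E2BIG;

        /* Older userspace: zero-extend the fields it did not supply. */
        memset(dst, 0, ksize);
        memcpy(dst, src, usize < ksize ? usize : ksize);
        return 0;
    }

A caller built against a larger structure succeeds as long as the extra fields
are zero, and a caller built against a smaller one sees the missing fields
treated as zero.
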
Designing the API: Other Considerations
---------------------------------------

If your new system call allows userspace to refer to a kernel object, it
should use a file descriptor as the handle for that object -- don't invent a
new type of userspace object handle when the kernel already has mechanisms and
well-defined semantics for using file descriptors.

If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
then the flags argument should include a value that is equivalent to setting
``O_CLOEXEC`` on the new FD.  This makes it possible for userspace to close
the timing window between ``xyzzy()`` and calling
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
``execve()`` in another thread could leak a descriptor to
the exec'ed program. (However, resist the temptation to re-use the actual value
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
numbering space of ``O_*`` flags that is fairly full.)

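The timing window is easiest to see from the userspace side.  The sketch below
uses the existing ``pipe()``/``pipe2()`` pair as a stand-in for a
descriptor-returning ``xyzzy()``; the three helper function names are
hypothetical, and the code assumes a Linux system with glibc-style headers::

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Racy: another thread may fork() and execve() between the calls. */
    static int open_pipe_racy(int fds[2])
    {
        if (pipe(fds) < 0)
            return -1;
        if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) < 0 ||
            fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
        return 0;
    }

    /* Atomic: both descriptors are created with close-on-exec set. */
    static int open_pipe_atomic(int fds[2])
    {
        return pipe2(fds, O_CLOEXEC);
    }

    /* Return 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
    static int fd_is_cloexec(int fd)
    {
        int fl = fcntl(fd, F_GETFD);

        return fl < 0 ? -1 : !!(fl & FD_CLOEXEC);
    }

Both variants end up with close-on-exec set, but only the second one never
exposes a descriptor without it.
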
If your system call returns a new file descriptor, you should also consider
what it means to use the :manpage:`poll(2)` family of system calls on that file
descriptor. Making a file descriptor ready for reading or writing is the
normal way for the kernel to indicate to userspace that an event has
occurred on the corresponding kernel object.

If your new :manpage:`xyzzy(2)` system call involves a filename argument::

    int sys_xyzzy(const char __user *path, ..., unsigned int flags);

you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::

    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);

This allows more flexibility for how userspace specifies the file in question;
in particular it allows userspace to request the functionality for an
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
giving an :manpage:`fxyzzy(3)` operation for free::

 - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path, ...)
 - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)

(For more details on the rationale of the \*at() calls, see the
:manpage:`openat(2)` man page; for an example of ``AT_EMPTY_PATH``, see the
:manpage:`fstatat(2)` man page.)

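The two equivalences can be demonstrated with :manpage:`fstatat(2)`, which
already follows this pattern.  This is a small userspace sketch, assuming
Linux with glibc-style headers; ``stat_by_path()`` and ``stat_by_fd()`` are
hypothetical wrapper names::

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>

    /* Acts like stat(path, st): the path is resolved relative to the cwd. */
    static int stat_by_path(const char *path, struct stat *st)
    {
        return fstatat(AT_FDCWD, path, st, 0);
    }

    /* Acts like fstat(fd, st): an empty path plus AT_EMPTY_PATH. */
    static int stat_by_fd(int fd, struct stat *st)
    {
        return fstatat(fd, "", st, AT_EMPTY_PATH);
    }

Opening a file and calling ``stat_by_fd()`` on the descriptor should report
the same device and inode numbers as ``stat_by_path()`` on its name.
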
If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
supported even on 32-bit architectures.

If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
it needs to be governed by the appropriate Linux capability bit (checked with
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
page.  Choose an existing capability bit that governs related functionality,
but try to avoid combining lots of only vaguely related functions together
under the same bit, as this goes against capabilities' purpose of splitting
the power of root.  In particular, avoid adding new uses of the already
overly-general ``CAP_SYS_ADMIN`` capability.

If your new :manpage:`xyzzy(2)` system call manipulates a process other than
the calling process, it should be restricted (using a call to
``ptrace_may_access()``) so that only a calling process with the same
permissions as the target process, or with the necessary capabilities, can
manipulate the target process.

Finally, be aware that some non-x86 architectures have an easier time if
system call parameters that are explicitly 64-bit fall on odd-numbered
arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
registers.  (This concern does not apply if the arguments are part of a
structure that's passed in by pointer.)


Proposing the API
-----------------

To make new system calls easy to review, it's best to divide up the patchset
into separate chunks.  These should include at least the following items as
distinct commits (each of which is described further below):

 - The core implementation of the system call, together with prototypes,
   generic numbering, Kconfig changes and fallback stub implementation.
 - Wiring up of the new system call for one particular architecture, usually
   x86 (including all of x86_64, x86_32 and x32).
 - A demonstration of the use of the new system call in userspace via a
   selftest in ``tools/testing/selftests/``.
 - A draft man-page for the new system call, either as plain text in the
   cover letter, or as a patch to the (separate) man-pages repository.

New system call proposals, like any change to the kernel's API, should always
be cc'ed to linux-api@vger.kernel.org.


Generic System Call Implementation
----------------------------------

The main entry point for your new :manpage:`xyzzy(2)` system call will be called
``sys_xyzzy()``, but you add this entry point with the appropriate
``SYSCALL_DEFINEn()`` macro rather than explicitly.  The 'n' indicates the
number of arguments to the system call, and the macro takes the system call name
followed by the (type, name) pairs for the parameters as arguments.  Using
this macro allows metadata about the new system call to be made available for
other tools.

The new entry point also needs a corresponding function prototype, in
``include/linux/syscalls.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long sys_xyzzy(...);

Some architectures (e.g. x86) have their own architecture-specific syscall
tables, but several other architectures share a generic syscall table. Add your
new system call to the generic list by adding an entry to the list in
``include/uapi/asm-generic/unistd.h``::

    #define __NR_xyzzy 292
    __SYSCALL(__NR_xyzzy, sys_xyzzy)

Also update the ``__NR_syscalls`` count to reflect the additional system call, and
note that if multiple new system calls are added in the same merge window,
your new syscall number may get adjusted to resolve conflicts.

The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
system call, returning ``-ENOSYS``.  Add your new system call here too::

    COND_SYSCALL(xyzzy);

Your new kernel functionality, and the system call that controls it, should
normally be optional, so add a ``CONFIG`` option (typically to
``init/Kconfig``) for it. As usual for new ``CONFIG`` options:

 - Include a description of the new functionality and system call controlled
   by the option.
 - Make the option depend on ``EXPERT`` if it should be hidden from normal users.
 - Make any new source files implementing the function dependent on the ``CONFIG``
   option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
 - Double check that the kernel still builds with the new ``CONFIG`` option turned
   off.

To summarize, you need a commit that includes:

 - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
 - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
 - corresponding prototype in ``include/linux/syscalls.h``
 - generic table entry in ``include/uapi/asm-generic/unistd.h``
 - fallback stub in ``kernel/sys_ni.c``


x86 System Call Implementation
------------------------------

To wire up your new system call for x86 platforms, you need to update the
master syscall tables.  Assuming your new system call isn't special in some
way (see below), this involves a "common" entry (for x86_64 and x32) in
``arch/x86/entry/syscalls/syscall_64.tbl``::

    333   common   xyzzy     sys_xyzzy

and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy

Again, these numbers are liable to be changed if there are conflicts in the
relevant merge window.


Compatibility System Calls (Generic)
------------------------------------

For most system calls the same 64-bit implementation can be invoked even when
the userspace program is itself 32-bit; even if the system call's parameters
include an explicit pointer, this is handled transparently.

However, there are a couple of situations where a compatibility layer is
needed to cope with size differences between 32-bit and 64-bit.

The first is if the 64-bit kernel also supports 32-bit userspace programs, and
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
64-bit values.  In particular, this is needed whenever a system call argument
is:

 - a pointer to a pointer
 - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
 - a pointer to a varying sized integral type (``time_t``, ``off_t``,
   ``long``, ...)
 - a pointer to a struct containing a varying sized integral type.

The second situation that requires a compatibility layer is if one of the
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
architecture, for example ``loff_t`` or ``__u64``.  In this case, a value that
arrives at a 64-bit kernel from a 32-bit application will be split into two
32-bit values, which then need to be re-assembled in the compatibility layer.

(Note that a system call argument that's a pointer to an explicit 64-bit type
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)

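For an explicitly 64-bit argument that arrives as two 32-bit halves, the
re-assembly step amounts to a shift and an OR.  The following userspace sketch
only illustrates the arithmetic; the helper names are hypothetical, and which
register carries which half is an architecture detail that the sketch sidesteps
by naming the halves ``lo`` and ``hi`` explicitly::

    #include <stdint.h>

    /* Split a 64-bit value into the halves a 32-bit caller would pass. */
    static void xyzzy_split_u64(uint64_t val, uint32_t *lo, uint32_t *hi)
    {
        *lo = (uint32_t)val;
        *hi = (uint32_t)(val >> 32);
    }

    /* Re-assemble the 64-bit value from its two 32-bit halves. */
    static uint64_t xyzzy_join_u64(uint32_t lo, uint32_t hi)
    {
        return ((uint64_t)hi << 32) | lo;
    }

The cast before the shift matters: shifting a 32-bit ``hi`` by 32 bits without
widening it first would be undefined behaviour in C.
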
The compatibility version of the system call is called ``compat_sys_xyzzy()``,
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
``SYSCALL_DEFINEn()``.  This version of the implementation runs as part of a 64-bit
kernel, but expects to receive 32-bit parameter values and does whatever is
needed to deal with them.  (Typically, the ``compat_sys_`` version converts the
values to 64-bit versions and either calls on to the ``sys_`` version, or both of
them call a common inner implementation function.)

The compat entry point also needs a corresponding function prototype, in
``include/linux/compat.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long compat_sys_xyzzy(...);

If the system call involves a structure that is laid out differently on 32-bit
and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h``
header file should also include a compat version of the structure (``struct
compat_xyzzy_args``) where each variable-size field has the appropriate
``compat_`` type that corresponds to the type in ``struct xyzzy_args``.  The
``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
parse the arguments from a 32-bit invocation.

For example, if there are fields::

    struct xyzzy_args {
        const char __user *ptr;
        __kernel_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

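The conversion from the compat layout to the native one can be modelled in
userspace.  In the sketch below the ``compat_`` typedefs are local stand-ins
with assumed 32-bit widths, the native pointer is held as a plain 64-bit
integer, and ``xyzzy_args_from_compat()`` is a hypothetical helper; real kernel
code would use the kernel's own types and pointer-conversion helpers::

    #include <stdint.h>

    /* Local stand-ins for the kernel's compat types (assumed 32-bit). */
    typedef uint32_t compat_uptr_t;
    typedef int32_t compat_long_t;

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        uint64_t fixed_val;
    };

    /* Native 64-bit layout; the pointer is kept as an integer here. */
    struct xyzzy_args {
        uint64_t ptr;
        int64_t varying_val;
        uint64_t fixed_val;
    };

    static void xyzzy_args_from_compat(struct xyzzy_args *dst,
                                       const struct compat_xyzzy_args *src)
    {
        dst->ptr = src->ptr;                 /* zero-extend the pointer */
        dst->varying_val = src->varying_val; /* sign-extend the long */
        dst->fixed_val = src->fixed_val;     /* same width on both ABIs */
    }

Note the different widening rules: the pointer is zero-extended, while the
signed ``long`` value keeps its sign.
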
The generic system call list also needs adjusting to allow for the compat
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
``__SC_COMP`` rather than ``__SYSCALL``::

    #define __NR_xyzzy 292
    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)

To summarize, you need:

 - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
 - corresponding prototype in ``include/linux/compat.h``
 - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
 - instance of ``__SC_COMP`` not ``__SYSCALL`` in
   ``include/uapi/asm-generic/unistd.h``


Compatibility System Calls (x86)
--------------------------------

To wire up a system call that has a compatibility version on the x86
architecture, the entries in the syscall tables need to be adjusted.

First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
column to indicate that a 32-bit userspace program running on a 64-bit kernel
should hit the compat entry point::

    380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy

Second, you need to figure out what should happen for the x32 ABI version of
the new system call.  There's a choice here: the layout of the arguments
should either match the 64-bit version or the 32-bit version.

If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
ILP32, so the layout should match the 32-bit version, and the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
the compatibility wrapper::

    333   64       xyzzy     sys_xyzzy
    ...
    555   x32      xyzzy     __x32_compat_sys_xyzzy

If no pointers are involved, then it is preferable to re-use the 64-bit system
call for the x32 ABI (and consequently the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).

In either case, you should check that the types involved in your argument
layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or
64-bit (-m64) equivalents.


System Calls Returning Elsewhere
--------------------------------

For most system calls, once the system call is complete the user program
continues exactly where it left off -- at the next instruction, with the
stack the same and most of the registers the same as before the system call,
and with the same virtual memory space.

However, a few system calls do things differently.  They might return to a
different location (``rt_sigreturn``) or change the memory space
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
of the program.

To allow for this, the kernel implementation of the system call may need to
save and restore additional registers to the kernel stack, allowing complete
control of where and how execution continues after the system call.

This is arch-specific, but typically involves defining assembly entry points
that save/restore additional registers and invoke the real system call entry
point.

For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::

    333   common   xyzzy     stub_xyzzy

The equivalent for 32-bit programs running on a 64-bit kernel is normally
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
with the corresponding syscall table adjustment in
``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy    stub32_xyzzy

If the system call needs a compatibility layer (as in the previous section)
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
of the system call rather than the native 64-bit version.  Also, if the x32 ABI
implementation is not common with the x86_64 version, then its syscall
table will also need to invoke a stub that calls on to the ``compat_sys_``
version.

For completeness, it's also nice to set up a mapping so that user-mode Linux
still works -- its syscall table will reference ``stub_xyzzy``, but the UML build
doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because UML
simulates registers etc).  Fixing this is as simple as adding a #define to
``arch/x86/um/sys_call_table_64.c``::

    #define stub_xyzzy sys_xyzzy


Other Details
-------------

Most of the kernel treats system calls in a generic way, but there is the
occasional exception that may need updating for your particular system call.

The audit subsystem is one such special case; it includes (arch-specific)
functions that classify some special types of system call -- specifically
file open (``open``/``openat``), program execution (``execve``/``execveat``) or
socket multiplexor (``socketcall``) operations. If your new system call is
analogous to one of these, then the audit system should be updated.

More generally, if there is an existing system call that is analogous to your
new system call, it's worth doing a kernel-wide grep for the existing system
call to check there are no other special cases.


Testing
-------

A new system call should obviously be tested; it is also useful to provide
reviewers with a demonstration of how user space programs will use the system
call.  A good way to combine these aims is to include a simple self-test
program in a new directory under ``tools/testing/selftests/``.

For a new system call, there will obviously be no libc wrapper function and so
the test will need to invoke it using ``syscall()``; also, if the system call
involves a new userspace-visible structure, the corresponding header will need
to be installed to compile the test.

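Invoking a system call without a libc wrapper looks roughly as follows.
This is a minimal sketch, assuming Linux and glibc-style headers; because
``xyzzy`` does not exist, ``SYS_getpid`` is used as a stand-in for the new
system call number, and both helper names are hypothetical::

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Call a system call by number; SYS_getpid stands in for __NR_xyzzy. */
    static long raw_getpid(void)
    {
        return syscall(SYS_getpid);
    }

    /* Return 1 if the running kernel reports ENOSYS for number nr. */
    static int syscall_is_missing(long nr)
    {
        errno = 0;
        return syscall(nr) == -1 && errno == ENOSYS;
    }

A selftest can use a check like ``syscall_is_missing()`` to skip, rather than
fail, when it runs on a kernel built without the new system call.
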
Make sure the selftest runs successfully on all supported architectures.  For
example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32)
and x32 (-mx32) ABI program.

For more extensive and thorough testing of new functionality, you should also
consider adding tests to the Linux Test Project, or to the xfstests project
for filesystem-related changes.

 - https://linux-test-project.github.io/
 - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


Man Page
--------

All new system calls should come with a complete man page, ideally using groff
markup, but plain text will do.  If groff is used, it's helpful to include a
pre-rendered ASCII version of the man page in the cover email for the
patchset, for the convenience of reviewers.

The man page should be cc'ed to linux-man@vger.kernel.org.
For more details, see https://www.kernel.org/doc/man-pages/patches.html


Do not call System Calls in the Kernel
--------------------------------------

System calls are, as stated above, interaction points between userspace and
the kernel.  Therefore, system call functions such as ``sys_xyzzy()`` or
``compat_sys_xyzzy()`` should only be called from userspace via the syscall
table, but not from elsewhere in the kernel.  If the syscall functionality is
useful within the kernel, needs to be shared between an old and a
new syscall, or needs to be shared between a syscall and its compatibility
variant, it should be implemented by means of a "helper" function (such as
``ksys_xyzzy()``).  This kernel function may then be called within the
syscall stub (``sys_xyzzy()``), the compatibility syscall stub
(``compat_sys_xyzzy()``), and/or other kernel code.

At least on 64-bit x86, not calling system call functions in the kernel has
been a hard requirement since v4.17.  That architecture uses a different calling
convention for system calls where ``struct pt_regs`` is decoded on-the-fly in a
syscall wrapper which then hands processing over to the actual syscall function.
This means that only those parameters which are actually needed for a specific
syscall are passed on during syscall entry, instead of filling in six CPU
registers with random user space content all the time (which may cause serious
trouble down the call chain).

Moreover, rules on how data may be accessed may differ between kernel data and
user data.  This is another reason why calling ``sys_xyzzy()`` is generally a
bad idea.

Exceptions to this rule are only allowed in architecture-specific overrides,
architecture-specific compatibility wrappers, or other code in arch/.


References and Sources
----------------------

 - LWN article from Michael Kerrisk on use of flags argument in system calls:
   https://lwn.net/Articles/585415/
 - LWN article from Michael Kerrisk on how to handle unknown flags in a system
   call: https://lwn.net/Articles/588444/
 - LWN article from Jake Edge describing constraints on 64-bit system call
   arguments: https://lwn.net/Articles/311630/
 - Pair of LWN articles from David Drysdale that describe the system call
   implementation paths in detail for v3.14:

    - https://lwn.net/Articles/604287/
    - https://lwn.net/Articles/604515/

 - Architecture-specific requirements for system calls are discussed in the
   :manpage:`syscall(2)` man-page:
   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
 - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
   https://yarchive.net/comp/linux/ioctl.html
 - "How to not invent kernel interfaces", Arnd Bergmann,
   https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
 - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
   https://lwn.net/Articles/486306/
 - Recommendation from Andrew Morton that all related information for a new
   system call should come in the same email thread:
   https://lore.kernel.org/r/20140724144747.3041b208832bbdf9fbce5d96@linux-foundation.org
 - Recommendation from Michael Kerrisk that a new system call should come with
   a man page: https://lore.kernel.org/r/CAKgNAkgMA39AfoSoA5Pe1r9N+ZzfYQNvNPvcRN7tOvRb8+v06Q@mail.gmail.com
 - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
   commit: https://lore.kernel.org/r/alpine.DEB.2.11.1411191249560.3909@nanos
 - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
   come with a man-page & selftest: https://lore.kernel.org/r/20140320025530.GA25469@kroah.com
 - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
   https://lore.kernel.org/r/CAHO5Pa3F2MjfTtfNxa8LbnkeeU8=YJ+9tDqxZpw7Gz59E-4AUg@mail.gmail.com
 - Suggestion from Ingo Molnar that system calls that involve multiple
   arguments should encapsulate those arguments in a struct, which includes a
   size field for future extensibility: https://lore.kernel.org/r/20150730083831.GA22182@gmail.com
 - Numbering oddities arising from (re-)use of O_* numbering space flags:

    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
      check")
    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
      conflict")
    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")

 - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
   https://lore.kernel.org/r/20081212152929.GM26095@parisc-linux.org
 - Recommendation from Greg Kroah-Hartman that unknown flags should be
   policed: https://lore.kernel.org/r/20140717193330.GB4703@kroah.com
 - Recommendation from Linus Torvalds that x32 system calls should prefer
   compatibility with 64-bit versions rather than 32-bit versions:
   https://lore.kernel.org/r/CA+55aFxfmwfB7jbbrXxa=K7VBYPfAvmu3XOkGrLbB1UFjX1+Ew@mail.gmail.com