.. SPDX-License-Identifier: GPL-2.0

====
FUSE
====

Definitions
===========

Userspace filesystem:
  A filesystem in which data and metadata are provided by an ordinary
  userspace process.  The filesystem can be accessed normally through
  the kernel interface.

Filesystem daemon:
  The process(es) providing the data and metadata of the filesystem.

Non-privileged mount (or user mount):
  A userspace filesystem mounted by a non-privileged (non-root) user.
  The filesystem daemon is running with the privileges of the mounting
  user.  NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.

Filesystem connection:
  A connection between the filesystem daemon and the kernel.  The
  connection exists until either the daemon dies, or the filesystem is
  umounted.  Note that detaching (or lazy umounting) the filesystem
  does *not* break the connection; in this case it will exist until
  the last reference to the filesystem is released.

Mount owner:
  The user who does the mounting.

User:
  The user who is performing filesystem operations.

What is FUSE?
=============

FUSE is a userspace filesystem framework.  It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).

One of the most important features of FUSE is allowing secure,
non-privileged mounts.  This opens up new possibilities for the use of
filesystems.  A good example is sshfs: a secure network filesystem
using the sftp protocol.

The userspace library and utilities are available from the
`FUSE homepage <https://github.com/libfuse/>`_.

Filesystem type
===============

The filesystem type given to mount(2) can be one of the following:

    fuse
      This is the usual way to mount a FUSE filesystem.  The first
      argument of the mount system call may contain an arbitrary string,
      which is not interpreted by the kernel.

    fuseblk
      The filesystem is block device based.  The first argument of the
      mount system call is interpreted as the name of the device.

Mount options
=============

fd=N
  The file descriptor to use for communication between the userspace
  filesystem and the kernel.  The file descriptor must have been
  obtained by opening the FUSE device ('/dev/fuse').

rootmode=M
  The file mode of the filesystem's root in octal representation.

user_id=N
  The numeric user id of the mount owner.

group_id=N
  The numeric group id of the mount owner.

default_permissions
  By default FUSE doesn't check file access permissions; the
  filesystem is free to implement its access policy or leave it to
  the underlying file access mechanism (e.g. in case of network
  filesystems).  This option enables permission checking, restricting
  access based on file mode.  It is usually useful together with the
  'allow_other' mount option.

allow_other
  This option overrides the security measure restricting file access
  to the user mounting the filesystem.  This option is by default only
  allowed to root, but this restriction can be removed with a
  (userspace) configuration option.

max_read=N
  With this option the maximum size of read operations can be set.
  The default is infinite.  Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).

blksize=N
  Set the block size for the filesystem.  The default is 512.  This
  option is only valid for 'fuseblk' type mounts.

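As an illustration of how the options above fit together, the sketch
below formats 'fd', 'rootmode', 'user_id' and 'group_id' into the
comma-separated string that is passed as the data argument of
mount(2).  The helper name 'fuse_format_opts' is hypothetical and is
not part of the kernel or of libfuse.

.. code-block:: c

  #include <assert.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>

  /*
   * Hypothetical helper: write "fd=N,rootmode=M,user_id=N,group_id=N"
   * into buf.  rootmode is printed in octal, as the option requires.
   * Returns 0 on success, -1 if buf is too small.
   */
  static int fuse_format_opts(char *buf, size_t len, int fd,
                              unsigned int rootmode, uid_t uid, gid_t gid)
  {
      int n = snprintf(buf, len,
                       "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
                       fd, rootmode, (unsigned int)uid, (unsigned int)gid);

      return (n < 0 || (size_t)n >= len) ? -1 : 0;
  }

A privileged caller could then hand the string to the kernel with
something like mount("myfs", mountpoint, "fuse", flags, buf), where
"myfs" is the arbitrary first argument described under 'Filesystem
type'.
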
Control filesystem
==================

There's a control filesystem for FUSE, which can be mounted by::

  mount -t fusectl none /sys/fs/fuse/connections

Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.

Under the fuse control filesystem each connection has a directory
named by a unique number.

For each connection the following files exist within this directory:

    waiting
      The number of requests which are waiting to be transferred to
      userspace or being processed by the filesystem daemon.  If there is
      no filesystem activity and 'waiting' is non-zero, then the
      filesystem is hung or deadlocked.

    abort
      Writing anything into this file will abort the filesystem
      connection.  This means that all waiting requests will be aborted and
      an error returned for all aborted and new requests.

Only the owner of the mount may read or write these files.

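The files above can be used from a small monitoring tool.  The
following is a minimal sketch, assuming the control filesystem is
mounted at the path shown earlier and that 'waiting' reads as a
decimal number.  All helper names are hypothetical.

.. code-block:: c

  #include <assert.h>
  #include <stdio.h>
  #include <string.h>

  #define FUSECTL_ROOT "/sys/fs/fuse/connections"

  /* Build "<root>/<conn>/<file>"; returns 0, or -1 if buf is too small. */
  static int fusectl_path(char *buf, size_t len, const char *root,
                          unsigned int conn, const char *file)
  {
      int n = snprintf(buf, len, "%s/%u/%s", root, conn, file);

      return (n < 0 || (size_t)n >= len) ? -1 : 0;
  }

  /* Read the 'waiting' counter from path into *out; 0 on success. */
  static int fusectl_read_waiting(const char *path, unsigned long *out)
  {
      FILE *f = fopen(path, "r");
      int ok;

      if (!f)
          return -1;
      ok = fscanf(f, "%lu", out) == 1;
      fclose(f);
      return ok ? 0 : -1;
  }

  /* Abort a connection: the content written does not matter. */
  static int fusectl_abort(const char *path)
  {
      FILE *f = fopen(path, "w");

      if (!f)
          return -1;
      if (fputs("1\n", f) == EOF) {
          fclose(f);
          return -1;
      }
      return fclose(f) == 0 ? 0 : -1;
  }

A tool built on this would typically poll 'waiting' for a connection
and only write to 'abort' once the count has stayed non-zero with no
filesystem activity.
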
Interrupting filesystem operations
==================================

If a process issuing a FUSE filesystem request is interrupted, the
following will happen:

  -  If the request is not yet sent to userspace AND the signal is
     fatal (SIGKILL or unhandled fatal signal), then the request is
     dequeued and returns immediately.

  -  If the request is not yet sent to userspace AND the signal is not
     fatal, then an interrupted flag is set for the request.  When
     the request has been successfully transferred to userspace and
     this flag is set, an INTERRUPT request is queued.

  -  If the request is already sent to userspace, then an INTERRUPT
     request is queued.

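The three cases above amount to a small decision table.  The sketch
below restates it as a function.  It is a model for illustration only,
not the kernel's code, and the enum and function names are made up.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  enum intr_action {
      INTR_DEQUEUE_AND_RETURN, /* request never reaches userspace */
      INTR_SET_FLAG,           /* INTERRUPT is queued after the transfer */
      INTR_QUEUE_INTERRUPT,    /* INTERRUPT is queued right away */
  };

  /* Model of what happens when the requesting process gets a signal. */
  static enum intr_action on_signal(bool sent_to_userspace, bool fatal)
  {
      if (sent_to_userspace)
          return INTR_QUEUE_INTERRUPT;
      return fatal ? INTR_DEQUEUE_AND_RETURN : INTR_SET_FLAG;
  }
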
INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the *original* request, with
the error set to EINTR.

It is also possible that there's a race between processing the
original request and its INTERRUPT request.  There are two possibilities:

  1. The INTERRUPT request is processed before the original request is
     processed

  2. The INTERRUPT request is processed after the original request has
     been answered

If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error.  In case
1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
reply will be ignored.

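A filesystem daemon that honors INTERRUPT requests needs some record
of the requests it is currently serving.  The following toy model
shows the lookup and the two possible replies.  It assumes each
request carries a unique 64-bit id and that errors are reported as
negative errno values.  The table layout and all names are
hypothetical, and the wait before replying EAGAIN is left out.

.. code-block:: c

  #include <assert.h>
  #include <errno.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct inflight {
      uint64_t unique; /* id of a request being served */
      bool busy;       /* slot in use, not yet answered */
  };

  enum intr_reply {
      REPLY_ORIGINAL_EINTR,   /* answer the original request */
      REPLY_INTERRUPT_EAGAIN, /* answer the INTERRUPT request itself */
  };

  /*
   * Decide how to answer an INTERRUPT aimed at request 'target'.
   * A real daemon would wait for a timeout and/or new requests
   * before giving up with EAGAIN; this model gives up immediately.
   */
  static enum intr_reply handle_interrupt(struct inflight *tab, size_t n,
                                          uint64_t target, int *error)
  {
      for (size_t i = 0; i < n; i++) {
          if (tab[i].busy && tab[i].unique == target) {
              tab[i].busy = false;
              *error = -EINTR;
              return REPLY_ORIGINAL_EINTR;
          }
      }
      *error = -EAGAIN;
      return REPLY_INTERRUPT_EAGAIN;
  }
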
Aborting a filesystem connection
================================

It is possible to get into certain situations where the filesystem is
not responding.  Reasons for this may be:

  a) Broken userspace filesystem implementation

  b) Network connection down

  c) Accidental deadlock

  d) Malicious deadlock

(For more on c) and d) see later sections)

In any of these cases it may be useful to abort the connection to
the filesystem.  There are several ways to do this:

  - Kill the filesystem daemon.  Works in case of a) and b)

  - Kill the filesystem daemon and all users of the filesystem.  Works
    in all cases except some malicious deadlocks

  - Use forced umount (umount -f).  Works in all cases but only if
    filesystem is still attached (it hasn't been lazy unmounted)

  - Abort filesystem through the FUSE control filesystem.  Most
    powerful method, always works.

How do non-privileged mounts work?
==================================

Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.

The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system.  Obvious requirements arising from this are:

 A) mount owner should not be able to get elevated privileges with the
    help of the mounted filesystem

 B) mount owner should not get illegitimate access to information from
    other users' and the super user's processes

 C) mount owner should not be able to induce undesired behavior in
    other users' or the super user's processes

How are requirements fulfilled?
===============================

 A) The mount owner could gain elevated privileges by either:

    1. creating a filesystem containing a device file, then opening this device

    2. creating a filesystem containing a suid or sgid application, then executing this application

    The solution is to disallow opening device files and to ignore
    setuid and setgid bits when executing programs.  To ensure this
    fusermount always adds "nosuid" and "nodev" to the mount options
    for non-privileged mounts.

 B) If another user is accessing files or directories in the
    filesystem, the filesystem daemon serving requests can record the
    exact sequence and timing of operations performed.  This
    information is otherwise inaccessible to the mount owner, so this
    counts as an information leak.

    The solution to this problem will be presented in point 2) of C).

 C) There are several ways in which the mount owner can induce
    undesired behavior in other users' processes, such as:

     1) mounting a filesystem over a file or directory which the mount
        owner would otherwise not be able to modify (or could only
        make limited modifications).

        This is solved in fusermount, by checking the access
        permissions on the mountpoint and only allowing the mount if
        the mount owner can do unlimited modification (has write
        access to the mountpoint, and mountpoint is not a "sticky"
        directory)

     2) Even if 1) is solved the mount owner can change the behavior
        of other users' processes.

         i) It can slow down or indefinitely delay the execution of a
            filesystem operation creating a DoS against the user or the
            whole system.  For example, a suid application that locks a
            system file and then accesses a file on the mount owner's
            filesystem could be stopped, causing the system file to
            remain locked forever.

         ii) It can present files or directories of unlimited length, or
             directory structures of unlimited depth, possibly causing a
             system process to eat up diskspace, memory or other
             resources, again causing *DoS*.

        The solution to this as well as B) is to deny filesystem
        access to processes which the mount owner could not otherwise
        monitor or manipulate.  If the mount owner can ptrace a
        process, it can do all of the above without using a FUSE
        mount, so the same criteria as used in ptrace can be used to
        check if a process is allowed to access the filesystem or not.

        Note that the *ptrace* check is not strictly necessary to
        prevent C/2/i; it is enough to check if the mount owner has
        enough privilege to send a signal to the process accessing the
        filesystem, since *SIGSTOP* can be used to get a similar effect.

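Points A) and C/1 can be summarised in a few lines of code.  The
sketch below is not fusermount's implementation, which performs more
checks than shown here.  It only restates the two rules from the text,
and both function names are hypothetical.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <sys/mount.h>
  #include <sys/stat.h>

  /* A): non-privileged mounts always get "nosuid" and "nodev". */
  static unsigned long unprivileged_mount_flags(unsigned long requested)
  {
      return requested | MS_NOSUID | MS_NODEV;
  }

  /*
   * C/1: accept the mountpoint only if the mount owner has write
   * access to it and it is not a "sticky" directory.  The caller is
   * assumed to have determined owner_can_write, e.g. with access(2).
   */
  static bool mountpoint_acceptable(const struct stat *st,
                                    bool owner_can_write)
  {
      bool sticky_dir = S_ISDIR(st->st_mode) && (st->st_mode & S_ISVTX);

      return owner_can_write && !sticky_dir;
  }

With these rules a world-writable sticky directory such as a typical
/tmp is rejected as a mountpoint, while a directory owned by and
writable for the mount owner is accepted.
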
I think these limitations are unacceptable?
===========================================

If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
mounts, they can relax the last limitation in several ways:

  - With the 'user_allow_other' config option. If this config option is
    set, the mounting user can add the 'allow_other' mount option which
    disables the check for other users' processes.

    User namespaces have an unintuitive interaction with 'allow_other':
    an unprivileged user - normally restricted from mounting with
    'allow_other' - could do so in a user namespace where they're
    privileged. If any process could access such an 'allow_other' mount
    this would give the mounting user the ability to manipulate
    processes in user namespaces where they're unprivileged. For this
    reason 'allow_other' restricts access to users in the same userns
    or a descendant.

  - With the 'allow_sys_admin_access' module option. If this option is
    set, super user's processes have unrestricted access to mounts
    irrespective of allow_other setting or user namespace of the
    mounting user.

Note that both of these relaxations expose the system to potential
information leak or *DoS* as described in points B and C/2/i-ii in the
preceding section.

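The interaction between 'allow_other', user namespaces and
'allow_sys_admin_access' can be modelled as follows.  This is a
simplified sketch and not the kernel's code: a user namespace is
reduced to a parent pointer, the ptrace-style check is approximated
by a plain uid comparison, and all type and function names are
hypothetical.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct userns {
      const struct userns *parent; /* NULL for the initial userns */
  };

  struct mount_model {
      const struct userns *ns;     /* userns of the mounting user */
      unsigned int owner_uid;
      bool allow_other;
  };

  struct proc_model {
      const struct userns *ns;
      unsigned int uid;
      bool sys_admin_in_init_ns;   /* CAP_SYS_ADMIN in the initial userns */
  };

  /* True if ns is 'ancestor' itself or one of its descendants. */
  static bool in_ns_or_descendant(const struct userns *ns,
                                  const struct userns *ancestor)
  {
      for (; ns; ns = ns->parent)
          if (ns == ancestor)
              return true;
      return false;
  }

  static bool may_access(const struct mount_model *m,
                         const struct proc_model *p,
                         bool allow_sys_admin_access)
  {
      bool allow;

      if (m->allow_other)
          allow = in_ns_or_descendant(p->ns, m->ns);
      else
          allow = p->uid == m->owner_uid; /* stand-in for the real check */

      if (!allow && allow_sys_admin_access && p->sys_admin_in_init_ns)
          allow = true;
      return allow;
  }

In this model a root process in the initial userns is denied access to
an 'allow_other' mount made in a child userns, unless
'allow_sys_admin_access' is set.
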
Kernel - userspace interface
============================

The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE. ::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |

.. note:: Everything in the description above is greatly simplified

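On the daemon side, the read() in the diagram delivers a request
header and the write() sends back a reply header carrying the same
request id.  The sketch below builds such a reply, assuming the
structures from the UAPI header <linux/fuse.h>.  The helper
'make_reply' is hypothetical, and a real daemon would normally go
through libfuse instead.

.. code-block:: c

  #include <assert.h>
  #include <errno.h>
  #include <stdint.h>
  #include <string.h>
  #include <linux/fuse.h>

  /*
   * Build a header-only reply for the request described by 'in'.
   * 'err' is 0 for success or a positive errno value; the protocol
   * carries it negated.
   */
  static struct fuse_out_header make_reply(const struct fuse_in_header *in,
                                           int err)
  {
      struct fuse_out_header out;

      memset(&out, 0, sizeof(out));
      out.len = sizeof(out);    /* total reply length, header included */
      out.error = -err;
      out.unique = in->unique;  /* lets the kernel find the request */
      return out;
  }

For the unlink in the diagram a successful reply consists of the
header alone, so the daemon would write exactly 'out.len' bytes to
its '/dev/fuse' file descriptor.
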
There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.

**Scenario 1 - Simple deadlock**::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

The solution for this is to allow the filesystem to be aborted.

**Scenario 2 - Tricky deadlock**


This one needs a carefully crafted filesystem.  It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault. ::

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

The solution is basically the same as above.

An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted.  This is
because the destination address of the copy may not be valid after the
request has returned.

This is solved by doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages().  The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.