Christian Brauner <brauner@kernel.org> says:
Currently, it isn't possible to change the idmapping of an idmapped
mount. This is becoming an obstacle for various use-cases.
/* idmapped home directories with systemd-homed */
On newer systems /home can be an idmapped mount such that each file
on disk is owned by 65536 and a subfolder exists for foreign id ranges
such as containers. For example, a home directory might look like this
(using an arbitrary folder as an example):
user1@localhost:~/data/mount-idmapped$ ls -al /data/
total 16
drwxrwxrwx 1 65536 65536 36 Jan 27 12:15 .
drwxrwxr-x 1 root root 184 Jan 27 12:06 ..
-rw-r--r-- 1 65536 65536 0 Jan 27 12:07 aaa
-rw-r--r-- 1 65536 65536 0 Jan 27 12:07 bbb
-rw-r--r-- 1 65536 65536 0 Jan 27 12:07 cc
drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers
When logging in home is mounted as an idmapped mount with the following
idmappings:
65536:$(id -u):1 // uid mapping
65536:$(id -g):1 // gid mapping
2147352576:2147352576:65536 // uid mapping
2147352576:2147352576:65536 // gid mapping
So for a user with uid/gid 1000 an idmapped /home would look like this:
user1@localhost:~/data/mount-idmapped$ ls -aln /mnt/
total 16
drwxrwxrwx 1 1000 1000 36 Jan 27 12:15 .
drwxrwxr-x 1 0 0 184 Jan 27 12:06 ..
-rw-r--r-- 1 1000 1000 0 Jan 27 12:07 aaa
-rw-r--r-- 1 1000 1000 0 Jan 27 12:07 bbb
-rw-r--r-- 1 1000 1000 0 Jan 27 12:07 cc
drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers
In other words, 65536 is mapped to the user's uid/gid and the range
2147352576 up to 2147352576 + 65536 is an identity mapping for
containers.
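The mapping table above can be read as a list of extents of the form
"on-disk start : id seen through the mount : count". The following is a
minimal sketch of that lookup, not the kernel's implementation; the names
idmap_extent, disk_to_mnt, home_uid_map and ID_UNMAPPED are made up for
illustration, and the table hardcodes the uid 1000 example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent type for illustration: disk_start:mnt_start:count. */
struct idmap_extent {
	uint32_t disk_start;	/* first id as stored on disk */
	uint32_t mnt_start;	/* first id as seen through the mount */
	uint32_t count;		/* number of consecutive ids covered */
};

/* Illustrative marker for "no extent covers this id". */
#define ID_UNMAPPED UINT32_MAX

/* Translate an on-disk id to the id visible through the idmapped mount. */
static uint32_t disk_to_mnt(const struct idmap_extent *map, size_t n,
			    uint32_t disk_id)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (disk_id >= map[i].disk_start &&
		    disk_id - map[i].disk_start < map[i].count)
			return map[i].mnt_start + (disk_id - map[i].disk_start);
	}
	return ID_UNMAPPED;
}

/* The uid mappings from the example above, for a user with uid 1000. */
static const struct idmap_extent home_uid_map[] = {
	{ 65536,       1000,        1     },	/* 65536:$(id -u):1 */
	{ 2147352576u, 2147352576u, 65536 },	/* identity range for containers */
};
```

With this table, disk_to_mnt(home_uid_map, 2, 65536) yields 1000, ids in the
foreign range are returned unchanged, and everything else is unmapped.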
When a container is started a transient uid/gid range is allocated
outside of both mappings of the idmapped mount. For example, the
container might get the idmapping:
$ cat /proc/1742611/uid_map
0 537985024 65536
This container will be allowed to write to disk within the allocated
foreign id range 2147352576 to 2147352576 + 65536. To do this an
idmapped mount must be created from an already idmapped mount such that:
- The mappings for the user's uid/gid must be dropped, i.e., the
following mappings are removed:
65536:$(id -u):1 // uid mapping
65536:$(id -g):1 // gid mapping
- A mapping for the transient uid/gid range to the foreign uid/gid range
is added:
2147352576:537985024:65536
In combination this will mean that the container will write to disk
within the foreign id range 2147352576 to 2147352576 + 65536.
/* nested containers */
When the outer container makes use of idmapped mounts it isn't possible
to create an idmapped mount for the inner container with a different
idmapping from the outer container's idmapped mount.
There are other usecases and the two above just serve as an illustration
of the problem.
This patchset makes it possible to create a new idmapped mount from an
already idmapped mount. It aims to adhere to current performance
constraints and requirements:
- Idmapped mounts aim to have near zero performance implications for
path lookup. That is why no reference counting, locking or any other
mechanism can be required that would impact performance.
This works by ensuring that a regular mount transitions to an idmapped
mount once going from a static nop_mnt_idmap mapping to a non-static
idmapping.
- The idmapping of a mount cannot change anymore for the lifetime of the
mount afterwards. This not only avoids UAF issues, it also avoids
pitfalls such as generating non-matching uid/gid values.
Changing idmappings could be solved by:
- Idmappings could simply be reference counted (above the simple
reference count when sharing them across multiple mounts).
This would require pairing mnt_idmap_get() with mnt_idmap_put() which
would end up being sprinkled everywhere into the VFS and some
filesystems that access idmappings directly.
It wouldn't just be quite ugly and introduce new complexity, it would
also have a noticeable performance impact.
- Idmappings could gain RCU protection. This would help the LOOKUP_RCU
case and avoids taking reference counts under RCU.
When not under LOOKUP_RCU reference counts need to be acquired on each
idmapping. This would require pairing mnt_idmap_get() with
mnt_idmap_put() which would end up being sprinkled everywhere into the
VFS and some filesystems that access idmappings directly.
This would have the same downsides as mentioned earlier.
- The earlier solutions work by updating the mnt->mnt_idmap pointer with
the new idmapping. Instead of this it would be possible to change the
idmapping itself to avoid UAF issues.
To do this a sequence counter would have to be added to struct mount.
When retrieving the idmapping to generate uid/gid values the sequence
counter would need to be sampled and the generation of the uid/gid
would spin until the update of the idmap is finished.
This has problems as well but the biggest issue will be that this can
lead to inconsistent permission checking and inconsistent uid/gid
pairs even more than this is already possible today. Specifically,
during creation it could happen that:
idmap = mnt_idmap(mnt);
inode_permission(idmap, ...);
may_create(idmap);
// create file with uid/gid based on @idmap
in between the permission checking and the generation of the uid/gid
value the idmapping could change leading to the permission checking
and uid/gid value that is actually used to create a file on disk being
out of sync.
Similarly if two values are generated like:
idmap = mnt_idmap(mnt)
vfsgid = make_vfsgid(idmap);
// idmapping gets updated concurrently
vfsuid = make_vfsuid(idmap);
@vfsgid and @vfsuid could be out of sync if the idmapping was changed
in between. The generation of vfsgid/vfsuid could span a lot of
codelines so to guard against this a sequence count would have to be
passed around.
The performance impact of this solution is less clear but very likely
not zero.
- Using SRCU, which can sleep, similar to fanotify. I find that not just
ugly, it would also have memory consumption implications.
/* solution */
So, to avoid all of these pitfalls creating an idmapped mount from an
already idmapped mount will be done atomically, i.e., a new detached
mount is created and a new set of mount properties applied to it without
it ever having been exposed to userspace at all.
This can be done in two ways. One option is a new open_tree() flag,
OPEN_TREE_CLEAR_IDMAP, that clears the old idmapping and returns a mount
that isn't idmapped. It is then possible to set mount attributes on it
again, including the creation of an idmapped mount.
This has the consequence that a file descriptor must exist in userspace
that doesn't have any idmapping applied, so it will never work in
unprivileged scenarios: a container would be able to remove the
idmapping of the mount it has been given. That should be avoided.
Instead, we add open_tree_attr() which works just like open_tree() but
takes an optional struct mount_attr parameter. This is useful beyond
idmappings as it fills a gap where a mount never exists in userspace
without the necessary mount properties applied.
This is particularly useful for mount options such as
MOUNT_ATTR_{RDONLY,NOSUID,NODEV,NOEXEC}.
To create a new idmapped mount the following works:
// Create a first idmapped mount
struct mount_attr attr = {
	.attr_set = MOUNT_ATTR_IDMAP,
	.userns_fd = fd_userns,
};
fd_tree = open_tree_attr(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr));
move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
// Create a second idmapped mount from the first idmapped mount
attr.attr_set = MOUNT_ATTR_IDMAP;
attr.userns_fd = fd_userns2;
fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr));
// Create a non-idmapped mount from the first idmapped mount:
memset(&attr, 0, sizeof(attr));
attr.attr_clr = MOUNT_ATTR_IDMAP;
fd_tree3 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr));
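For userspace that has no libc wrapper yet, the call can be issued through
syscall(2). The following is a sketch under assumptions: the fallback
syscall number 467 is assumed to be the value used by architectures sharing
the common syscall table, and the helper names sys_open_tree_attr and
clone_idmapped_rdonly are made up here. It also shows the point made above
about MOUNT_ATTR_RDONLY, namely that further attributes can be applied in
the same call so the detached mount never exists without them:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>		/* O_CLOEXEC, used by OPEN_TREE_CLOEXEC */
#include <linux/mount.h>	/* struct mount_attr, MOUNT_ATTR_*, OPEN_TREE_* */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: number used when the installed headers do not define it yet. */
#ifndef __NR_open_tree_attr
#define __NR_open_tree_attr 467
#endif

/* Thin raw-syscall wrapper; returns a detached mount fd or -1 with errno set. */
static int sys_open_tree_attr(int dfd, const char *path, unsigned int flags,
			      struct mount_attr *attr, size_t size)
{
	return (int)syscall(__NR_open_tree_attr, dfd, path, flags, attr, size);
}

/* Clone @path as a detached mount that is idmapped to @fd_userns and read-only. */
static int clone_idmapped_rdonly(const char *path, int fd_userns)
{
	struct mount_attr attr = {
		.attr_set  = MOUNT_ATTR_IDMAP | MOUNT_ATTR_RDONLY,
		.userns_fd = (__u64)fd_userns,
	};

	assert(path != NULL);
	return sys_open_tree_attr(-EBADF, path,
				  OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC,
				  &attr, sizeof(attr));
}
```

On kernels without the syscall the wrapper is expected to fail with ENOSYS,
so callers should check the return value and errno before using the fd with
move_mount().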
* patches from https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org:
fs: allow changing idmappings
fs: add kflags member to struct mount_kattr
fs: add open_tree_attr()
fs: add copy_mount_setattr() helper
fs: add vfs_open_tree() helper
Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
235 lines
9.3 KiB
C
#ifndef _UAPI_LINUX_MOUNT_H
#define _UAPI_LINUX_MOUNT_H

#include <linux/types.h>

/*
 * These are the fs-independent mount-flags: up to 32 flags are supported
 *
 * Usage of these is restricted within the kernel to core mount(2) code and
 * callers of sys_mount() only. Filesystems should be using the SB_*
 * equivalent instead.
 */
#define MS_RDONLY 1 /* Mount read-only */
#define MS_NOSUID 2 /* Ignore suid and sgid bits */
#define MS_NODEV 4 /* Disallow access to device special files */
#define MS_NOEXEC 8 /* Disallow program execution */
#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
#define MS_NOSYMFOLLOW 256 /* Do not follow symlinks */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
#define MS_MOVE 8192
#define MS_REC 16384
#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
			    MS_VERBOSE is deprecated. */
#define MS_SILENT 32768
#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
#define MS_UNBINDABLE (1<<17) /* change to unbindable */
#define MS_PRIVATE (1<<18) /* change to private */
#define MS_SLAVE (1<<19) /* change to slave */
#define MS_SHARED (1<<20) /* change to shared */
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */

/* These sb flags are internal to the kernel */
#define MS_SUBMOUNT (1<<26)
#define MS_NOREMOTELOCK (1<<27)
#define MS_NOSEC (1<<28)
#define MS_BORN (1<<29)
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

/*
 * Superblock flags that can be altered by MS_REMOUNT
 */
#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
		     MS_LAZYTIME)

/*
 * Old magic mount flag and mask
 */
#define MS_MGC_VAL 0xC0ED0000
#define MS_MGC_MSK 0xffff0000

/*
 * open_tree() flags.
 */
#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */

/*
 * move_mount() flags.
 */
#define MOVE_MOUNT_F_SYMLINKS 0x00000001 /* Follow symlinks on from path */
#define MOVE_MOUNT_F_AUTOMOUNTS 0x00000002 /* Follow automounts on from path */
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
#define MOVE_MOUNT_T_SYMLINKS 0x00000010 /* Follow symlinks on to path */
#define MOVE_MOUNT_T_AUTOMOUNTS 0x00000020 /* Follow automounts on to path */
#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040 /* Empty to path permitted */
#define MOVE_MOUNT_SET_GROUP 0x00000100 /* Set sharing group instead */
#define MOVE_MOUNT_BENEATH 0x00000200 /* Mount beneath top mount */
#define MOVE_MOUNT__MASK 0x00000377

/*
 * fsopen() flags.
 */
#define FSOPEN_CLOEXEC 0x00000001

/*
 * fspick() flags.
 */
#define FSPICK_CLOEXEC 0x00000001
#define FSPICK_SYMLINK_NOFOLLOW 0x00000002
#define FSPICK_NO_AUTOMOUNT 0x00000004
#define FSPICK_EMPTY_PATH 0x00000008

/*
 * The type of fsconfig() call made.
 */
enum fsconfig_command {
	FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
	FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
	FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
	FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
	FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
	FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
	FSCONFIG_CMD_CREATE = 6, /* Create new or reuse existing superblock */
	FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */
	FSCONFIG_CMD_CREATE_EXCL = 8, /* Create new superblock, fail if reusing existing superblock */
};

/*
 * fsmount() flags.
 */
#define FSMOUNT_CLOEXEC 0x00000001

/*
 * Mount attributes.
 */
#define MOUNT_ATTR_RDONLY 0x00000001 /* Mount read-only */
#define MOUNT_ATTR_NOSUID 0x00000002 /* Ignore suid and sgid bits */
#define MOUNT_ATTR_NODEV 0x00000004 /* Disallow access to device special files */
#define MOUNT_ATTR_NOEXEC 0x00000008 /* Disallow program execution */
#define MOUNT_ATTR__ATIME 0x00000070 /* Setting on how atime should be updated */
#define MOUNT_ATTR_RELATIME 0x00000000 /* - Update atime relative to mtime/ctime. */
#define MOUNT_ATTR_NOATIME 0x00000010 /* - Do not update access times. */
#define MOUNT_ATTR_STRICTATIME 0x00000020 /* - Always perform atime updates */
#define MOUNT_ATTR_NODIRATIME 0x00000080 /* Do not update directory access times */
#define MOUNT_ATTR_IDMAP 0x00100000 /* Idmap mount to @userns_fd in struct mount_attr. */
#define MOUNT_ATTR_NOSYMFOLLOW 0x00200000 /* Do not follow symlinks */

/*
 * mount_setattr()
 */
struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};

/* List of all mount_attr versions. */
#define MOUNT_ATTR_SIZE_VER0 32 /* sizeof first published struct */


/*
 * Structure for getting mount/superblock/filesystem info with statmount(2).
 *
 * The interface is similar to statx(2): individual fields or groups can be
 * selected with the @mask argument of statmount(). Kernel will set the @mask
 * field according to the supported fields.
 *
 * If string fields are selected, then the caller needs to pass a buffer that
 * has space after the fixed part of the structure. Nul terminated strings are
 * copied there and offsets relative to @str are stored in the relevant fields.
 * If the buffer is too small, then EOVERFLOW is returned. The actually used
 * size is returned in @size.
 */
struct statmount {
	__u32 size; /* Total size, including strings */
	__u32 mnt_opts; /* [str] Options (comma separated, escaped) */
	__u64 mask; /* What results were written */
	__u32 sb_dev_major; /* Device ID */
	__u32 sb_dev_minor;
	__u64 sb_magic; /* ..._SUPER_MAGIC */
	__u32 sb_flags; /* SB_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
	__u32 fs_type; /* [str] Filesystem type */
	__u64 mnt_id; /* Unique ID of mount */
	__u64 mnt_parent_id; /* Unique ID of parent (for root == mnt_id) */
	__u32 mnt_id_old; /* Reused IDs used in proc/.../mountinfo */
	__u32 mnt_parent_id_old;
	__u64 mnt_attr; /* MOUNT_ATTR_... */
	__u64 mnt_propagation; /* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
	__u64 mnt_peer_group; /* ID of shared peer group */
	__u64 mnt_master; /* Mount receives propagation from this ID */
	__u64 propagate_from; /* Propagation from in current namespace */
	__u32 mnt_root; /* [str] Root of mount relative to root of fs */
	__u32 mnt_point; /* [str] Mountpoint relative to current root */
	__u64 mnt_ns_id; /* ID of the mount namespace */
	__u32 fs_subtype; /* [str] Subtype of fs_type (if any) */
	__u32 sb_source; /* [str] Source string of the mount */
	__u32 opt_num; /* Number of fs options */
	__u32 opt_array; /* [str] Array of nul terminated fs options */
	__u32 opt_sec_num; /* Number of security options */
	__u32 opt_sec_array; /* [str] Array of nul terminated security options */
	__u64 supported_mask; /* Mask flags that this kernel supports */
	__u32 mnt_uidmap_num; /* Number of uid mappings */
	__u32 mnt_uidmap; /* [str] Array of uid mappings (as seen from callers namespace) */
	__u32 mnt_gidmap_num; /* Number of gid mappings */
	__u32 mnt_gidmap; /* [str] Array of gid mappings (as seen from callers namespace) */
	__u64 __spare2[43];
	char str[]; /* Variable size part containing strings */
};

/*
 * Structure for passing mount ID and miscellaneous parameters to statmount(2)
 * and listmount(2).
 *
 * For statmount(2) @param represents the request mask.
 * For listmount(2) @param represents the last listed mount id (or zero).
 */
struct mnt_id_req {
	__u32 size;
	__u32 spare;
	__u64 mnt_id;
	__u64 param;
	__u64 mnt_ns_id;
};

/* List of all mnt_id_req versions. */
#define MNT_ID_REQ_SIZE_VER0 24 /* sizeof first published struct */
#define MNT_ID_REQ_SIZE_VER1 32 /* sizeof second published struct */

/*
 * @mask bits for statmount(2)
 */
#define STATMOUNT_SB_BASIC 0x00000001U /* Want/got sb_... */
#define STATMOUNT_MNT_BASIC 0x00000002U /* Want/got mnt_... */
#define STATMOUNT_PROPAGATE_FROM 0x00000004U /* Want/got propagate_from */
#define STATMOUNT_MNT_ROOT 0x00000008U /* Want/got mnt_root */
#define STATMOUNT_MNT_POINT 0x00000010U /* Want/got mnt_point */
#define STATMOUNT_FS_TYPE 0x00000020U /* Want/got fs_type */
#define STATMOUNT_MNT_NS_ID 0x00000040U /* Want/got mnt_ns_id */
#define STATMOUNT_MNT_OPTS 0x00000080U /* Want/got mnt_opts */
#define STATMOUNT_FS_SUBTYPE 0x00000100U /* Want/got fs_subtype */
#define STATMOUNT_SB_SOURCE 0x00000200U /* Want/got sb_source */
#define STATMOUNT_OPT_ARRAY 0x00000400U /* Want/got opt_... */
#define STATMOUNT_OPT_SEC_ARRAY 0x00000800U /* Want/got opt_sec... */
#define STATMOUNT_SUPPORTED_MASK 0x00001000U /* Want/got supported mask flags */
#define STATMOUNT_MNT_UIDMAP 0x00002000U /* Want/got uidmap... */
#define STATMOUNT_MNT_GIDMAP 0x00004000U /* Want/got gidmap... */

/*
 * Special @mnt_id values that can be passed to listmount
 */
#define LSMT_ROOT 0xffffffffffffffff /* root mount */
#define LISTMOUNT_REVERSE (1 << 0) /* List later mounts first */

#endif /* _UAPI_LINUX_MOUNT_H */