forked from mirrors/linux
2237 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6669d1aaa0 |
mm: remove WARN_ON_ONCE() in file_has_valid_mmap_hooks()
Having encountered a trinity report in linux-next (Linked in the 'Closes'
tag) it appears that there are legitimate situations where a file-backed
mapping can be acquired but no file->f_op->mmap or
file->f_op->mmap_prepare is set, at which point do_mmap() should simply
error out with -ENODEV.
Since previously we did not warn in this scenario and it appears we rely
upon this, restore this situation, while retaining a WARN_ON_ONCE() for
the case where both are set, which is absolutely incorrect and must be
addressed and thus always requires a warning.
If further work is required to chase down precisely what is causing this,
then we can later restore this, but it makes no sense to hold up this
series to do so, as this is existing and apparently expected behaviour.
Link: https://lkml.kernel.org/r/20250514084024.29148-1-lorenzo.stoakes@oracle.com
Fixes:
|
||
|
|
c84bf6dd2b |
mm: introduce new .mmap_prepare() file callback
Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2. During the mmap() of a file-backed mapping, we invoke the underlying driver file's mmap() callback in order to perform driver/file system initialisation of the underlying VMA. This has been a source of issues in the past, including a significant security concern relating to unwinding of error state discovered by Jann Horn, as fixed in commit |
||
|
|
5c2a430e85 |
Ext4 bug fixes and cleanups, including:
* hardening against maliciously fuzzed file systems
* backwards compatibility for the brief period when we attempted to
ignore zero-width characters
* avoid potentially BUG'ing if there is a file system corruption found
during the file system unmount
* fix free space reporting by statfs when project quotas are enabled
and the free space is less than the remaining project quota
Also improve performance when replaying a journal with a very large
number of revoke records (applicable for Lustre volumes).
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmflfY4ACgkQ8vlZVpUN
gaMx7Qf/akTELvyBZ7iPCCHh2HwayuO8qLhPNqrU0TmYMFvgwgYUPcQ3BLn8CE+/
j5UeT8XxNaLU4GJn3z+q6yW6PnNHfqZqKry9j/iPc3s1mjTslntr/xENlgu6i4Bp
Q58xc7Pj45vdmP+xmYhRnJcefgsZMvB/N1SEHxwIP8bntZqsEvP9pI82r9Ouc8SA
ZLQ1/K4OADmk7f3GhlPr9AtgH7O0CjlAas30h/AW77DXBQl7ZgbDsGDlgTwaGqkR
jHcvfr6hLnWy+MUVGmlNZ2HY6iUgBPItWlYCP/fsrUdnc+CONyl5E17JPSl1QQtR
CLYlo4xV8j1+zJ094DjhDWMKI2G7jw==
=oudL
-----END PGP SIGNATURE-----
Merge tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Ext4 bug fixes and cleanups, including:
- hardening against maliciously fuzzed file systems
- backwards compatibility for the brief period when we attempted to
ignore zero-width characters
- avoid potentially BUG'ing if there is a file system corruption
found during the file system unmount
- fix free space reporting by statfs when project quotas are enabled
and the free space is less than the remaining project quota
Also improve performance when replaying a journal with a very large
number of revoke records (applicable for Lustre volumes)"
* tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits)
ext4: fix OOB read when checking dotdot dir
ext4: on a remount, only log the ro or r/w state when it has changed
ext4: correct the error handle in ext4_fallocate()
ext4: Make sb update interval tunable
ext4: avoid journaling sb update on error if journal is destroying
ext4: define ext4_journal_destroy wrapper
ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
jbd2: add a missing data flush during file and fs synchronization
ext4: don't over-report free space or inodes in statvfs
ext4: clear DISCARD flag if device does not support discard
jbd2: remove jbd2_journal_unfile_buffer()
ext4: reorder capability check last
ext4: update the comment about mb_optimize_scan
jbd2: fix off-by-one while erasing journal
ext4: remove references to bh->b_page
ext4: goto right label 'out_mmap_sem' in ext4_setattr()
ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
ext4: introduce ITAIL helper
jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
ext4: remove redundant function ext4_has_metadata_csum
...
|
||
|
|
e41170cc5e |
vfs-6.15-rc1.pagesize
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ jKpRSRct2eTbyxdYiGydHQGm5F5sLg4= =0FyQ -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset() |
||
|
|
130e696aa6 |
vfs-6.15-rc1.mount.namespace
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90r2wAKCRCRxhvAZXjc
ouC6AQCk3MoqskN0WeNcaZT23dB7dHbEhf/7YXOFC9MFRMKXqQD9Fbn95+GuIe3U
nBVPbVyQfDtfXE08ml6gbDJrCsbkkQI=
=Xm1C
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount namespace updates from Christian Brauner:
"This expands the ability of anonymous mount namespaces:
- Creating detached mounts from detached mounts
Currently, detached mounts can only be created from attached
mounts. This limitaton prevents various use-cases. For example, the
ability to mount a subdirectory without ever having to make the
whole filesystem visible first.
The current permission modelis:
(1) Check that the caller is privileged over the owning user
namespace of it's current mount namespace.
(2) Check that the caller is located in the mount namespace of the
mount it wants to create a detached copy of.
While it is not strictly necessary to do it this way it is
consistently applied in the new mount api. This model will also be
used when allowing the creation of detached mount from another
detached mount.
The (1) requirement can simply be met by performing the same check
as for the non-detached case, i.e., verify that the caller is
privileged over its current mount namespace.
To meet the (2) requirement it must be possible to infer the origin
mount namespace that the anonymous mount namespace of the detached
mount was created from.
The origin mount namespace of an anonymous mount is the mount
namespace that the mounts that were copied into the anonymous mount
namespace originate from.
In order to check the origin mount namespace of an anonymous mount
namespace the sequence number of the original mount namespace is
recorded in the anonymous mount namespace.
With this in place it is possible to perform an equivalent check
(2') to (2). The origin mount namespace of the anonymous mount
namespace must be the same as the caller's mount namespace. To
establish this the sequence number of the caller's mount namespace
and the origin sequence number of the anonymous mount namespace are
compared.
The caller is always located in a non-anonymous mount namespace
since anonymous mount namespaces cannot be setns()ed into. The
caller's mount namespace will thus always have a valid sequence
number.
The owning namespace of any mount namespace, anonymous or
non-anonymous, can never change. A mount attached to a
non-anonymous mount namespace can never change mount namespace.
If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.
Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to copy the
mount tree.
- Allow mount detached mounts on detached mounts
Currently, detached mounts can only be mounted onto attached
mounts. This limitation makes it impossible to assemble a new
private rootfs and move it into place. Instead, a detached tree
must be created, attached, then mounted open and then either moved
or detached again. Lift this restriction.
In order to allow mounting detached mounts onto other detached
mounts the same permission model used for creating detached mounts
from detached mounts can be used (cf. above).
Allowing to mount detached mounts onto detached mounts leaves three
cases to consider:
(1) The source mount is an attached mount and the target mount is
a detached mount. This would be equivalent to moving a mount
between different mount namespaces. A caller could move an
attached mount to a detached mount. The detached mount can now
be freely attached to any mount namespace. This changes the
current delegatioh model significantly for no good reason. So
this will fail.
(2) Anonymous mount namespaces are always attached fully, i.e., it
is not possible to only attach a subtree of an anoymous mount
namespace. This simplifies the implementation and reasoning.
Consequently, if the anonymous mount namespace of the source
detached mount and the target detached mount are the identical
the mount request will fail.
(3) The source mount's anonymous mount namespace is different from
the target mount's anonymous mount namespace.
In this case the source anonymous mount namespace of the
source mount tree must be freed after its mounts have been
moved to the target anonymous mount namespace. The source
anonymous mount namespace must be empty afterwards.
By allowing to mount detached mounts onto detached mounts a caller
may do the following:
fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)
fd_tree1 and fd_tree2 refer to two different detached mount trees
that belong to two different anonymous mount namespace.
It is important to note that fd_tree1 and fd_tree2 both refer to
the root of their respective anonymous mount namespaces.
By allowing to mount detached mounts onto detached mounts the
caller may now do:
move_mount(fd_tree1, "", fd_tree2, "",
MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)
This will cause the detached mount referred to by fd_tree1 to be
mounted on top of the detached mount referred to by fd_tree2.
Thus, the detached mount fd_tree1 is moved from its separate
anonymous mount namespace into fd_tree2's anonymous mount
namespace.
It also means that while fd_tree2 continues to refer to the root of
its respective anonymous mount namespace fd_tree1 doesn't anymore.
This has the consequence that only fd_tree2 can be moved to another
anonymous or non-anonymous mount namespace. Moving fd_tree1 will
now fail as fd_tree1 doesn't refer to the root of an anoymous mount
namespace anymore.
Now fd_tree1 and fd_tree2 refer to separate detached mount trees
referring to the same anonymous mount namespace.
This is conceptually fine. The new mount api does allow for this to
happen already via:
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/A
mount -t tmpfs tmpfs /mnt/A
fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)
Both fd_tree3 and fd_tree4 refer to two different detached mount
trees but both detached mount trees refer to the same anonymous
mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3
may be moved another mount namespace as fd_tree3 refers to the root
of the anonymous mount namespace just while fd_tree4 doesn't.
However, there's an important difference between the
fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example.
Closing fd_tree4 and releasing the respective struct file will have
no further effect on fd_tree3's detached mount tree.
However, closing fd_tree3 will cause the mount tree and the
respective anonymous mount namespace to be destroyed causing the
detached mount tree of fd_tree4 to be invalid for further mounting.
By allowing to mount detached mounts on detached mounts as in the
fd_tree1/fd_tree2 example both struct files will affect each other.
Both fd_tree1 and fd_tree2 refer to struct files that have
FMODE_NEED_UNMOUNT set.
To handle this we use the fact that @fd_tree1 will have a parent
mount once it has been attached to @fd_tree2.
When dissolve_on_fput() is called the mount that has been passed in
will refer to the root of the anonymous mount namespace. If it
doesn't it would mean that mounts are leaked. So before allowing to
mount detached mounts onto detached mounts this would be a bug.
Now that detached mounts can be mounted onto detached mounts it
just means that the mount has been attached to another anonymous
mount namespace and thus dissolve_on_fput() must not unmount the
mount tree or free the anonymous mount namespace as the file
referring to the root of the namespace hasn't been closed yet.
If it had been closed yet it would be obvious because the mount
namespace would be NULL, i.e., the @fd_tree1 would have already
been unmounted. If @fd_tree1 hasn't been unmounted yet and has a
parent mount it is safe to skip any cleanup as closing @fd_tree2
will take care of all cleanup operations.
- Allow mount propagation for detached mount trees
In commit
|
||
|
|
26d8e43079 |
vfs-6.15-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
=VJgY
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.
|
||
|
|
99c21beaab |
vfs-6.15-rc1.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90p4AAKCRCRxhvAZXjc
ojMIAP9atkG3u7+490+NGWLdulQlaHnD51Owa9MiW87UfKpsTQEArwi/NrJqXJNT
PFQ2xIa5TxG+9haChR89w3kjZ6b/hgs=
=iDkx
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Add CONFIG_DEBUG_VFS infrastucture:
- Catch invalid modes in open
- Use the new debug macros in inode_set_cached_link()
- Use debug-only asserts around fd allocation and install
- Place f_ref to 3rd cache line in struct file to resolve false
sharing
Cleanups:
- Start using anon_inode_getfile_fmode() helper in various places
- Don't take f_lock during SEEK_CUR if exclusion is guaranteed by
f_pos_lock
- Add unlikely() to kcmp()
- Remove legacy ->remount_fs method from ecryptfs after port to the
new mount api
- Remove invalidate_inodes() in favour of evict_inodes()
- Simplify ep_busy_loopER by removing unused argument
- Avoid mmap sem relocks when coredumping with many missing pages
- Inline getname()
- Inline new_inode_pseudo() and de-staticize alloc_inode()
- Dodge an atomic in putname if ref == 1
- Consistently deref the files table with rcu_dereference_raw()
- Dedup handling of struct filename init and refcounts bumps
- Use wq_has_sleeper() in end_dir_add()
- Drop the lock trip around I_NEW wake up in evict()
- Load the ->i_sb pointer once in inode_sb_list_{add,del}
- Predict not reaching the limit in alloc_empty_file()
- Tidy up do_sys_openat2() with likely/unlikely
- Call inode_sb_list_add() outside of inode hash lock
- Sort out fd allocation vs dup2 race commentary
- Turn page_offset() into a wrapper around folio_pos()
- Remove locking in exportfs around ->get_parent() call
- try_lookup_one_len() does not need any locks in autofs
- Fix return type of several functions from long to int in open
- Fix return type of several functions from long to int in ioctls
Fixes:
- Fix watch queue accounting mismatch"
* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: sort out fd allocation vs dup2 race commentary, take 2
fs: call inode_sb_list_add() outside of inode hash lock
fs: tidy up do_sys_openat2() with likely/unlikely
fs: predict not reaching the limit in alloc_empty_file()
fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
fs: drop the lock trip around I_NEW wake up in evict()
fs: use wq_has_sleeper() in end_dir_add()
VFS/autofs: try_lookup_one_len() does not need any locks
fs: dedup handling of struct filename init and refcounts bumps
fs: consistently deref the files table with rcu_dereference_raw()
exportfs: remove locking around ->get_parent() call.
fs: use debug-only asserts around fd allocation and install
fs: dodge an atomic in putname if ref == 1
vfs: Remove invalidate_inodes()
ecryptfs: remove NULL remount_fs from super_operations
watch_queue: fix pipe accounting mismatch
fs: place f_ref to 3rd cache line in struct file to resolve false sharing
epoll: simplify ep_busy_loop by removing always 0 argument
fs: Turn page_offset() into a wrapper around folio_pos()
kcmp: improve performance adding an unlikely hint to task comparisons
...
|
||
|
|
c4cff1ea37 |
vfs-6.15-rc1.mount.api
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90q+gAKCRCRxhvAZXjc ol6uAQCGxtqC3w0bXLF1SKsRlDlYvmAJzRVHMGIxhwpNzeyGAQEApK52+WovJy+1 zozWTCiGXF8N/IhlqVlrMb6GANmjWAQ= =CcBh -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API updates from Christian Brauner: "This converts the remaining pseudo filesystems to the new mount api. The sysv conversion is a bit gratuitous because we remove sysv in another pull request. But if we have to revert the removal we at least will have it converted to the new mount api already" * tag 'vfs-6.15-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sysv: convert sysv to use the new mount api vfs: remove some unused old mount api code devtmpfs: replace ->mount with ->get_tree in public instance vfs: Convert devpts to use the new mount API pstore: convert to the new mount API |
||
|
|
611851010c
|
fs: dedup handling of struct filename init and refcounts bumps
No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250313142744.1323281-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
a64e5a5960
|
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
The commit titled "block/bdev: lift block size restrictions to 64k" lifted the block layer's max supported block size to 64k inside the helper blk_validate_block_size() now that we support large folios. However in lifting the block size we also removed the silly use cases many filesystems have to use sb_set_blocksize() to *verify* that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was used to check the block size <= PAGE_SIZE since historically we've always supported userspace to create for example 64k block size filesystems even on 4k page size systems, but what we didn't allow was mounting them. Older filesystems have been using the check with sb_set_blocksize() for years. While, we could argue that such checks should be filesystem specific, there are much more users of sb_set_blocksize() than LBS enabled filesystem on upstream, so just do the easier thing and bring back the PAGE_SIZE check for sb_set_blocksize() users and only skip it for LBS enabled filesystems. This will ensure that tests such as generic/466 when run in a loop against say, ext4, won't try to try to actually mount a filesystem with a block size larger than your filesystem supports given your PAGE_SIZE and in the worst case crash. Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
c54b386969
|
VFS: Change vfs_mkdir() to return the dentry.
vfs_mkdir() does not guarantee to leave the child dentry hashed or make it positive on success, and in many such cases the filesystem had to use a different dentry which it can now return. This patch changes vfs_mkdir() to return the dentry provided by the filesystems which is hashed and positive when provided. This reduces the number of cases where the resulting dentry is not positive to a handful which don't deserve extra efforts. The only callers of vfs_mkdir() which are interested in the resulting inode are in-kernel filesystem clients: cachefiles, nfsd, smb/server. The only filesystems that don't reliably provide the inode are: - kernfs, tracefs which these clients are unlikely to be interested in - cifs in some configurations would need to do a lookup to find the created inode, but doesn't. cifs cannot be exported via NFS, is unlikely to be used by cachefiles, and smb/server only has a soft requirement for the inode, so this is unlikely to be a problem in practice. - hostfs, nfs, cifs may need to do a lookup (rarely for NFS) and it is possible for a race to make that lookup fail. Actual failure is unlikely and providing callers handle negative dentries graceful they will fail-safe. So this patch removes the lookup code in nfsd and smb/server and adjusts them to fail safe if a negative dentry is provided: - cache-files already fails safe by restarting the task from the top - it still does with this change, though it no longer calls cachefiles_put_directory() as that will crash if the dentry is negative. - nfsd reports "Server-fault" which it what it used to do if the lookup failed. This will never happen on any file-systems that it can actually export, so this is of no consequence. I removed the fh_update() call as that is not needed and out-of-place. A subsequent nfsd_create_setattr() call will call fh_update() when needed. - smb/server only wants the inode to call ksmbd_smb_inherit_owner() which updates ->i_uid (without calling notify_change() or similar) which can be safely skipping on cifs (I hope). If a different dentry is returned, the first one is put. If necessary the fact that it is new can be determined by comparing pointers. A new dentry will certainly have a new pointer (as the old is put after the new is obtained). Similarly if an error is returned (via ERR_PTR()) the original dentry is put. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-7-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
f9fde814de
|
fs: support getname_maybe_null() in move_mount()
Allow move_mount() to work with NULL path arguments. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-8-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
e249056c91
|
fs: place f_ref to 3rd cache line in struct file to resolve false sharing
When running syscall pread in a high core count system, f_ref contends with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the same cache line. This change places f_ref to the 3rd cache line where fields are not updated as frequently as the 1st cache line, and the contention is grealy reduced according to tests. In addition, the size of file object is kept in 3 cache lines. This change has been tested with rocksdb benchmark readwhilewriting case in 1 socket 64 physical core 128 logical core baremetal machine, with build config CONFIG_RANDSTRUCT_NONE=y Command: ./db_bench --benchmarks="readwhilewriting" --threads $cnt --duration 60 The throughput(ops/s) is improved up to ~21%. ===== thread baseline compare 16 100% +1.3% 32 100% +2.2% 64 100% +7.2% 128 100% +20.9% It was also tested with UnixBench: syscall, fsbuffer, fstime, fsdisk cases that has been used for file struct layout tuning, no regression was observed. Signed-off-by: Pan Deng <pan.deng@intel.com> Link: https://lore.kernel.org/r/20250228020059.3023375-1-pan.deng@intel.com Tested-by: Lipeng Zhu <lipeng.zhu@intel.com> Reviewed-by: Tianyou Li <tianyou.li@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
88d5baf690
|
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
||
|
|
8510edf191
|
mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()
iocb->ki_pos has been updated with the number of written bytes since
generic_perform_write().
Besides __filemap_fdatawrite_range() accepts the inclusive end of the
data range.
Fixes:
|
||
|
|
1479be6258
|
vfs: inline new_inode_pseudo() and de-staticize alloc_inode()
The former is a no-op wrapper with the same argument. I left it in place to not lose the information who needs it -- one day "pseudo" inodes may start differing from what alloc_inode() returns. In the meantime no point taking a detour. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250212180459.1022983-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
1bb772565f
|
vfs: inline getname()
It is merely a trivial wrapper around getname_flags which adds a zeroed argument, no point paying for an extra call. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250206000105.432528-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
f326565c44
|
ioctl: Fix return type of several functions from long to int
Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com> Link: https://lore.kernel.org/r/20250121070844.4413-3-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
29d80d506b
|
open: Fix return type of several functions from long to int
Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com> Link: https://lore.kernel.org/r/20250121070844.4413-2-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
3eb7e95104
|
vfs: use the new debug macros in inode_set_cached_link()
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250209185523.745956-4-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
8b17e54096
|
vfs: add initial support for CONFIG_DEBUG_VFS
Small collection of macros taken from mmdebug.h Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250209185523.745956-2-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
b4c173dfbb
|
fuse: don't truncate cached, mutated symlink
Fuse allows the value of a symlink to change and this property is exploited by some filesystems (e.g. CVMFS). It has been observed, that sometimes after changing the symlink contents, the value is truncated to the old size. This is caused by fuse_getattr() racing with fuse_reverse_inval_inode(). fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which results in fuse_change_attributes() exiting before updating the cached attributes This is okay, as the cached attributes remain invalid and the next call to fuse_change_attributes() will likely update the inode with the correct values. The reason this causes problems is that cached symlinks will be returned through page_get_link(), which truncates the symlink to inode->i_size. This is correct for filesystems that don't mutate symlinks, but in this case it causes bad behavior. The solution is to just remove this truncation. This can cause a regression in a filesystem that relies on supplying a symlink larger than the file size, but this is unlikely. If that happens we'd need to make this behavior conditional. Reported-by: Laura Promberger <laura.promberger@cern.ch> Tested-by: Sam Lewis <samclewis@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
9e28059d56 |
ext4: introduce linear search for dentries
This patch addresses an issue where some files in case-insensitive directories become inaccessible due to changes in how the kernel function, utf8_casefold(), generates case-folded strings from the commit |
||
|
|
37d11cfc63
|
vfs: sanity check the length passed to inode_set_cached_link()
This costs a strlen() call when instatianating a symlink. Preferably it would be hidden behind VFS_WARN_ON (or compatible), but there is no such facility at the moment. With the facility in place the call can be patched out in production kernels. In the meantime, since the cost is being paid unconditionally, use the result to a fixup the bad caller. This is not expected to persist in the long run (tm). Sample splat: bad length passed for symlink [/tmp/syz-imagegen43743633/file0/file0] (got 131109, expected 37) [rest of WARN blurp goes here] Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250204213207.337980-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
95101401bb
|
fsnotify: use accessor to set FMODE_NONOTIFY_*
The FMODE_NONOTIFY_* bits are a 2-bits mode. Open coding manipulation of those bits is risky. Use an accessor file_set_fsnotify_mode() to set the mode. Rename file_set_fsnotify_mode() => file_set_fsnotify_mode_from_watchers() to make way for the simple accessor name. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250203223205.861346-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
bdfa77e7c6
|
vfs: remove some unused old mount api code
Remove reconfigure_single, mount_single, and compare_single now that no users remain. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/20250205213931.74614-5-sandeen@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
9c5968db9e |
The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
||
|
|
1d44575765 |
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
When a buffered write submitted with IOCB_DONTCACHE has been successfully submitted, call filemap_fdatawrite_range_kick() to kick off the IO. File systems call generic_write_sync() for any successful buffered write submission, hence add the logic here rather than needing to modify the file system. Link: https://lkml.kernel.org/r/20241220154831.1086649-12-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dddc559f2e |
mm/filemap: add filemap_fdatawrite_range_kick() helper
Works like filemap_fdatawrite_range(), except it's a non-integrity data writeback and hence only starts writeback on the specified range. Will help facilitate generically starting uncached writeback from generic_write_sync(), as header dependencies preclude doing this inline from fs.h. Link: https://lkml.kernel.org/r/20241220154831.1086649-11-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b9f958d4f1 |
fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP. Link: https://lkml.kernel.org/r/20241220154831.1086649-8-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8883957b3c |
\n
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmePs7oACgkQnJ2qBz9k QNmHuAf9GkLnY5u1/81xP5V9ukZ4N2yeMW0dydLS5cjWj/St5ELeMAza3jeqtJtD j36vbnmy2c5pPaGLAK8BJpMXT/R2TkmmKD004zcfqF2S3SgbGzdgO1zMZzq9KJpM woRKZtLuglDajedsDEBBcKotBhlN2+C/sQlFuL1mX4zitk9ajr0qYUB1+JqOeg5f qwPsDLT077ADpxd7lVIMcm+OqbduP5KWkBKYHpn7lJcLe1eqVMMzceJroW42zhVG Dq8Iln26bbU9Wx6FSPFCUcHEzHRHUfXmu07HN9U0X++0QgWjrmBQQLooGFB/bR4a edBrPpVas6xE4/brjgFX3gOKtv8xYg== =ewDV -----END PGP SIGNATURE----- Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ... |
||
|
|
a312e1706c |
for-6.14/io_uring-20250119
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeNDEUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpl5hD/4t7kWWNQDeQG9CiA3QStMJ5Yow2AgYtK8f
sJBr5/6PGEsbTreX//Kh8DtPZPRGcjG9elCo58QxWaPZ2mg3fTOR3/QYLMlaGXU2
hSht58lj32utpuzMjMo9bG3aesi03bLf+buaq7V1FaMlcTV8rXqK1s/HGtphDBRo
8tNLEk3JDJDs3vlWbNp/5Hqh9+Ro6DU8df1zWWH4Vbu8RXaGIPyJyjKvvcbfuuCf
k7Ay45XNAmTZg+rSNGv1H3Yn1LNzPMVFLWBfzRahPCzlKy2+mJMWz1PWu9naaUK+
WTM+kgiBLF24k59G/9xuxC5bYtsTjTbr4GsEE5ZvFBnhKPzLzzaJj7iQHRj83vtv
tqxNmAbA3wJoNk48Zr8+cYbfDX9Q9Pl32wIaS/LxRgF9MT4lem6pyKY7Skd12oK3
rnQ8moGtnOBxp3QUU6BZ7IX3ipb+Bgw7FhZbtVYJdlqKeKyi1QO0MuITwGXpMwk/
EWDDTsspIf+QaTu+fmO8byJavugKljW8t7hM1JpvlfOLl+rsh6/+AYz42fCvcaA0
Tu4bpUk8SuwALvZfU2R6bLkorGG6MFuGI8g3eixOcGir3YAcHBMfdg6ItpZi5qVt
ToM87BMaezOZZvSwX1JBaQ0AR5HBQYmHaiLWgPsORf3PjJ0kz+u21SK9D+yJkUtU
rT6+HvoVXA==
=ufpE
-----END PGP SIGNATURE-----
Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups
and code consolidation:
- Support for PI meta data read/write via io_uring, with NVMe and
SCSI covered
- Cleanup the per-op structure caching, making it consistent across
various command types
- Consolidate the various user mapped features into a concept called
regions, making the various users of that consistent
- Various cleanups and fixes"
* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
io_uring: reuse io_should_terminate_tw() for cmds
io_uring: Factor out a function to parse restrictions
io_uring/rsrc: require cloned buffers to share accounting contexts
io_uring: simplify the SQPOLL thread check when cancelling requests
io_uring: expose read/write attribute capability
io_uring/rw: don't gate retry on completion context
io_uring/rw: handle -EAGAIN retry at IO completion time
io_uring/rw: use io_rw_recycle() from cleanup path
io_uring/rsrc: simplify the bvec iter count calculation
io_uring: ensure io_queue_deferred() is out-of-line
io_uring/rw: always clear ->bytes_done on io_async_rw setup
io_uring/rw: use NULL for rw->free_iovec assigment
io_uring/rw: don't mask in f_iocb_flags
io_uring/msg_ring: Drop custom destructor
io_uring: Move old async data allocation helper to header
io_uring/rw: Allocate async data through helper
io_uring/net: Allocate msghdr async data through helper
io_uring/uring_cmd: Allocate async data through generic helper
io_uring/poll: Allocate apoll with generic alloc_cache helper
...
|
||
|
|
7e587c20ad |
vfs-6.14-rc1.libfs
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pSLQAKCRCRxhvAZXjc
oq92AP4qTO8+FFRok2nhHlK4YNPhiqni1KabYXuHakL1ESw8OQD+O1wLgw8FUkgv
jxi+KmxMz9Asg2wdnLrSGEZJ709eOgc=
=6dn7
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs libfs updates from Christian Brauner:
"This improves the stable directory offset behavior in various ways.
Stable offsets are needed so that NFS can reliably read directories on
filesystems such as tmpfs:
- Improve the end-of-directory detection
According to getdents(3), the d_off field in each returned
directory entry points to the next entry in the directory. The
d_off field in the last returned entry in the readdir buffer must
contain a valid offset value, but if it points to an actual
directory entry, then readdir/getdents can loop.
Introduce a specific fixed offset value that is placed in the d_off
field of the last entry in a directory. Some user space
applications assume that the EOD offset value is larger than the
offsets of real directory entries, so the largest valid offset
value is reserved for this purpose. This new value is never
allocated by simple_offset_add().
When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
value into the d_off field of the last valid entry in the readdir
buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
offset value so the last entry is updated to point to the EOD
marker.
When trying to read the entry at the EOD offset, offset_readdir()
terminates immediately.
- Rely on d_children to iterate stable offset directories
Instead of using the mtree to emit entries in the order of their
offset values, use it only to map incoming ctx->pos to a starting
entry. Then use the directory's d_children list, which is already
maintained properly by the dcache, to find the next child to emit.
- Narrow the range of directory offset values returned by
simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This
means the allocation behavior is identical on 32-bit systems,
64-bit systems, and 32-bit user space on 64-bit kernels. The new
range still permits over 2 billion concurrent entries per
directory.
- Return ENOSPC when the directory offset range is exhausted. Hitting
this error is almost impossible though.
- Remove the simple_offset_empty() helper"
* tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
libfs: Use d_children list to iterate simple_offset directories
libfs: Replace simple_offset end-of-directory detection
Revert "libfs: fix infinite directory reads for offset dir"
Revert "libfs: Add simple_offset_empty()"
libfs: Return ENOSPC when the directory offset range is exhausted
|
||
|
|
d7bde4f27c
|
Revert "libfs: Add simple_offset_empty()"
simple_empty() and simple_offset_empty() perform the same task. The latter's use as a canary to find bugs has not found any new issues. A subsequent patch will remove the use of the mtree for iterating directory contents, so revert back to using a similar mechanism for determining whether a directory is indeed empty. Only one such mechanism is ever needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-3-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
d2fc0ed52a
|
Merge branch 'vfs-6.14.uncached_buffered_io'
Bring in the VFS changes for uncached buffered io. Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
af6505e574
|
fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20241220154831.1086649-8-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
4de2ce04c8 |
fs: introduce IOCB_HAS_METADATA for metadata
Introduce an IOCB_HAS_METADATA flag for the kiocb struct, for handling requests containing meta payload. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241128112240.8867-6-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
ea38219907
|
vfs: support caching symlink lengths in inodes
When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4. Filesystems opt in by calling inode_set_cached_link() when creating an inode. The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space. Churn-wise the current readlink_copy() helper is patched to accept the size instead of calculating it. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20241120112037.822078-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
0357ef03c9 |
fs: don't block write during exec on pre-content watched files
Commit
|
||
|
|
f156524e5d |
fsnotify: introduce pre-content permission events
The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM, but it meant for a different use case of filling file content before access to a file range, so it has slightly different semantics. Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, so content scanners could inspect the content filled by pre-content event handler. Unlike FS_ACCESS_PERM, FS_PRE_ACCESS is also called before a file is modified by syscalls as write() and fallocate(). FS_ACCESS_PERM is reported also on blockdev and pipes, but the new pre-content events are only reported for regular files and dirs. The pre-content events are meant to be used by hierarchical storage managers that want to fill the content of files on first access. There are some specific requirements from filesystems that could be used with pre-content events, so add a flag for fs to opt-in for pre-content events explicitly before they can be used. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/b934c5e3af205abc4e0e4709f6486815937ddfdf.1731684329.git.josef@toxicpanda.com |
||
|
|
a94204f4d4 |
fsnotify: opt-in for permission events at file open time
Legacy inotify/fanotify listeners can add watches for events on inode, parent or mount and expect to get events (e.g. FS_MODIFY) on files that were already open at the time of setting up the watches. fanotify permission events are typically used by Anti-malware sofware, that is watching the entire mount and it is not common to have more that one Anti-malware engine installed on a system. To reduce the overhead of the fsnotify_file_perm() hooks on every file access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate events only if there were *any* permission event listeners on the filesystem at the time that the file was opened. The new semantic is implemented by extending the FMODE_NONOTIFY bit into two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the events types to report. This is going to apply to the new fanotify pre-content events in order to reduce the cost of the new pre-content event vfs hooks. [Thanks to Bert Karwatzki <spasswolf@web.de> for reporting a bug in this code with CONFIG_FANOTIFY_ACCESS_PERMISSIONS disabled] Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/5ea5f8e283d1edb55aa79c35187bfe344056af14.1731684329.git.josef@toxicpanda.com |
||
|
|
ebe559609d |
fs: get rid of __FMODE_NONOTIFY kludge
All it takes to get rid of the __FMODE_NONOTIFY kludge is switching fanotify from anon_inode_getfd() to anon_inode_getfile_fmode() and adding a dentry_open_nonotify() helper to be used by fanotify on the other path. That's it - no more weird shit in OPEN_FMODE(), etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/linux-fsdevel/20241113043003.GH3387508@ZenIV/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/d1231137e7b661a382459e79a764259509a4115d.1731684329.git.josef@toxicpanda.com |
||
|
|
82339c4911 |
sanitize xattr and io_uring interactions with it,
add *xattrat() syscalls, sanitize struct filename handling in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdj4gAKCRBZ7Krx/gZQ 6/02AQC8ndn9i1wLGRb5DdZYGNWUDhXCdPrZCF2nyvU2swCIPwEAm1H5F/bxBXeT 6qCLHThVw4KTJOT2aDY03ELrxbi8Vg4= =35Oj -----END PGP SIGNATURE----- Merge tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull xattr updates from Al Viro: "Sanitize xattr and io_uring interactions with it, add *xattrat() syscalls, sanitize struct filename handling in there" * tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xattr: remove redundant check on variable err fs/xattr: add *at family syscalls new helpers: file_removexattr(), filename_removexattr() new helpers: file_listxattr(), filename_listxattr() replace do_getxattr() with saner helpers. replace do_setxattr() with saner helpers. new helper: import_xattr_name() fs: rename struct xattr_ctx to kernel_xattr_ctx xattr: switch to CLASS(fd) io_[gs]etxattr_prep(): just use getname() io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE getname_maybe_null() - the third variant of pathname copy-in teach filename_lookup() to treat NULL filename as "" |
||
|
|
241c7ed4d4 |
vfs-6.13.untorn.writes
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcopwAKCRCRxhvAZXjc oitWAQD68PGFI6/ES9x+qGsDFEZBH08icuO+a9dyaZXyNRosDgD/ex2zHj6F7IzS Ghgb9jiqWQ8l2+PDYfisxa/0jiqCbAk= =DmXf -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs untorn write support from Christian Brauner: "An atomic write is a write issed with torn-write protection. This means for a power failure or any hardware failure all or none of the data from the write will be stored, never a mix of old and new data. This work is already supported for block devices. If a block device is opened with O_DIRECT and the block device supports atomic write, then FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block device. This contains the work to expand atomic write support to filesystems, specifically ext4 and XFS. Currently, only support for writing exactly one filesystem block atomically is added. Since it's now possible to have filesystem block size > page size for XFS, it's possible to write 4K+ blocks atomically on x86" * tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: drop an obsolete comment in iomap_dio_bio_iter ext4: Do not fallback to buffered-io for DIO atomic write ext4: Support setting FMODE_CAN_ATOMIC_WRITE ext4: Check for atomic writes support in write iter ext4: Add statx support for atomic writes xfs: Support setting FMODE_CAN_ATOMIC_WRITE xfs: Validate atomic writes xfs: Support atomic write for statx fs: iomap: Atomic write support fs: Export generic_atomic_write_valid() block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid() |
||
|
|
7956186e75 |
vfs-6.13.tmpfs
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcZIgAKCRCRxhvAZXjc oge4AQDxhsKW+v/jKHydzqzwG3Ks7DIxrUg/mcGfdtBwjiWgvwEA8t0QAAfKECAK B0+bNKJ8XJRUtZ10Jgm3dzURbEhBWgU= =4Lui -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull tmpfs case folding updates from Christian Brauner: "This adds case-insensitive support for tmpfs. The work contained in here adds support for case-insensitive file names lookups in tmpfs. The main difference from other casefold filesystems is that tmpfs has no information on disk, just on RAM, so we can't use mkfs to create a case-insensitive tmpfs. For this implementation, there's a mount option for casefolding. The rest of the patchset follows a similar approach as ext4 and f2fs. The use case for this feature is similar to the use case for ext4, to better support compatibility layers (like Wine), particularly in combination with sandboxing/container tools (like Flatpak). Those containerization tools can share a subset of the host filesystem with an application. In the container, the root directory and any parent directories required for a shared directory are on tmpfs, with the shared directories bind-mounted into the container's view of the filesystem. If the host filesystem is using case-insensitive directories, then the application can do lookups inside those directories in a case-insensitive way, without this needing to be implemented in user-space. However, if the host is only sharing a subset of a case-insensitive directory with the application, then the parent directories of the mount point will be part of the container's root tmpfs. When the application tries to do case-insensitive lookups of those parent directories on a case-sensitive tmpfs, the lookup will fail" * tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs: Initialize sysfs during tmpfs init tmpfs: Fix type for sysfs' casefold attribute libfs: Fix kernel-doc warning in generic_ci_validate_strict_name docs: tmpfs: Add casefold options tmpfs: Expose filesystem features via sysfs tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs tmpfs: Add casefold lookup support libfs: Export generic_ci_ dentry functions unicode: Recreate utf8_parse_version() unicode: Export latest available UTF-8 version number ext4: Use generic_ci_validate_strict_name helper libfs: Create the helper function generic_ci_validate_strict_name() |
||
|
|
4c797b11a8 |
vfs-6.13.file
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcW4gAKCRCRxhvAZXjc
okF+AP9xTMb2SlnRPBOBd9yFcmVXmQi86TSCUPAEVb+wIldGYwD/RIOdvXYJlp9v
RgJkU1DC3ddkXtONNDY6gFaP+siIWA0=
=gMc7
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
"This contains changes the changes for files for this cycle:
- Introduce a new reference counting mechanism for files.
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop
it has O(N^2) behaviour under contention with N concurrent
operations and it is in a hot path in __fget_files_rcu().
The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make
this work and requiring rcu protection for the data structure in
question. This not just scales better it also introduces overflow
protection.
However, in contrast to generic rcuref, files require a memory
barrier and thus cannot rely on *_relaxed() atomic operations and
also require to be built on atomic_long_t as having massive amounts
of reference isn't unheard of even if it is just an attack.
This adds a file specific variant instead of making this a generic
library.
This has been tested by various people and it gives consistent
improvement up to 3-5% on workloads with loads of threads.
- Add a fastpath for find_next_zero_bit(). Skip 2-levels searching
via find_next_zero_bit() when there is a free slot in the word that
contains the next fd. This improves pts/blogbench-1.1.0 read by 8%
and write by 4% on Intel ICX 160.
- Conditionally clear full_fds_bits since it's very likely that a bit
in full_fds_bits has been cleared during __clear_open_fds(). This
improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on
Intel ICX 160.
- Get rid of all lookup_*_fdget_rcu() variants. They were used to
lookup files without taking a reference count. That became invalid
once files were switched to SLAB_TYPESAFE_BY_RCU and now we're
always taking a reference count. Switch to an already existing
helper and remove the legacy variants.
- Remove pointless includes of <linux/fdtable.h>.
- Avoid cmpxchg() in close_files() as nobody else has a reference to
the files_struct at that point.
- Move close_range() into fs/file.c and fold __close_range() into it.
- Cleanup calling conventions of alloc_fdtable() and expand_files().
- Merge __{set,clear}_close_on_exec() into one.
- Make __set_open_fd() set cloexec as well instead of doing it in two
separate steps"
* tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor
fs: port files to file_ref
fs: add file_ref
expand_files(): simplify calling conventions
make __set_open_fd() set cloexec state as well
fs: protect backing files with rcu
file.c: merge __{set,clear}_close_on_exec()
alloc_fdtable(): change calling conventions.
fs/file.c: add fast path in find_next_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
move close_range(2) into fs/file.c, fold __close_range() into it
close_files(): don't bother with xchg()
remove pointless includes of <linux/fdtable.h>
get rid of ...lookup...fdget_rcu() family
|
||
|
|
70e7730c2a |
vfs-6.13.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
pqlG+38UdATImpfqqWjPbb72sBYcfQg=
=wLUh
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Fixup and improve NLM and kNFSD file lock callbacks
Last year both GFS2 and OCFS2 had some work done to make their
locking more robust when exported over NFS. Unfortunately, part of
that work caused both NLM (for NFS v3 exports) and kNFSD (for
NFSv4.1+ exports) to no longer send lock notifications to clients
This in itself is not a huge problem because most NFS clients will
still poll the server in order to acquire a conflicted lock
It's important for NLM and kNFSD that they do not block their
kernel threads inside filesystem's file_lock implementations
because that can produce deadlocks. We used to make sure of this by
only trusting that posix_lock_file() can correctly handle blocking
lock calls asynchronously, so the lock managers would only setup
their file_lock requests for async callbacks if the filesystem did
not define its own lock() file operation
However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started
signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
for also trusting posix_lock_file() was inadvertently dropped, so
now most filesystems no longer produce lock notifications when
exported over NFS
Fix this by using an fop_flag which greatly simplifies the problem
and grooms the way for future uses by both filesystems and lock
managers alike
- Add a sysctl to delete the dentry when a file is removed instead of
making it a negative dentry
Commit
|
||
|
|
6ac81fd55e |
vfs-6.13.mgtime
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
=nVlm
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs multigrain timestamps from Christian Brauner:
"This is another try at implementing multigrain timestamps. This time
with significant help from the timekeeping maintainers to reduce the
performance impact.
Thomas provided a base branch that contains the required timekeeping
interfaces for the VFS. It serves as the base for the multi-grain
timestamp work:
- Multigrain timestamps allow the kernel to use fine-grained
timestamps when an inode's attributes is being actively observed
via ->getattr(). With this support, it's possible for a file to get
a fine-grained timestamp, and another modified after it to get a
coarse-grained stamp that is earlier than the fine-grained time. If
this happens then the files can appear to have been modified in
reverse order, which breaks VFS ordering guarantees.
To prevent this, a floor value is maintained for multigrain
timestamps. Whenever a fine-grained timestamp is handed out, record
it, and when later coarse-grained stamps are handed out, ensure
they are not earlier than that value. If the coarse-grained
timestamp is earlier than the fine-grained floor, return the floor
value instead.
The timekeeper changes add a static singleton atomic64_t into
timekeeper.c that is used to keep track of the latest fine-grained
time ever handed out. This is tracked as a monotonic ktime_t value
to ensure that it isn't affected by clock jumps. Because it is
updated at different times than the rest of the timekeeper object,
the floor value is managed independently of the timekeeper via a
cmpxchg() operation, and sits on its own cacheline.
Two new public timekeeper interfaces are added:
(1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
later of the coarse-grained clock and the floor time
(2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
and tries to swap it into the floor. A timespec64 is filled
with the result.
- The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around
1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting
via NFSv3, which relies on timestamps to validate caches. A lot of
changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide when to invalidate the cache. Even with
NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with
timestamp granularity. Other applications have similar issues with
timestamps (e.g backup applications).
If we were to always use fine-grained timestamps, that would
improve the situation, but that becomes rather expensive, as the
underlying filesystem would have to log a lot more metadata
updates.
This adds a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in
inode->i_ctime_nsec as a flag that indicates whether the current
timestamps have been queried via stat() or the like. When it's set,
we allow the kernel to use a fine-grained timestamp iff it's
necessary to make the ctime show a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible
for a file being changed to get a fine-grained timestamp. A file
that is altered just a bit later can then get a coarse-grained one
that appears older than the earlier fine-grained time. This
violates timestamp ordering guarantees.
This is where the earlier mentioned timkeeping interfaces help. A
global monotonic atomic64_t value is kept that acts as a timestamp
floor. When we go to stamp a file, we first get the latter of the
current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it
with that value.
If it has been queried, then first see whether the current coarse
time is later than the existing ctime. If it is, then we accept
that value. If it isn't, then we get a fine-grained time and try to
swap that into the global floor. Whether that succeeds or fails, we
take the resulting floor time, convert it to realtime and try to
swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails,
since either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same
floor value as multigrain filesystems)"
* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reduce pointer chasing in is_mgtime() test
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
timekeeping: Add percpu counter for tracking floor swap events
timekeeping: Add interfaces for handling timestamps with a floor value
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
|
||
|
|
9fed2c0f2f
|
fs: reduce pointer chasing in is_mgtime() test
The is_mgtime test checks whether the FS_MGTIME flag is set in the fstype. To get there from the inode though, we have to dereference 3 pointers. Add a new IOP_MGTIME flag, and have inode_init_always() set that flag when the fstype flag is set. Then, make is_mgtime test for IOP_MGTIME instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241113-mgtime-v1-1-84e256980e11@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
33b091c08e
|
libfs: Fix kernel-doc warning in generic_ci_validate_strict_name
Fix the indentation of the return values from
generic_ci_validate_strict_name() to properly render the comment and to
address a `make htmldocs` warning:
Documentation/filesystems/api-summary:14: include/linux/fs.h:3504:
WARNING: Bullet list ends without a blank line; unexpected unindent.
Fixes:
|