linux

mirror of https://github.com/torvalds/linux.git synced 2025-11-02 17:49:03 +02:00

Author	SHA1	Message	Date
Kent Overstreet	f946ce0be4	bcachefs: Make sure opts.read_only gets propagated back to VFS If we think we're read-only but the VFS doesn't, fun will ensue. And now that we know we have to be able to do this safely, just make nochanges imply ro. Reported-by: syzbot+a7d6ceaba099cc21dee4@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-06-11 23:21:30 -04:00
Kent Overstreet	09fb85ae56	bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm() We had a bug where bch2_evict_inode() incorrectly called bch2_inode_rm() - the journal clearly showed the inode was not unlinked. We've got checks that we use in recovery when cleaning up deleted inodes, lift them to bch2_inode_rm() as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-06-04 16:45:41 -04:00
Kent Overstreet	09b9c72bd4	bcachefs: bch_err_throw() Add a tracepoint for any time we return an error and unwind. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-06-02 12:16:35 -04:00
Kent Overstreet	18dad454cd	bcachefs: Replace rcu_read_lock() with guards The new guard(), scoped_guard() allow for more natural code. Some of the uses with creative flow control have been left. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-06-01 00:03:12 -04:00
Kent Overstreet	f402d9710b	bcachefs: bch2_readdir() now calls str_hash_check_key() More self healing code: readdir will now notice if there are dirents hashed incorrectly, and it'll repair them if errors=fix_safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-31 22:03:17 -04:00
Kent Overstreet	9e2c3c2ed4	bcachefs: Fix lost rebalance wakeups Fix a missing wakeup in 'bcachefs set-file-option' -> xattr option update -> inode_write this was missing because the wakeup needs to happen after transaction commit. Also, add a 'kick' counter, to make sure we don't miss a wakeup that occured right after we finished checking the rebalance_work btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-27 00:02:44 -04:00
Kent Overstreet	011d644b76	bcachefs: subvol_inum_eq() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-21 20:15:08 -04:00
Kent Overstreet	a349868b5e	bcachefs: bch2_fs_open() now takes a darray Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-21 20:14:37 -04:00
Kent Overstreet	9fa4a8a3bd	bcachefs: for_each_online_member_rcu() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-21 20:14:26 -04:00
Kent Overstreet	83ecd1b122	bcachefs: Use drop_locks_do() in bch2_inode_hash_find() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-21 20:14:16 -04:00
Kent Overstreet	c02e5b5728	bcachefs: Single device mode Single device filesystems are now identified by the block device name, not the UUID - and single device filesystems with the same UUID can be mounted simultaneously, without any special options. This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which indicates whether a filesystem has ever been multi device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-21 20:14:15 -04:00
Kent Overstreet	8d5ac187da	bcachefs: Fix casefold opt via xattr interface Changing the casefold option requires extra checks/work - factor out a helper from bch2_fileattr_set() for the xattr code to use. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-21 20:13:09 -04:00
Kent Overstreet	a12cb6f758	bcachefs: Fix accidental O(n^2) in fiemap Since bch2_seek_pagecache_data() searches for dirty data, we only want to call it for holes in the extents btree - otherwise we have an accidental O(n^2), as we repeatedly search the same range. Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	9c61856099	bcachefs: Call bch2_fs_start before getting vfs superblock This reverts `1fdbe0b184` bcachefs: Make sure c->vfs_sb is set before starting fs switched up bch2_fs_get_tree() so that we got a superblock before calling bch2_fs_start, so that c->vfs_sb would always be initialized while the filesystem was active. This turned out not to be necessary, because blk_holder_ops were implemented using our own locking, not vfs locking. And this had the side effect of creating a super_block and doing our full recovery (including potentially fsck) before setting SB_BORN, which causes things like sync calls to hang until our recovery is finished. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-05 16:06:35 -04:00
Kent Overstreet	c83311c5b9	bcachefs: Use generic_set_sb_d_ops for standard casefolding d_ops Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	a2f546330e	bcachefs: Fix losing return code in next_fiemap_extent() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	d1b0f9aa73	bcachefs: Rework fiemap transaction restart handling Restart handling in the previous patch was incorrect, so: move btree operations into a separate helper, and run it with a lockrestart_do(). Additionally, clarify whether pagecache or the btree takes precedence. Right now, the btree takes precedence: this is incorrect, but it's needed to pass fstests. Add a giant comment explaining why. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:10:29 -04:00
Brian Foster	b9b0494017	bcachefs: add fiemap delalloc extent detection bcachefs currently populates fiemap data from the extents btree. This works correctly when the fiemap sync flag is provided, but if not, it skips all delalloc extents that have not yet been flushed. This is because delalloc extents from buffered writes are first stored as reservation in the pagecache, and only become resident in the extents btree after writeback completes. Update the fiemap implementation to process holes between extents by scanning pagecache for data, via seek data/hole. If a valid data range is found over a hole in the extent btree, fake up an extent key and flag the extent as delalloc for reporting to userspace. Note that this does not necessarily change behavior for the case where there is dirty pagecache over already written extents, where when in COW mode, writeback will allocate new blocks for the underlying ranges. The existing behavior is consistent with btrfs and it is recommended to use the sync flag for the most up to date extent state from fiemap. Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:29 -04:00
Brian Foster	2d55a63709	bcachefs: refactor fiemap processing into extent helper and struct The bulk of the loop in bch2_fiemap() involves processing the current extent key from the iter, including following indirections and trimming the extent size and such. This patch makes a few changes to reduce the size of the loop and facilitate future changes to support delalloc extents. Define a new bch_fiemap_extent structure to wrap the bkey buffer that holds the extent key to report to userspace along with associated fiemap flags. Update bch2_fill_extent() to take the bch_fiemap_extent as a param instead of the individual fields. Finally, lift the bulk of the extent processing into a bch2_fiemap_extent() helper that takes the current key and formats the bch_fiemap_extent appropriately for the fill function. No functional changes intended by this patch. Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:29 -04:00
Brian Foster	d020a9fb11	bcachefs: track current fiemap offset in start variable Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:28 -04:00
Brian Foster	28d2d19ccc	bcachefs: drop duplicate fiemap sync flag FIEMAP_FLAG_SYNC handling was deliberately moved into core code in commit `45dd052e67` ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"), released in kernel v5.8. Update bcachefs accordingly. Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:28 -04:00
Kent Overstreet	7cb85324c4	bcachefs: unlink: casefold d_invalidate casefolding results in additional aliases on lookup for the non-casefolded names - these need invalidating on unlink. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	9cdde3c7aa	bcachefs: Fix casefold lookups Add casefolding to bch2_lookup_trans: During the delay between when casefolding was written and when it was merged, the main filesystem lookup path grew self healing - which meant it was no longer using bch2_dirent_lookup_trans(), where casefolding on lookups happens. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	b9e1f873d2	bcachefs: Casefold is now a regular opts.h option Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:00 -04:00
Kent Overstreet	7a4a86618e	bcachefs: Implement fileattr_(get\|set) inode_operations.fileattr_(get\|set) didn't exist when the various flag ioctls where implemented - but they do now, which means we can delete a bunch of ioctl code in favor of standard VFS level wrappers. Closes: https://lore.kernel.org/linux-bcachefs/7ltgrgqgfummyrlvw7hnfhnu42rfiamoq3lpcvrjnlyytldmzp@yazbhusnztqn/ Cc: Petr Vorel <pvorel@suse.cz> Cc: Andrea Cervesato <andrea.cervesato@suse.de> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-21 19:50:56 -04:00
Linus Torvalds	56770e24f6	bcachefs fixes for 6.15-rc1 More notable fixes: - Fix for striping behaviour on tiering filesystems where replicas exceeds durability on destination target - Fix a race in device removal where deleting alloc info races with the discard worker - Some small stack usage improvements: this is just enough for KMSAN builds to not blow the stack, more is queued up for 6.16. -----BEGIN PGP SIGNATURE----- iQIyBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfu7pMACgkQE6szbY3K bnY6rw/3W4dho57OPjOoHUbQ7A7IK1hI4SFvGtDDb3vX1RjF2r+RbdBupMAd0zGj T+5SzhYCQLGbyfBa6MW+iPqiHokZsG904+3mRogOf9cpz2Mup9ZOq/vV+Z7ndaF8 2i9wpQTb7GShkSaXkeTvQqnx3YAUxVcRB2ExraTXmv4wIxr1SYyJEeakmMBDasJB UanXXVHzrKo9WLiqWz0JSZCiuQW2v03P84zZo1d/GyMKlTxYDt5aAteos77lJBef 5CWVr4/HKKozt/vI2qHQ+3LJXktLjvb07zoENXwadmgQawYA3nQ+9jLT3Q0FKjXG bK28AHTtiXgWsYsbCs5sVh1+WLPdEj0UBBoFZGWo++TzaN2hXhoMsFTQfuddhaEh W63MWtelv4TGIVOEFk+ayHRgPL6ajhCsa1boHS9EKdosl2nl9Vk9Nq0i++hYZDGW KhWqENT9E5EpVCnZ6H4m1tsXprWavNqXnkOJzXW0T2F3t8+94zp1n6YXkwDdgLfs l+xTEEAL5J8lvlfSS6dW7QcMSMtMKbo3+qlerpH8J4zBZJBbb2nF1ggCtpYg6zFt 4Jgs5FPQLVqWsPXQr4CaSF2UIt3zMPnNIawL1cEpRBU1j35qo0e/kxIjEpS0Pnjt mX67gBlodY54/pwGGLfc/Vkw4xqh//dqTmYIdHkibdAEvKf0dg== =2TfM -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs Pull more bcachefs updates from Kent Overstreet: "More notable fixes: - Fix for striping behaviour on tiering filesystems where replicas exceeds durability on destination target - Fix a race in device removal where deleting alloc info races with the discard worker - Some small stack usage improvements: this is just enough for KMSAN builds to not blow the stack, more is queued up for 6.16" * tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs: bcachefs: Fix "journal stuck" during recovery bcachefs: backpointer_get_key: check for null from peek_slot() bcachefs: Fix null ptr deref in invalidate_one_bucket() bcachefs: Fix check_snapshot_exists() restart handling bcachefs: use nonblocking variant of print_string_as_lines in error path bcachefs: Fix scheduling while atomic from logging changes bcachefs: Add error handling for zlib_deflateInit2() bcachefs: add missing selection of XARRAY_MULTI bcachefs: bch_dev_usage_full bcachefs: Kill btree_iter.trans bcachefs: do_trace_key_cache_fill() bcachefs: Split up bch_dev.io_ref bcachefs: fix ref leak in btree_node_read_all_replicas bcachefs: Fix null ptr deref in bch2_write_endio() bcachefs: Fix field spanning write warning bcachefs: Fix striping behaviour	2025-04-03 15:39:47 -07:00
Kent Overstreet	9180ad2e16	bcachefs: Kill btree_iter.trans This was planned to be done ages ago, now finally completed; there are places where we have quite a few btree_trans objects on the stack, so this reduces stack usage somewhat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Kent Overstreet	dcffc3b1ae	bcachefs: Split up bch_dev.io_ref We now have separate per device io_refs for read and write access. This fixes a device removal bug where the discard workers were still running while we're removing alloc info for that device. It's also a bit of hardening; we no longer allow writes to devices that are read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Linus Torvalds	98fb679d19	bcachefs updates for 6.15, part 2 All bugfixes and logging improvements. Minor merge conflict, see: https://lore.kernel.org/linux-next/20250331092816.778a7c83@canb.auug.org.au/T/#u CI says the fs-next tree is good: https://evilpiepirate.org/~testdashboard/ci?user=fs-next&branch=master -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfqyxsACgkQE6szbY3K bnY4aQ/7BylMgHZsAG2OLRRtegCsuFZ5fZt148TObofSGTTDPcVKYQWcz249Hlao RZzv9nbqq2M7fJrUK5Xloc4DA0ICuWIh9n+uRf+5od7JgtygjJpXqMRz9HrGBtTo QZJE/wAzsa1A8xBORjWVki4koHT3YivaMW2zdgbHIHWTjDJso5Es7RW0/WZYv3lW cFLFEfOnBCEXckhEtK7TAPnJpHEPw+/d0bMFU/PIHbokUwTjxCR0bmRL/RecKUXa 5U1o1x7gAo1iPi3XPGLJVVxXWgjmxzQlF/3aXva+DYaeLgxPMqKxUlC6hkV4f6Oc 9lH/w1pEiCMcANbbp2E3Q91sDFRlafFCgvsKhEz79W5WoNq+vSrxLhLaynyuBT/K lfoiig6IFRTWJDYHu2L6YHFMmp8JOxgJSJ0+dcgyVRnaDJQeGgbuv1tEldonQLsg 9DT8iRJpVDomffwPUoVhujlvJOqUi8zFkxyMCgVWExFzC3ief2B5s3D4uLXcpApO nZfb01W0ElW7qBMQxjyD0Vy+wY8EryzTht9ZKJq5Id1T/LWc9Qi+jPaY86OBC9/w GJgW9OcYLFjYdsDokk5XkwOd/IAXz6fU+vHGtahFJPVfH4T8zzdBnxfPbiR2mXo8 4EfeNmRevZP/oK7/2l2cqIzY7tYBJBUK1gFyvz1+7bcuFwVI8rc= =Udka -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs Pull more bcachefs updates from Kent Overstreet: "All bugfixes and logging improvements" * tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs: (35 commits) bcachefs: fix bch2_write_point_to_text() units bcachefs: Log original key being moved in data updates bcachefs: BCH_JSET_ENTRY_log_bkey bcachefs: Reorder error messages that include journal debug bcachefs: Don't use designated initializers for disk_accounting_pos bcachefs: Silence errors after emergency shutdown bcachefs: fix units in rebalance_status bcachefs: bch2_ioctl_subvolume_destroy() fixes bcachefs: Clear fs_path_parent on subvolume unlink bcachefs: Change btree_insert_node() assertion to error bcachefs: Better printing of inconsistency errors bcachefs: bch2_count_fsck_err() bcachefs: Better helpers for inconsistency errors bcachefs: Consistent indentation of multiline fsck errors bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts() bcachefs: bch2_time_stats_init_no_pcpu() bcachefs: Fix bch2_fs_get_tree() error path bcachefs: fix logging in journal_entry_err_msg() bcachefs: add missing newline in bch2_trans_updates_to_text() bcachefs: print_string_as_lines: fix extra newline ...	2025-03-31 18:33:51 -07:00
Kent Overstreet	1ece53237e	bcachefs: Consistent indentation of multiline fsck errors Add the new helper printbuf_indent_add_nextline(), and use it in __bch2_fsck_err() to centralize setting the indentation of multiline fsck errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
Kent Overstreet	a7cdf2276e	bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts() To be used by the mount helper in userspace, where we still have options to be parsed by other layers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
Florian Albrechtskirchinger	7c4cb50e1a	bcachefs: Fix bch2_fs_get_tree() error path When a filesystem is mounted read-only, subsequent attempts to mount it as read-write fail with EBUSY. Previously, the error path in bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(), improperly freeing resources for a filesystem that was still actively mounted. This change modifies the error path to only call __bch2_fs_stop() if the superblock has no valid root dentry, ensuring resources are not cleaned up prematurely when the filesystem is in use. Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 13:51:09 -04:00
Linus Torvalds	4a4b30ea80	bcachefs updates for 6.15 On disk format is now soft frozen: no more required/automatic are anticipated before taking off the experimental label. Major changes/features since 6.14: - Scrub - Blocksize greater than page size support - A number of "rebalance spinning and doing no work" issues have been fixed; we now check if the write allocation will succeed in bch2_data_update_init(), before kicking off the read. There's still more work to do in this area. Later we may want to add another bitset btree, like rebalance_work, to track "extents that rebalance was requested to move but couldn't", e.g. due to destination target having insufficient online devices. - We can now support scaling well into the petabyte range: latest bcachefs-tools will pick an appropriate bucket size at format time to ensure fsck can run in available memory (e.g. a server with 256GB of ram and 100PB of storage would want 16MB buckets). On disk format changes: - 1.21: cached backpointers (scalability improvement) Cached replicas now get backpointers, which means we no longer rely on incrementing bucket generation numbers to invalidate cached data: this lets us get rid of the bucket generation number garbage collection, which had to periodically rescan all extents to recompute bucket oldest_gen. Bucket generation numbers are now only used as a consistency check, but they're quite useful for that. - 1.22: stripe backpointers Stripes now have backpointers: erasure coded stripes have their own checksums, separate from the checksums for the extents they contain (and stripe checksums also cover the parity blocks). This is required for implementing scrub for stripes. - 1.23: stripe lru (scalability improvement) Persistent lru for stripes, ordered by "number of empty blocks". This is used by the stripe creation path, which depending on free space may create a new stripe out of a partially empty existing stripe instead of starting a brand new stripe. This replaces an in-memory heap, and means we no longer have to read in the stripes btree at startup. - 1.24: casefolding Case insensitive directory support, courtesy of Valve. This is an incompatible feature, to enable mount with -o version_upgrade=incompatible - 1.25: extent_flags Another incompatible feature requiring explicit opt-in to enable. This adds a flags entry to extents, and a flag bit that marks extents as poisoned. A poisoned extent is an extent that was unreadable due to checksum errors. We can't move such extents without giving them a new checksum, and we may have to move them (for e.g. copygc or device evacuate). We also don't want to delete them: in the future we'll have an API that lets userspace ignore checksum errors and attempt to deal with simple bitrot itself. Marking them as poisoned lets us continue to return the correct error to userspace on normal read calls. Other changes/features: - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top' command, which shows a live view of all internal filesystem counters. - Improved journal pipelining: we can now have 16 journal writes in flight concurrently, up from 4. We're logging significantly more to the journal than we used to with all the recent disk accounting changes and additions, so some users should see a performance increase on some workloads. - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to devices marked as failed. Now we will attempt to read from them, but only if we have no better options. - New option, write_error_timeout: devices will be kicked out of the filesystem if all writes have been failing for x number of seconds. We now also kick devices out when notified by blk_holder_ops that they've gone offline. - Device option handling improvements: the discard option should now be working as expected (additionally, in -tools, all device options that can be set at format time can now be set at device add time, i.e. data_allowed, state). - We now try harder to read data after a checksum error: we'll do additional retries if necessary to a device after after it gave us data with a checksum error. - More self healing work: the full inode <-> dirent consistency checks that are currently run by fsck are now also run every time we do a lookup, meaning we'll be able to correct errors at runtime. Runtime self healing will be flipped on after the new changes have seen more testing, currently they're just checking for consistency. - KMSAN fixes: our KMSAN builds should be nearly clean now, which will put a massive dent in the syzbot dashboard. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3 P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9 +OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa 1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg= =k3cV -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs Pull bcachefs updates from Kent Overstreet: "On disk format is now soft frozen: no more required/automatic are anticipated before taking off the experimental label. Major changes/features since 6.14: - Scrub - Blocksize greater than page size support - A number of "rebalance spinning and doing no work" issues have been fixed; we now check if the write allocation will succeed in bch2_data_update_init(), before kicking off the read. There's still more work to do in this area. Later we may want to add another bitset btree, like rebalance_work, to track "extents that rebalance was requested to move but couldn't", e.g. due to destination target having insufficient online devices. - We can now support scaling well into the petabyte range: latest bcachefs-tools will pick an appropriate bucket size at format time to ensure fsck can run in available memory (e.g. a server with 256GB of ram and 100PB of storage would want 16MB buckets). On disk format changes: - 1.21: cached backpointers (scalability improvement) Cached replicas now get backpointers, which means we no longer rely on incrementing bucket generation numbers to invalidate cached data: this lets us get rid of the bucket generation number garbage collection, which had to periodically rescan all extents to recompute bucket oldest_gen. Bucket generation numbers are now only used as a consistency check, but they're quite useful for that. - 1.22: stripe backpointers Stripes now have backpointers: erasure coded stripes have their own checksums, separate from the checksums for the extents they contain (and stripe checksums also cover the parity blocks). This is required for implementing scrub for stripes. - 1.23: stripe lru (scalability improvement) Persistent lru for stripes, ordered by "number of empty blocks". This is used by the stripe creation path, which depending on free space may create a new stripe out of a partially empty existing stripe instead of starting a brand new stripe. This replaces an in-memory heap, and means we no longer have to read in the stripes btree at startup. - 1.24: casefolding Case insensitive directory support, courtesy of Valve. This is an incompatible feature, to enable mount with -o version_upgrade=incompatible - 1.25: extent_flags Another incompatible feature requiring explicit opt-in to enable. This adds a flags entry to extents, and a flag bit that marks extents as poisoned. A poisoned extent is an extent that was unreadable due to checksum errors. We can't move such extents without giving them a new checksum, and we may have to move them (for e.g. copygc or device evacuate). We also don't want to delete them: in the future we'll have an API that lets userspace ignore checksum errors and attempt to deal with simple bitrot itself. Marking them as poisoned lets us continue to return the correct error to userspace on normal read calls. Other changes/features: - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top' command, which shows a live view of all internal filesystem counters. - Improved journal pipelining: we can now have 16 journal writes in flight concurrently, up from 4. We're logging significantly more to the journal than we used to with all the recent disk accounting changes and additions, so some users should see a performance increase on some workloads. - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to devices marked as failed. Now we will attempt to read from them, but only if we have no better options. - New option, write_error_timeout: devices will be kicked out of the filesystem if all writes have been failing for x number of seconds. We now also kick devices out when notified by blk_holder_ops that they've gone offline. - Device option handling improvements: the discard option should now be working as expected (additionally, in -tools, all device options that can be set at format time can now be set at device add time, i.e. data_allowed, state). - We now try harder to read data after a checksum error: we'll do additional retries if necessary to a device after after it gave us data with a checksum error. - More self healing work: the full inode <-> dirent consistency checks that are currently run by fsck are now also run every time we do a lookup, meaning we'll be able to correct errors at runtime. Runtime self healing will be flipped on after the new changes have seen more testing, currently they're just checking for consistency. - KMSAN fixes: our KMSAN builds should be nearly clean now, which will put a massive dent in the syzbot dashboard" * tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits) bcachefs: Kill unnecessary bch2_dev_usage_read() bcachefs: btree node write errors now print btree node bcachefs: Fix race in print_chain() bcachefs: btree_trans_restart_foreign_task() bcachefs: bch2_disk_accounting_mod2() bcachefs: zero init journal bios bcachefs: Eliminate padding in move_bucket_key bcachefs: Fix a KMSAN splat in btree_update_nodes_written() bcachefs: kmsan asserts bcachefs: Fix kmsan warnings in bch2_extent_crc_pack() bcachefs: Disable asm memcpys when kmsan enabled bcachefs: Handle backpointers with unknown data types bcachefs: Count BCH_DATA_parity backpointers correctly bcachefs: Run bch2_check_dirent_target() at lookup time bcachefs: Refactor bch2_check_dirent_target() bcachefs: Move bch2_check_dirent_target() to namei.c bcachefs: fs-common.c -> namei.c bcachefs: EIO cleanup bcachefs: bch2_write_prep_encoded_data() now returns errcode bcachefs: Simplify bch2_write_op_error() ...	2025-03-27 13:20:07 -07:00
Linus Torvalds	e41170cc5e	vfs-6.15-rc1.pagesize -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ jKpRSRct2eTbyxdYiGydHQGm5F5sLg4= =0FyQ -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset()	2025-03-24 12:01:29 -07:00
Kent Overstreet	04e90891be	bcachefs: Run bch2_check_dirent_target() at lookup time More on the "full online self healing" project: We now run most of the dirent <-> inode consistency checks, with repair code, at runtime - the exact same check and repair code that fsck runs. This will allow us to repair the "dirent points to inode that does not point back" inconsistencies that have been popping up at runtime. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	4fcd4de0a6	bcachefs: fs-common.c -> namei.c name <-> inode, code for managing the relationships between inodes and dirents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	80be08cdb5	bcachefs: Filesystem discard option now propagates to devices the discard option is special, because it's both a filesystem and a device option. When set at the filesytsem level, it's supposed to propagate to (if set persistently via sysfs) or override (if non persistently as a mount option) the devices - that now works correctly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	8dc4514d58	bcachefs: Kill bch2_remount() Single caller, so inline it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:03:16 -04:00
Kent Overstreet	1fdbe0b184	bcachefs: Make sure c->vfs_sb is set before starting fs This is necessary for the new blk_holder_ops, which want the vfs super_block available for synchronization. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Joshua Ashton	d37c14ac6f	bcachefs: bcachefs_metadata_version_casefolding This patch implements support for case-insensitive file name lookups in bcachefs. The implementation uses the same UTF-8 lowering and normalization that ext4 and f2fs is using. More information is provided in Documentation/bcachefs/casefolding.rst Compatibility notes: This uses the new versioning scheme for incompatible features where an incompatible feature is tied to a version number: the superblock says "we may use incompat features up to x" and "incompat features up to x are in use", disallowing mounting by previous versions. Additionally, and old style incompat feature bit is used, so that kernels without utf8 casefolding support know if casefolding specifically is in use and they're allowed to mount. Signed-off-by: Joshua Ashton <joshua@froggi.es> Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	be212d86b1	bcachefs: bs > ps support bcachefs removed most PAGE_SIZE references long ago, so this is easy; only readpage_bio_extend() has to be tweaked to respect the minimum order. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 19:45:57 -04:00
Luis Chamberlain	a64e5a5960	bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() The commit titled "block/bdev: lift block size restrictions to 64k" lifted the block layer's max supported block size to 64k inside the helper blk_validate_block_size() now that we support large folios. However in lifting the block size we also removed the silly use cases many filesystems have to use sb_set_blocksize() to verify that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was used to check the block size <= PAGE_SIZE since historically we've always supported userspace to create for example 64k block size filesystems even on 4k page size systems, but what we didn't allow was mounting them. Older filesystems have been using the check with sb_set_blocksize() for years. While, we could argue that such checks should be filesystem specific, there are much more users of sb_set_blocksize() than LBS enabled filesystem on upstream, so just do the easier thing and bring back the PAGE_SIZE check for sb_set_blocksize() users and only skip it for LBS enabled filesystems. This will ensure that tests such as generic/466 when run in a loop against say, ext4, won't try to try to actually mount a filesystem with a block size larger than your filesystem supports given your PAGE_SIZE and in the worst case crash. Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-07 12:56:05 +01:00
NeilBrown	88d5baf690	Change inode_operations.mkdir to return struct dentry * Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 20:00:17 +01:00
Hongbo Li	e614a6c52d	bcachefs: make directory i_size meaningful The isize of directory is 0 in bcachefs if the directory is empty. With more child dirents created, its size ought to change. Many other filesystems changed as that (ie. xfs and btrfs). And many of them changed as the size of child dirent name. Although the directory size may not seem to convey much, we can still give it some meaning. The formula of dentry size as follow: occupied_size = 40 + ALIGN(9 + namelen, 8) Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-13 14:58:38 -05:00
Kent Overstreet	ce70157112	bcachefs: kill flags param to bch2_subvolume_get() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	525be09e63	bcachefs: Use separate rhltable for bch2_inode_or_descendents_is_open() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	7579c85d9c	bcachefs: Don't delete reflink pointers to missing indirect extents To avoid tragic loss in the event of transient errors (i.e., a btree node topology error that was later corrected by btree node scan), we can't delete reflink pointers to correct errors. This adds a new error bit to bch_reflink_p, indicating that it is known to point to a missing indirect extent, and the error has already been reported. Indirect extent lookups now use bch2_lookup_indirect_extent(), which on error reports it as a fsck_err() and sets the error bit, and clears it if necessary on succesful lookup. This also gets rid of the bch2_inconsistent_error() call in __bch2_read_indirect_extent, and in the reflink_p trigger: part of the online self healing project. An on disk format change isn't necessary here: setting the error bit will be interpreted by older versions as pointing to a different index, which will also be missing - which is fine. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Youling Tang	924e81c530	bcachefs: Remove redundant initialization in bch2_vfs_inode_init() `inode->v.i_ino` has been initialized to `inum.inum`. If `inum.inum` and `bi->bi_inum` are not equal, BUG_ON() is triggered in bch2_inode_update_after_write(). Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	dc003efbc7	bcachefs: Add support for FS_IOC_GETFSSYSFSPATH [TEST]: ``` $ cat ioctl_getsysfspath.c #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <unistd.h> int main(int argc, char *argv[]) { int fd; struct fs_sysfs_path sysfs_path = {}; if (argc != 2) { fprintf(stderr, "Usage: %s <path_to_file_or_directory>\n", argv[0]); exit(EXIT_FAILURE); } fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("open"); exit(EXIT_FAILURE); } if (ioctl(fd, FS_IOC_GETFSSYSFSPATH, &sysfs_path) == -1) { perror("ioctl FS_IOC_GETFSSYSFSPATH"); close(fd); exit(EXIT_FAILURE); } printf("FS_IOC_GETFSSYSFSPATH: %s\n", sysfs_path.name); close(fd); return 0; } $ gcc ioctl_getsysfspath.c $ sudo bcachefs format /dev/sda $ sudo mount.bcachefs /dev/sda /mnt $ sudo ./a.out /mnt FS_IOC_GETFSSYSFSPATH: bcachefs/c380b4ab-fbb6-41d2-b805-7a89cae9cadb ``` Original patch link: [1]: https://lore.kernel.org/all/20240207025624.1019754-8-kent.overstreet@linux.dev/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Youling Tang <youling.tang@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	4f1a6b0ab4	bcachefs: Add support for FS_IOC_GETFSUUID Use super_set_uuid() to set `sb->s_uuid_len` to avoid returning `-ENOTTY` with sb->s_uuid_len being 0. Original patch link: [1]: https://lore.kernel.org/all/20240207025624.1019754-2-kent.overstreet@linux.dev/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00

1 2 3 4 5 ...

279 commits