Commit graph

14188 commits

Author SHA1 Message Date
Qu Wenruo
4c14d5c855 btrfs: subpage: make btrfs_is_subpage() check against a folio
To support large data folios, we can no longer assume every filemap
folio is page sized.

So btrfs_is_subpage() check must be done against a folio.

Thankfully for metadata folios, we have the full control and ensure a
large folio will not be large than nodesize, so
btrfs_meta_is_subpage() doesn't need this change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:52 +01:00
Qu Wenruo
19e60b2a95 btrfs: add extra warning if delayed iput is added when it's not allowed
Since I have triggered the ASSERT() on the delayed iput too many times,
now is the time to add some extra debug warnings for delayed iput.

All delayed iputs should be queued after all ordered extents finish
their IO and all involved workqueues are flushed.

Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there
should be no more delayed puts added.

So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above
mentioned timing.  And all btrfs_add_delayed_iput() will check that flag
and give a WARN_ON_ONCE().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:52 +01:00
Sun YangKai
0aaaf10ae9 btrfs: avoid redundant path slot assignment in btrfs_search_forward()
Move path slot assignment before the condition check to prevent
duplicate assignment. Previously, the slot was set both inside and after
the 'slot >= nritems' block with no change in its value, which is
unnecessary.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:52 +01:00
Sun YangKai
10de00c7d4 btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()
The 'found_key' variable was only used to temporarily store the found key
before copying it to 'min_key' at the end of the function when returning
success.

Eliminate the 'found_key' variable, and directly store the key into
'min_key' at the exact loop exit points where ret=0 is set, maintaining
identical functionality.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:52 +01:00
Sun YangKai
140ac522de btrfs: simplify the return value handling in search_ioctl()
Move the assignment of -EFAULT to within the error condition check
in fault_in_subpage_writeable(). The previous placement outside the
condition could lead to the error value being overwritten by subsequent
assignments, cause unnecessary assignments.

Simplify loop exit logic by removing redundant goto.
The original code used 'goto err' to bypass post-loop processing after
handling errors from btrfs_search_forward(). However, the loop's
termination naturally falls through to the post-loop section, which
already handles 'ret' values. Replacing 'goto err' with 'break'
eliminates redundant control flow, consolidates error handling, and
makes the loop's exit conditions explicit.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Filipe Manana
009ca35848 btrfs: tests: fix chunk map leak after failure to add it to the tree
If we fail to add the chunk map to the fs mapping tree we exit
test_rmap_block() without freeing the chunk map. Fix this by adding a
call to btrfs_free_chunk_map() before exiting the test function if the
call to btrfs_add_chunk_map() failed.

Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Boris Burkov
0497dfba98 btrfs: codify pattern for adding block_group to bg_list
Similar to mark_bg_unused() and mark_bg_to_reclaim(), we have a few
places that use bg_list with refcounting, mostly for retrying failures
to reclaim/delete unused.

These have custom logic for handling locking and refcounting the bg_list
properly, but they actually all want to do the same thing, so pull that
logic out into a helper. Unfortunately, mark_bg_unused() does still need
the NEW flag to avoid prematurely marking stuff unused (even if refcount
is fine, we don't want to mess with bg creation), so it cannot use the
new helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Boris Burkov
7cbce3cb4c btrfs: explicitly ref count block_group on new_bgs list
All other users of the bg_list list_head increment the refcount when
adding to a list and decrement it when deleting from the list. Just for
the sake of uniformity and to try to avoid refcounting bugs, do it for
this list as well.

This does not fix any known ref-counting bug, as the reference belongs
to a single task (trans_handle is not shared and this represents
trans_handle->new_bgs linkage) and will not lose its original refcount
while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against
ref-counting errors "moving" the block group to the unused list without
taking a ref.

With that said, I still believe it is simpler to just hold the extra ref
count for this list user as well.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Boris Burkov
895c6721d3 btrfs: make btrfs_discard_workfn() block_group ref explicit
Currently, the async discard machinery owns a ref to the block_group
when the block_group is queued on a discard list. However, to handle
races with discard cancellation and the discard workfn, we have a
specific logic to detect that the block_group is *currently* running in
the workfn, to protect the workfn's usage amidst cancellation.

As far as I can tell, this doesn't have any overt bugs (though
finish_discard_pass() and remove_from_discard_list() racing can have a
surprising outcome for the caller of remove_from_discard_list() in that
it is again added at the end).

But it is needlessly complicated to rely on locking and the nullity of
discard_ctl->block_group. Simplify this significantly by just taking a
refcount while we are in the workfn and unconditionally drop it in both
the remove and workfn paths, regardless of if they race.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Boris Burkov
7511e29cf1 btrfs: harden block_group::bg_list against list_del() races
As far as I can tell, these calls of list_del_init() on bg_list cannot
run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(),
as they are in transaction error paths and situations where the block
group is readonly.

However, if there is any chance at all of racing with mark_bg_unused(),
or a different future user of bg_list, better to be safe than sorry.

Otherwise we risk the following interleaving (bg_list refcount in parens)

T1 (some random op)                       T2 (btrfs_mark_bg_unused)
                                        !list_empty(&bg->bg_list); (1)
list_del_init(&bg->bg_list); (1)
                                        list_move_tail (1)
btrfs_put_block_group (0)
                                        btrfs_delete_unused_bgs
                                             bg = list_first_entry
                                             list_del_init(&bg->bg_list);
                                             btrfs_put_block_group(bg); (-1)

Ultimately, this results in a broken ref count that hits zero one deref
early and the real final deref underflows the refcount, resulting in a WARNING.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Boris Burkov
2d8e5168d4 btrfs: fix block group refcount race in btrfs_create_pending_block_groups()
Block group creation is done in two phases, which results in a slightly
unintuitive property: a block group can be allocated/deallocated from
after btrfs_make_block_group() adds it to the space_info with
btrfs_add_bg_to_space_info(), but before creation is completely completed
in btrfs_create_pending_block_groups(). As a result, it is possible for a
block group to go unused and have 'btrfs_mark_bg_unused' called on it
concurrently with 'btrfs_create_pending_block_groups'. This causes a
number of issues, which were fixed with the block group flag
'BLOCK_GROUP_FLAG_NEW'.

However, this fix is not quite complete. Since it does not use the
unused_bg_lock, it is possible for the following race to occur:

btrfs_create_pending_block_groups            btrfs_mark_bg_unused
                                           if list_empty // false
        list_del_init
        clear_bit
                                           else if (test_bit) // true
                                                list_move_tail

And we get into the exact same broken ref count and invalid new_bgs
state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to
prevent.

The broken refcount aspect will result in a warning like:

  [1272.943527] refcount_t: underflow; use-after-free.
  [1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
  [1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
  [1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
  [1272.946368] Tainted: [W]=WARN
  [1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
  [1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
  [1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
  [1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
  [1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
  [1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
  [1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
  [1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
  [1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
  [1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
  [1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
  [1272.954474] Call Trace:
  [1272.954655]  <TASK>
  [1272.954812]  ? refcount_warn_saturate+0xba/0x110
  [1272.955173]  ? __warn.cold+0x93/0xd7
  [1272.955487]  ? refcount_warn_saturate+0xba/0x110
  [1272.955816]  ? report_bug+0xe7/0x120
  [1272.956103]  ? handle_bug+0x53/0x90
  [1272.956424]  ? exc_invalid_op+0x13/0x60
  [1272.956700]  ? asm_exc_invalid_op+0x16/0x20
  [1272.957011]  ? refcount_warn_saturate+0xba/0x110
  [1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
  [1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
  [1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
  [1272.958729]  process_one_work+0x130/0x290
  [1272.959026]  worker_thread+0x2ea/0x420
  [1272.959335]  ? __pfx_worker_thread+0x10/0x10
  [1272.959644]  kthread+0xd7/0x1c0
  [1272.959872]  ? __pfx_kthread+0x10/0x10
  [1272.960172]  ret_from_fork+0x30/0x50
  [1272.960474]  ? __pfx_kthread+0x10/0x10
  [1272.960745]  ret_from_fork_asm+0x1a/0x30
  [1272.961035]  </TASK>
  [1272.961238] ---[ end trace 0000000000000000 ]---

Though we have seen them in the async discard workfn as well. It is
most likely to happen after a relocation finishes which cancels discard,
tears down the block group, etc.

Fix this fully by taking the lock around the list_del_init + clear_bit
so that the two are done atomically.

Fixes: 0657b20c5a ("btrfs: fix use-after-free of new block group that became unused")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Filipe Manana
f378b4c3e3 btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()
The fs_info can be taken from the given block group, so there is no need
to pass it as an argument. Also rename the local variable from 'info' to
'fs_info' which is more widely used, more clear and to be more consistent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Filipe Manana
20faaab2c3 btrfs: remove unnecessary fs_info argument from delete_block_group_cache()
The fs_info can be taken from the given block group, so there is no need
to pass it as an argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
f75a043737 btrfs: remove unnecessary fs_info argument from create_reloc_inode()
The fs_info can be taken from the given block group, so there is no need
to pass it as an argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
92be661a57 btrfs: make btrfs_iget_path() return a btrfs inode instead
It's an internal function and btrfs_iget() is now returning a btrfs inode,
so change btrfs_iget_path() to also return a btrfs inode instead of a VFS
inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
b204e5c7d4 btrfs: make btrfs_iget() return a btrfs inode instead
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
14d063ec85 btrfs: pass a btrfs_inode to fixup_inode_link_count()
fixup_inode_link_count() mostly wants to use a btrfs_inode, plus it's an
internal function so it should take btrfs_inode instead of a VFS inode.
Change the argument type to btrfs_inode, avoiding several BTRFS_I() calls
too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
b4c50cbb01 btrfs: return a btrfs_inode from read_one_inode()
All callers of read_one_inode() are mostly interested in the btrfs_inode
structure rather than the VFS inode, so make read_one_inode() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
a488d8ac2c btrfs: return a btrfs_inode from btrfs_iget_logging()
All callers of btrfs_iget_logging() are interested in the btrfs_inode
structure rather than the VFS inode, so make btrfs_iget_logging() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Mark Harmstone
7ef3cbf17d btrfs: avoid linker error in btrfs_find_create_tree_block()
The inline function btrfs_is_testing() is hardcoded to return 0 if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on
the compiler optimizing out the call to alloc_test_extent_buffer() in
btrfs_find_create_tree_block(), as it's not been defined (it's behind an
 #ifdef).

Add a stub version of alloc_test_extent_buffer() to avoid linker errors
on non-standard optimization levels. This problem was seen on GCC 14
with -O0 and is helps to see symbols that would be otherwise optimized
out.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Qu Wenruo
df94a342ef btrfs: run btrfs_error_commit_super() early
[BUG]
Even after all the error fixes related the
"ASSERT(list_empty(&fs_info->delayed_iputs));" in close_ctree(), I can
still hit it reliably with my experimental 2K block size.

[CAUSE]
In my case, all the error is triggered after the fs is already in error
status.

I find the following call trace to be the cause of race:

           Main thread                       |     endio_write_workers
---------------------------------------------+---------------------------
close_ctree()                                |
|- btrfs_error_commit_super()                |
|  |- btrfs_cleanup_transaction()            |
|  |  |- btrfs_destroy_all_ordered_extents() |
|  |     |- btrfs_wait_ordered_roots()       |
|  |- btrfs_run_delayed_iputs()              |
|                                            | btrfs_finish_ordered_io()
|                                            | |- btrfs_put_ordered_extent()
|                                            |    |- btrfs_add_delayed_iput()
|- ASSERT(list_empty(delayed_iputs))         |
   !!! Triggered !!!

The root cause is that, btrfs_wait_ordered_roots() only wait for
ordered extents to finish their IOs, not to wait for them to finish and
removed.

[FIX]
Since btrfs_error_commit_super() will flush and wait for all ordered
extents, it should be executed early, before we start flushing the
workqueues.

And since btrfs_error_commit_super() now runs early, there is no need to
run btrfs_run_delayed_iputs() inside it, so just remove the
btrfs_run_delayed_iputs() call from btrfs_error_commit_super().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Daniel Vacek
fc5c0c5825 btrfs: defrag: extend ioctl to accept compression levels
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
08f340767d btrfs: send: simplify return logic from send_encoded_extent()
The 'out' label is pointless as we don't have anything to cleanup anymore
(we used to have an inode to iput), so remove it and make error paths
directly return an error.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
0c8337c220 btrfs: send: remove unnecessary inode lookup at send_encoded_inline_extent()
We are doing a lookup of the inode but we don't use it at all. So just
remove this pointless lookup.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
9024b744e7 btrfs: avoid unnecessary bio dereference at run_one_async_done()
We have dereferenced the async_submit_bio structure and extracted the bio
pointer into a local variable, so there's no need to dereference it again
when calling btrfs_bio_end_io(). Just use "bio->bi_status" instead of the
longer expression "async->bbio->bio.bi_status".

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
cda76788f8 btrfs: fix non-empty delayed iputs list on unmount due to async workers
At close_ctree() after we have ran delayed iputs either explicitly through
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.

We have (another) race where this assertion might fail because we have
queued an async write into the fs_info->workers workqueue. Here's how it
happens:

1) We are submitting a data bio for an inode that is not the data
   relocation inode, so we call btrfs_wq_submit_bio();

2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue
   that will run run_one_async_done();

3) We enter close_ctree(), flush several work queues except
   fs_info->workers, explicitly run delayed iputs with a call to
   btrfs_run_delayed_iputs() and then again shortly after by calling
   btrfs_commit_super() or btrfs_error_commit_super(), which also run
   delayed iputs;

4) run_one_async_done() is executed in the work queue, and because there
   was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(),
   which drops the final reference on the associated ordered extent by
   calling btrfs_put_ordered_extent() - and that adds a delayed iput for
   the inode;

5) At close_ctree() we find that after stopping the cleaner and
   transaction kthreads the delayed iputs list is not empty, failing the
   following assertion:

      ASSERT(list_empty(&fs_info->delayed_iputs));

Fix this by flushing the fs_info->workers workqueue before running delayed
iputs at close_ctree().

David reported this when running generic/648, which exercises IO error
paths by using the DM error table.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Qu Wenruo
7ca3e84980 btrfs: reject out-of-band dirty folios during writeback
[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band folio means the folio is marked dirty but without
notifying the filesystem.

This can lead to various problems, not limited to:

- No folio::private to track per block status

- No proper space reserved for such a dirty folio

[HISTORY IN BTRFS]
This used to be a problem related to get_user_page(), but with the
introduction of pin_user_pages*(), we should no longer hit such
case anymore.

In btrfs, we have a long history of catching such out-of-band dirty
folios by:

- Mark the folio ordered during delayed allocation

- Check the folio ordered flag during writeback
  If the folio has no ordered flag, it means it doesn't go through
  delayed allocation, thus it's definitely an out-of-band
  one.

  If we got one, we go through COW fixup, which will re-dirty the folio
  with proper handling in another workqueue.

[PROBLEMS OF COW-FIXUP]
Such workaround is a blockage for us to migrate to iomap (it requires
extra flags to trace if a folio is dirtied by the fs or not) and I'd
argue it's not data checksum safe, since if a folio can be marked dirty
without informing the fs, the content can also change at any time.

But with the introduction of pin_user_pages*() during v5.8 merge
window, such out-of-band dirty folio such be treated as a bug.
Ext4 has treated such case by warning and erroring out even before
pin_user_pages*().

Furthermore, there are already proofs that such folio ordered flag
tracking can be screwed up by incorrect error handling, check the commit
messages of the following commits:

 06f3642847 ("btrfs: do proper folio cleanup when cow_file_range() failed")
 c2b47df81c ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed")

[FIXES]
Unlike btrfs, ext4 and xfs (iomap) never bother handling such
out-of-band dirty folios.

- Ext4 just warns and errors out
- Iomap always follows the folio/block dirty flags

And there is nothing really COW specific, xfs also supports COW too.

Here we take one step towards ext4 by doing warning and erroring out.
But since the cow fixup thing is introduced from the beginning, we keep
the old behavior for non-experimental builds, and only do the new warning
for experimental builds before we're 100% sure and remove cow fixup.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Dan Carpenter
c01b7114b8 btrfs: return a literal instead of a variable in btrfs_init_dev_replace()
This is just a small clean up, it doesn't change how the code works.
Originally this code had a goto so we needed to set "ret = 0;" but now
it returns directly and so we can simplify it a bit by doing a
"return 0;" and removing the assignment.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Filipe Manana
477a7a9c1f btrfs: move btrfs_cleanup_bio() code into its single caller
The btrfs_cleanup_bio() helper is trivial and has a single caller, there's
no point in having a dedicated helper function. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Filipe Manana
530b601b91 btrfs: move __btrfs_bio_end_io() code into its single caller
The __btrfs_bio_end_io() helper is trivial and has a single caller, so
there's no point in having a dedicated helper function. Further the double
underscore prefix in the name is discouraged. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Filipe Manana
4c782247b8 btrfs: fix non-empty delayed iputs list on unmount due to compressed write workers
At close_ctree() after we have ran delayed iputs either through explicitly
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.

When we have compressed writes this assertion may fail because delayed
iputs may have been added to the list after we last ran delayed iputs.
This happens like this:

1) We have a compressed write bio executing;

2) We enter close_ctree() and flush the fs_info->endio_write_workers
   queue which is the queue used for running ordered extent completion;

3) The compressed write bio finishes and enters
   btrfs_finish_compressed_write_work(), where it calls
   btrfs_finish_ordered_extent() which in turn calls
   btrfs_queue_ordered_fn(), which queues a work item in the
   fs_info->endio_write_workers queue that we have flushed before;

4) At close_ctree() we proceed, run all existing delayed iputs and
   call btrfs_commit_super() (which also runs delayed iputs), but before
   we run the following assertion below:

      ASSERT(list_empty(&fs_info->delayed_iputs))

   A delayed iput is added by the step below...

5) The ordered extent completion job queued in step 3 runs and results in
   creating a delayed iput when dropping the last reference of the ordered
   extent (a call to btrfs_put_ordered_extent() made from
   btrfs_finish_one_ordered());

6) At this point the delayed iputs list is not empty, so the assertion at
   close_ctree() fails.

Fix this by flushing the fs_info->compressed_write_workers queue at
close_ctree() before flushing the fs_info->endio_write_workers queue,
respecting the queue dependency as the later is responsible for the
execution of ordered extent completion.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
f6e8a43611 btrfs: unify inode variable naming
Rename binode to inode in local variables or parameters so it's more
unified with the rest of the code.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
f272c004d2 btrfs: pass struct to btrfs_ioctl_subvol_getflags()
Pass a struct btrfs_inode to btrfs_ioctl_subvol_getflags() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
f6c2ccfc3b btrfs: simplify local variables in btrfs_ioctl_resize()
Remove some redundant variables and assignments, move variable
declarations to their closest scope.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
4f27a69394 btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's
an internal interface.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
68dc1cb231 btrfs: pass root pointers to search tree ioctl helpers
The search tree ioctl use btrfs_root so change that from btrfs_inode
pointers so we don't have to do the conversion.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
4e043cd196 btrfs: pass btrfs_root pointers to send ioctl parameters
The ioctl switch btrfs_ioctl() provides several parameter types for
convenience so we don't have to do the conversion in the callbacks.
Pass root pointers to the send related functions.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
5e54f9420f btrfs: parameter constification in ioctl.c
Add const to function parameters that are not changed.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
306a75e647 btrfs: allow debug builds to accept 2K block size
Currently we only support two block sizes, 4K and PAGE_SIZE.

This means on the most common architecture x86_64, we have no way to
test subpage block size.  And that's exactly I have an aarch64 machine
dedicated for subpage tests.

But this is still a hurdle for a lot of btrfs developers, and to improve
the test coverage mostly on x86_64, here we enable debug builds to
accept 2K block size.

This involves:

- Introduce a dedicated minimal block size macro
  BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set.
  If so it's 2K, otherwise it's 4K as usual.

- Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size

- Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE

- Export the new supported blocksize through sysfs interfaces

As most of the subpage support is already pretty mature, there is no
extra work needed to support the extra 2K block size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
23019d3e66 btrfs: properly limit inline data extent according to block size
Btrfs utilizes inline data extent for the following cases:

- Regular small files
- Symlinks

And "btrfs check" detects any file extents that are too large as an
error.

It's not a problem for 4K block size, but for the incoming smaller
block sizes (2K), it can cause problems due to bad limits:

- Non-compressed inline data extents
  We do not allow a non-compressed inline data extent to be as large as
  block size.

- Symlinks
  Currently the only real limit on symlinks are 4K, which can be larger
  than 2K block size.

These will result btrfs-check to report too large file extents.

Fix it by adding proper size checks for the above cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
2ef9d73f2b btrfs: remove the subpage related warning message
Since the initial enablement of block size < page size support for
btrfs in v5.15, we have hit several milestones for block size < page
size (subpage) support:

- RAID56 subpage support
  In v5.19

- Refactored scrub support to support subpage better
  In v6.4

- Block perfect (previously requires page aligned ranges) compressed write
  In v6.13

- Various error handling fixes involving subpage
  In v6.14

Finally the only missing feature is the pretty simple and harmless
inlined data extent creation, just added in previous patches.

Now btrfs has all of its features ready for both regular and subpage
cases, there is no reason to output a warning about the experimental
subpage support, and we can finally remove it now.

Acked-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
9951ec02f2 btrfs: allow inline data extents creation if block size < page size
Previously inline data extents creation was disabled if the block size
(previously called sector size) is smaller than the page size, for the
following reasons:

- Possible mixed inline and regular data extents
  However this is also the same if the block size matches the page size,
  thus we do not treat mixed inline and regular extents as an error.

  And the chance to cause mixed inline and regular data extents are not
  even increased, it has the same requirement (compressed inline data
  extent covering the whole first block, followed by regular extents).

- Inability to handle async/inline delalloc range for block size < page
  size cases
  This is already fixed since commit 1d2fbb7f1f ("btrfs: allow
  compression even if the range is not page aligned").

  This was the major technical obstacle, but it's not anymore.

With that removed, we can enable inline data extents creation no matter
the block size nor the page size, allowing btrfs to have the same
capacity for all block sizes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
0d31ca6584 btrfs: allow buffered write to avoid full page read if it's block aligned
[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:

  --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
  +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
  @@ -3,7 +3,8 @@
   read is in range
   write is in range
   write -> read/write
  -read is in range
  +read has value of 8388608
  +read is NOT in range -33792 .. 33792
   write is in range
  ...

[CAUSE]
The test case creates a 8MiB file, then does buffered write into the 8MiB
using 4K block size, to overwrite the whole file.

On 4K page sized systems, since the write range covers the full block and
page, btrfs will not bother reading the page, just like what XFS and EXT4
do.

But on 64K page sized systems, although the 4K sized write is still block
aligned, it's not page aligned anymore, thus btrfs will read the full
page, which will be accounted by cgroup and fail the test.

As the test case itself expects such 4K block aligned write should not
trigger any read.

Such expected behavior is an optimization to reduce folio reads when
possible, and unfortunately btrfs does not implement such optimization.

[FIX]
To skip the full page read, we need to do the following modification:

- Do not trigger full page read as long as the buffered write is block
  aligned
  This is pretty simple by modifying the check inside
  prepare_uptodate_page().

- Skip already uptodate blocks during full page read
  Or we can lead to the following data corruption:

  0       32K        64K
  |///////|          |

  Where the file range [0, 32K) is dirtied by buffered write, the
  remaining range [32K, 64K) is not.

  When reading the full page, since [0,32K) is only dirtied but not
  written back, there is no data extent map for it, but a hole covering
  [0, 64k).

  If we continue reading the full page range [0, 64K), the dirtied range
  will be filled with 0 (since there is only a hole covering the whole
  range).
  This causes the dirtied range to get lost.

With this optimization, btrfs can pass generic/563 even if the page size
is larger than fs block size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
Qu Wenruo
b2e743927f btrfs: make btrfs_do_readpage() to do block-by-block read
Currently if btrfs has its block size (the older sector size) smaller
than the page size, btrfs_do_readpage() will handle the range extent by
extent, this is good for performance as it doesn't need to re-lookup the
same extent map again and again.
(Although get_extent_map() already does extra cached em check, thus
the optimization is not that obvious.)

This is totally fine and is a valid optimization, but it has an
assumption that there is no partial uptodate range in the page.

Meanwhile there is an incoming feature, requiring btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page.

In that case, we can have a page that is partially uptodate, and the
current per-extent lookup cannot handle such case.

So here we change btrfs_do_readpage() to do block-by-block read, this
simplifies the following things:

- Remove the need for @iosize variable
  Because we just use sectorsize as our increment.

- Remove @pg_offset, and calculate it inside the loop when needed
  It's just offset_in_folio().

- Use a for() loop instead of a while() loop

This will slightly reduce the read performance for subpage cases, but for
the future where we need to skip already uptodate blocks, it should still
be worth.

For block size == page size, this brings no performance change.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
Qu Wenruo
d2da21a6e0 btrfs: introduce a read path dedicated extent lock helper
Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations:

- It will call btrfs_start_ordered_extent() to writeback the involved
  folios

  But remember we're calling btrfs_lock_and_flush_ordered_range() at
  read paths, meaning the folio is already locked by read path.

  If we really trigger writeback for those already locked folios, this
  will lead to a deadlock and writeback cannot get the folio lock.

  Such dead lock is prevented by the fact that btrfs always keeps a
  dirty folio also uptodate, by either dirtying all blocks of the folio,
  or by reading the whole folio before dirtying.

To prepare for the incoming patch which allows btrfs to skip full folio
read if the buffered write is block aligned, we have to start by solving
the possible deadlock first.

Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper, which is smarter in the following ways:

- Only wait and flush the ordered extent if
  * The folio doesn't even have private bit set
  * Part of the blocks of the ordered extent are not uptodate

  This can happen by:
  * The folio writeback finished, then got invalidated.
    There are a lot of reasons that a folio can get invalidated,
    from memory pressure to direct IO (which invalidates all folios
    of the range).
    But OE not yet finished.

  We have to wait for the ordered extent, as the OE may contain
  to-be-inserted data checksum.
  Without waiting, our read can fail due to the missing checksum.

  But either way, the OE should not need any extra flush inside the
  locked folio range.

- Skip the ordered extent completely if
  * All the blocks are dirty
    This happens when OE creation is caused by a folio writeback whose
    file offset is before our folio.

    E.g. 16K page size and 4K block size

    0      8K      16K      24K     32K
    |//////////////||///////|       |

    The writeback of folio 0 created an OE for range [0, 24K), but since
    folio 16K is not fully uptodate, a read is triggered for folio 16K.

    The writeback will never happen (we're holding the folio lock for
    read), nor will the OE finish.

    Thus we must skip the range.

  * All the blocks are uptodate
    This happens when the writeback finished, but OE not yet finished.

    Since the blocks are already uptodate, we can skip the OE range.

The new helper lock_extents_for_read() will do a loop for the target
range by:

1) Lock the full range

2) If there is no ordered extent in the remaining range, exit

3) If there is an ordered extent that we can skip
   Skip to the end of the OE, and continue checking
   We do not trigger writeback nor wait for the OE.

4) If there is an ordered extent that we cannot skip
   Unlock the whole extent range and start the ordered extent.

And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.

This will allow us to handle the following case properly in the future:

  16K page size, 4K btrfs block size:

  0     4K      8K     12K      16K      20K      24K     28K      32K
  |/////////////////////////////||////////////////|       |        |
  |<-------------------- OE 2 ------------------->|       |< OE 1 >|

  The folio has been written back before, thus we have an OE at
  [28K, 32K).
  Although the OE 1 finished its IO, the OE is not yet removed from IO
  tree.
  The folio got invalidated after writeback completed and before the
  ordered extent finished.

  And [16K, 24K) range is dirty and uptodate, caused by a block aligned
  buffered write (and future enhancements allowing btrfs to skip full
  folio read for such case).
  But writeback for folio 0 has began, thus it generated OE 2, covering
  range [0, 24K).

  Since the full folio 16K is not uptodate, if we want to read the folio,
  the existing btrfs_lock_and_flush_ordered_range() will dead lock, by:

  btrfs_read_folio()
  | Folio 16K is already locked
  |- btrfs_lock_and_flush_ordered_range()
     |- btrfs_start_ordered_extent() for range [16K, 24K)
        |- filemap_fdatawrite_range() for range [16K, 24K)
           |- extent_write_cache_pages()
              folio_lock() on folio 16K, deadlock.

  But now we will have the following sequence:

  btrfs_read_folio()
  | Folio 16K is already locked
  |- lock_extents_for_read()
     |- can_skip_ordered_extent() for range [16K, 24K)
     |  Returned true, the range [16K, 24K) will be skipped.
     |- can_skip_ordered_extent() for range [28K, 32K)
     |  Returned false.
     |- btrfs_start_ordered_extent() for range [28K, 32K) with
        [16K, 32K) as no writeback range
        No writeback for folio 16K will be triggered.

  And there will be no more possible deadlock on the same folio.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
Qu Wenruo
0bb067ca64 btrfs: fix the qgroup data free range for inline data extents
Inside function __cow_file_range_inline() since the inlined data no
longer take any data space, we need to free up the reserved space.

However the code is still using the old page size == sector size
assumption, and will not handle subpage case well.

Thankfully it is not going to cause any problems because we have two extra
safe nets:

- Inline data extents creation is disabled for sector size < page size
  cases for now
  But it won't stay that for long.

- btrfs_qgroup_free_data() will only clear ranges which have been already
  reserved
  So even if we pass a range larger than what we need, it should still
  be fine, especially there is only reserved space for a single block at
  file offset 0 of an inline data extent.

But just for the sake of consistency, fix the call site to use
sectorsize instead of page size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
Qu Wenruo
1a5b5668d7 btrfs: prevent inline data extents read from touching blocks beyond its range
Currently reading an inline data extent will zero out the remaining
range in the page.

This is not yet causing problems even for block size < page size
(subpage) cases because:

1) An inline data extent always starts at file offset 0
   Meaning at page read, we always read the inline extent first, before
   any other blocks in the page. Then later blocks are properly read out
   and re-fill the zeroed out ranges.

2) Currently btrfs will read out the whole page if a buffered write is
   not page aligned
   So a page is either fully uptodate at buffered write time (covers the
   whole page), or we will read out the whole page first.
   Meaning there is nothing to lose for such an inline extent read.

But it's still not ideal:

- We're zeroing out the page twice
  Once done by read_inline_extent()/uncompress_inline(), once done by
  btrfs_do_readpage() for ranges beyond i_size.

- We're touching blocks that don't belong to the inline extent
  In the incoming patches, we can have a partial uptodate folio, of
  which some dirty blocks can exist while the page is not fully uptodate:

  The page size is 16K and block size is 4K:

  0         4K        8K        12K        16K
  |         |         |/////////|          |

  And range [8K, 12K) is dirtied by a buffered write, the remaining
  blocks are not uptodate.

  If range [0, 4K) contains an inline data extent, and we try to read
  the whole page, the current behavior will overwrite range [8K, 12K)
  with zero and cause data loss.

So to make the behavior more consistent and in preparation for future
changes, limit the inline data extents read to only zero out the range
inside the first block, not the whole page.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
Anand Jain
a66b39f699 btrfs: sysfs: accept size suffixes for read policy values
We now parse human-friendly size values (e.g. '1G', '2M') when setting
read policies.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
19eaf5fd8c btrfs: use BTRFS_PATH_AUTO_FREE in load_free_space_tree()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
2e70d126f9 btrfs: use BTRFS_PATH_AUTO_FREE in clear_free_space_tree()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
3bfd9ead81 btrfs: use BTRFS_PATH_AUTO_FREE in populate_free_space_tree()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
This applies to both path and path2.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
c42c0db1bb btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_remove_free_space_inode()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
3349ae34b7 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_bio_sums()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
e5344080cf btrfs: use BTRFS_PATH_AUTO_FREE in run_delayed_extent_op()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
2267214a05 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_extent_info()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
899c8798b5 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_get_name()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with some return simplifications.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:48 +01:00
David Sterba
72f2bae3c1 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_root_free_objectid()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
aaa5ae8f6d btrfs: use BTRFS_PATH_AUTO_FREE in load_global_roots()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
50833146eb btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_check_dir_item_collision()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
3e21e8e941 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_run_dev_replace()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
073dd51f1f btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_dev_replace()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
efac576c22 btrfs: do trivial BTRFS_PATH_AUTO_FREE conversions
The most trivial pattern for the auto freeing when the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost none goto -> return conversions and there's no other
function cleanup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
bd06bce1b3 btrfs: use num_extent_folios() in for loop bounds
As the helper num_extent_folios() is now __pure, we can use it in for
loop without storing its value in a variable explicitly, the compiler
will do this for us.

The effects on btrfs.ko is -200 bytes and there are stack space savings
too:

btrfs_clone_extent_buffer                               -8 (32 -> 24)
btrfs_clear_buffer_dirty                                -8 (48 -> 40)
clear_extent_buffer_uptodate                            -8 (40 -> 32)
set_extent_buffer_dirty                                 -8 (32 -> 24)
write_one_eb                                            -8 (88 -> 80)
set_extent_buffer_uptodate                              -8 (40 -> 32)
read_extent_buffer_pages_nowait                        -16 (64 -> 48)
find_extent_buffer                                      -8 (32 -> 24)

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
6e7c283863 btrfs: add __pure attribute to eb page and folio counters
The functions qualify for the pure attribute as they always return the
same value for the same argument (in the given scope). This allows to
optimize the calls and cache the value.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
b7226ce6c4 btrfs: simplify parameters of metadata folio helpers
Unlike folio helpers for date the ones for metadata always take the
extent buffer start and length, so they can be simplified to take the
eb only.  The fs_info can be obtained from eb too so it can be dropped
as parameter.

Added in patch "btrfs: use metadata specific helpers to simplify extent
buffer helpers".

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
Filipe Manana
6207687043 btrfs: fix reclaimed bytes accounting after automatic block group reclaim
We are considering the used bytes counter of a block group as the amount
to update the space info's reclaim bytes counter after relocating the
block group, but this value alone is often not enough. This is because we
may have a reserved extent (or more) and in that case its size is
reflected in the reserved counter of the block group - the size of the
extent is only transferred from the reserved counter to the used counter
of the block group when the delayed ref for the extent is run - typically
when committing the transaction (or when flushing delayed refs due to
ENOSPC on space reservation). Such call chain for data extents is:

   btrfs_run_delayed_refs_for_head()
       run_one_delayed_ref()
           run_delayed_data_ref()
               alloc_reserved_file_extent()
                   alloc_reserved_extent()
                       btrfs_update_block_group()
                          -> transfers the extent size from the reserved
                             counter to the used counter

For metadata extents:

   btrfs_run_delayed_refs_for_head()
       run_one_delayed_ref()
           run_delayed_tree_ref()
               alloc_reserved_tree_block()
                   alloc_reserved_extent()
                       btrfs_update_block_group()
                           -> transfers the extent size from the reserved
                              counter to the used counter

Since relocation flushes delalloc, waits for ordered extent completion
and commits the current transaction before doing the actual relocation
work, the correct amount of reclaimed space is therefore the sum of the
"used" and "reserved" counters of the block group before we call
btrfs_relocate_chunk() at btrfs_reclaim_bgs_work().

So fix this by taking the "reserved" counter into consideration.

Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
Filipe Manana
ba5d06440c btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
At btrfs_reclaim_bgs_work(), we are grabbing twice the used bytes counter
of the block group while not holding the block group's spinlock. This can
result in races, reported by KCSAN and similar tools, since a concurrent
task can be updating that counter while at btrfs_update_block_group().

So avoid these races by grabbing the counter in a critical section
delimited by the block group's spinlock after setting the block group to
RO mode. This also avoids using two different values of the counter in
case it changes in between each read. This silences KCSAN and is required
for the next patch in the series too.

Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
Filipe Manana
1283b8c125 btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable
bytes while not under the protection of the block group's spinlock, so
this can trigger race reports from KCSAN (or similar tools) since that
field is typically updated while holding the lock, such as at
__btrfs_add_free_space_zoned() for example.

Fix this by grabbing the zone unusable bytes while we are still in the
critical section holding the block group's spinlock, which is right above
where we are currently grabbing it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
db3a1bac3f btrfs: merge alloc_dummy_extent_buffer() helpers
After previous patch removing nodesize from parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical so we can drop one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
551561c346 btrfs: don't pass nodesize to __alloc_extent_buffer()
All callers pass a valid fs_info so we can read the nodesize from that
instead of passing it as parameter.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
Filipe Manana
25aff7b964 btrfs: send: simplify return logic from send_set_xattr()
There's no longer any need for the 'out' label as there are no resources
to cleanup anymore in case of an error and we can directly return if
begin_cmd() fails.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
374d45af64 btrfs: send: avoid path allocation for the current inode when issuing commands
Whenever we issue a command we allocate a path and then compute it. For
the current inode this is not necessary since we have one preallocated
and computed in the send context structure, so we can use it instead
and avoid allocating and freeing a path.

For example if we have 100 extents to send (100 write commands) for a
file, we are allocating and freeing paths 100 times.

So improve on this by avoiding path allocation and freeing whenever a
command is for the current inode by using the current inode's path
stored in the send context structure.

A test was run before applying this patch and the previous one in the
series:

  "btrfs: send: keep the current inode's path cached"

The test script is the following:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  mkfs.btrfs -f $DEV > /dev/null
  mount $DEV $MNT

  DIR="$MNT/one/two/three/four"
  FILE="$DIR/foobar"

  mkdir -p $DIR

  # Create some empty files to get a deeper btree and therefore make
  # path computations slower.
  for ((i = 1; i <= 30000; i++)); do
      echo -n > "$DIR/filler_$i"
  done

  for ((i = 0; i < 10000; i += 2)); do
     offset=$(( i * 4096 ))
     xfs_io -f -c "pwrite -S 0xab $offset 4K" $FILE > /dev/null
  done

  btrfs subvolume snapshot -r $MNT $MNT/snap

  start=$(date +%s%N)
  btrfs send -f /dev/null $MNT/snap
  end=$(date +%s%N)

  echo -e "\nsend took $(( (end - start) / 1000000 )) milliseconds"

  umount $MNT

Result before applying the 2 patches:  1121 milliseconds
Result after applying the 2 patches:    815 milliseconds  (-31.6%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
fc746acb7a btrfs: send: keep the current inode's path cached
Whenever we need to send a command for the current inode, like sending
writes, xattr updates, truncates, utimes, etc, we compute the inode's
path each time, which implies doing some memory allocations and traversing
the inode hierarchy to extract the name of the inode and each ancestor
directory, and that implies doing lookups in the subvolume tree amongst
other operations.

Most of the time, by far, the current inode's path doesn't change while
we are processing it (like if we need to issue 100 write commands, the
path remains the same and it's pointless to compute it 100 times).

To avoid this keep the current inode's path cached in the send context
and invalidate it or update it whenever it's needed (after unlinks or
renames).

A performance test, and its results, is mentioned in the next patch in
the series (subject: "btrfs: send: avoid path allocation for the current
inode when issuing commands").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
d7d56ccf10 btrfs: send: simplify return logic from send_rmdir()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
26605cc9d0 btrfs: send: simplify return logic from send_unlink()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
711584496f btrfs: send: simplify return logic from send_link()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
264515c7cb btrfs: send: simplify return logic from send_rename()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
cb474665f9 btrfs: send: simplify return logic from send_verity()
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
ab12858161 btrfs: send: simplify return logic from process_changed_xattr()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
31db3e17e2 btrfs: send: remove unnecessary return variable from process_new_xattr()
There's no need for the 'ret' variable, we can just return directly the
result of the call to iterate_dir_item().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
892772c389 btrfs: send: simplify return logic from record_changed_ref()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
43090f2ca9 btrfs: send: simplify return logic from record_deleted_ref()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
de6d3a5b78 btrfs: send: simplify return logic from record_new_ref()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
39a1c41fa6 btrfs: send: simplify return logic from record_deleted_ref_if_needed()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:46 +01:00
Filipe Manana
25e5dee510 btrfs: send: simplify return logic from record_new_ref_if_needed()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
 make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
9435159f28 btrfs: send: simplify return logic from send_remove_xattr()
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
ec666c84de btrfs: send: add and use helper to rename current inode when processing refs
Extract the logic to rename the current inode at process_recorded_refs()
into a helper function and use it, therefore removing duplicated logic
and making it easier for an upcoming patch by avoiding yet more duplicated
logic.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
9453fe3297 btrfs: send: only use boolean variables at process_recorded_refs()
We have several local variables at process_recorded_refs() that are used
as booleans, with some of them having a 'bool' type while two of them
having an 'int' type. Change this to make them all use the 'bool' type
which is more clear and to make everything more consistent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
17f6a74d0b btrfs: send: factor out common logic when sending xattrs
We always send xattrs for the current inode only and both callers of
send_set_xattr() pass a path for the current inode. So move the path
allocation and computation to send_set_xattr(), reducing duplicated
code. This also facilitates an upcoming patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
91e9139e5b btrfs: send: simplify return logic from get_cur_inode_state()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
6bb09d0c12 btrfs: send: simplify return logic from is_inode_existent()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
dbee3fc55a btrfs: send: simplify return logic from __get_cur_name_and_parent()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
a77749b3e2 btrfs: send: return -ENAMETOOLONG when attempting a path that is too long
When attempting to build a too long path we are currently returning
-ENOMEM, which is very odd and misleading. So update fs_path_ensure_buf()
to return -ENAMETOOLONG instead. Also, while at it, move the WARN_ON()
into the if statement's expression, as it makes it clear what is being
tested and also has the effect of adding 'unlikely' to the statement,
which allows the compiler to generate better code as this condition is
never expected to happen.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
78843d7e4e btrfs: send: simplify return logic from fs_path_add_from_extent_buffer()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
a3d37502e7 btrfs: send: implement fs_path_add_path() using fs_path_add()
The helper fs_path_add_path() is basically a copy of fs_path_add() and it
can be made a wrapper around fs_path_add(). So do that and also make it
inline and constify its second argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
c727371879 btrfs: send: simplify return logic from fs_path_add()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
147ff86860 btrfs: send: simplify return logic from fs_path_prepare_for_add()
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
1f63d4b610 btrfs: send: always use fs_path_len() to determine a path's length
Several places are hardcoding the path length calculation instead of using
the helper fs_path_len() for that. Update all those places to instead use
fs_path_len().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:45 +01:00
Filipe Manana
920e8ee2bf btrfs: send: make fs_path_len() inline and constify its argument
The helper function fs_path_len() is trivial and doesn't need to change
its path argument, so make it inline and constify the argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
Filipe Manana
75dfc5d0ca btrfs: send: remove duplicated logic from fs_path_reset()
There's duplicated logic in both branches of the if statement, so move it
outside the branches.

This also reduces the object code size.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1746279	 163600	  16920	1926799	 1d668f	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1746047	 163592	  16920	1926559	 1d659f	fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
a4bb776cbe btrfs: use struct btrfs_inode inside btrfs_get_name()
Use a struct btrfs_inode in btrfs_get_name() as it's an internal
helper, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
8dddeb53ab btrfs: use struct btrfs_inode inside btrfs_get_parent()
Use a struct btrfs_inode to btrfs_get_parent() as it's an internal
helper, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
2d4323ced5 btrfs: use struct btrfs_inode inside btrfs_remap_file_range_prep()
Use a struct btrfs_inode in btrfs_remap_file_range_prep() as it's an
internal helper, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
8b044e17e5 btrfs: use struct btrfs_inode inside btrfs_remap_file_range()
Use a struct btrfs_inode to btrfs_remap_file_range() as it's an internal
helper, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
61dbdeb870 btrfs: pass struct btrfs_inode to btrfs_extent_same_range()
Pass a struct btrfs_inode to btrfs_extent_same_range() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
651cef4611 btrfs: pass struct btrfs_inode to btrfs_double_mmap_unlock()
Pass a struct btrfs_inode to btrfs_double_mmap_unlock() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
0061ba125b btrfs: pass struct btrfs_inode to btrfs_double_mmap_lock()
Pass a struct btrfs_inode to btrfs_double_mmap_lock() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
65a66afd1e btrfs: pass struct btrfs_inode to clone_copy_inline_extent()
Pass a struct btrfs_inode to clone_copy_inline_extent() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
41c5a5dc73 btrfs: props: switch prop_handler::extract to struct btrfs_inode
Pass a struct btrfs_inode to the extract() callback as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
7e027b767d btrfs: props: switch prop_handler::apply to struct btrfs_inode
Pass a struct btrfs_inode to the apply() callback as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
101ab6d1ff btrfs: pass struct btrfs_inode to btrfs_inode_inherit_props()
Pass a struct btrfs_inode to btrfs_inherit_props() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
308a02a447 btrfs: pass struct btrfs_inode to btrfs_load_inode_props()
Pass a struct btrfs_inode to btrfs_load_inode_props() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:44 +01:00
David Sterba
a0680a946f btrfs: pass struct btrfs_inode to btrfs_fill_inode()
Pass a struct btrfs_inode to btrfs_fill_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
9882e1d100 btrfs: pass struct btrfs_inode to fill_stack_inode_item()
Pass a struct btrfs_inode to fill_stack_inode_item() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
cb9a1f5ffa btrfs: use struct btrfs_inode inside create_pending_snapshot()
Use a struct btrfs_inode in create_pending_snapshot() as it's an
internal helper, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
fc11fd0cb8 btrfs: pass struct btrfs_inode to btrfs_defrag_file()
Pass a struct btrfs_inode to btrfs_defrag_file() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
01b2e7de3a btrfs: pass struct btrfs_inode to btrfs_inode_type()
Pass a struct btrfs_inode to btrfs_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
11af82b02b btrfs: pass struct btrfs_inode to new_simple_dir()
Pass a struct btrfs_inode to new_simple_dir() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
4ea2fb9c62 btrfs: pass struct btrfs_inode to btrfs_iget_locked()
Pass a struct btrfs_inode to btrfs_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
d36f84a849 btrfs: pass struct btrfs_inode to btrfs_read_locked_inode()
Pass a struct btrfs_inode to btrfs_read_locked_inode() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
0d12afad24 btrfs: pass struct btrfs_inode to extent_range_clear_dirty_for_io()
Pass a struct btrfs_inode to extent_range_clear_dirty_for_io() as it's
an internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
44dddd493e btrfs: pass struct btrfs_inode to can_nocow_extent()
Pass a struct btrfs_inode to can_nocow_extent() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
6149c82bda btrfs: update include and forward declarations in headers
Pass over all header files and add missing forward declarations,
includes or fix include types.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
f867ccabb8 btrfs: simplify returns and labels in btrfs_init_fs_root()
There's a label that does nothing else than return, so remove it and
also change other gotos to immediate returns as the function is short
enough for this pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba
dba6ae0b43 btrfs: unify ordering of btrfs_key initializations
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
David Sterba
17b9824922 btrfs: zstd: remove local variable for storing page offsets
When using offset_in_page() it's clear what it means, we don't need to
store it in the local variable just to use it right away. There's no
change in the generated code, but keeps the declarations smaller.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
David Sterba
cfb999b81a btrfs: zstd: move zstd_parameters to the workspace
Reduce stack consumption of zstd_compress_folios() by 40 bytes
(10*sizeof(int)) as we can store struct zstd_parameters in the workspace
that is reused for each call.

typedef struct {
	ZSTD_compressionParameters cParams;
	ZSTD_frameParameters fParams;
} ZSTD_parameters;

typedef struct {
    unsigned windowLog;
    unsigned chainLog;
    unsigned hashLog;
    unsigned searchLog;
    unsigned minMatch;
    unsigned targetLength;
    ZSTD_strategy strategy;
} ZSTD_compressionParameters;

typedef struct {
    int contentSizeFlag;
    int checksumFlag;
    int noDictIDFlag;
} ZSTD_frameParameters;

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
David Sterba
a8511baf32 btrfs: async-thread: switch local variables need_order bool
Use bool for 0/1 indicators in thresh_exec_hook() and
btrfs_work_helper().

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
David Sterba
0128c9a7cd btrfs: add __cold attribute to extent_io_tree_panic()
This is a wrapper that leads to a panic, so add the annotation like the
other similar functions have.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Johannes Thumshirn
26b38e2816 btrfs: zoned: exit btrfs_can_activate_zone if BTRFS_FS_NEED_ZONE_FINISH is set
If BTRFS_FS_NEED_ZONE_FINISH is already set for the whole filesystem, exit
early in btrfs_can_activate_zone(). There's no need to check if
BTRFS_FS_NEED_ZONE_FINISH needs to be set if it is already set.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
fcc384be06 btrfs: require strict data/metadata split for subpage checks
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
to be data inode specific.

This change involves:

- Simplify btrfs_is_subpage()
  Now we only need to do a very simple sectorsize check against
  PAGE_SIZE.
  And since the function is pretty simple now, just make it an inline
  function.

- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
  on data inode mapping

- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers
  Or we will trigger the ASSERT() due to calling btrfs_folio_*() on
  metadata folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
67ebd7a1f1 btrfs: simplify subpage handling of read_extent_buffer_pages_nowait()
By using a shared bio_add_folio_nofail() with calculated
range_start/range_len, so no more explicit subpage routine needed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
6c6201278e btrfs: simplify subpage handling of write_one_eb()
Currently inside write_one_eb() we have two different ways of handling
subpage and regular metadata.

The differences are:

- Extra offset/length calculation when adding the folio range to bio for
  subpage cases
- Only decrease wbc->nr_to_write if the whole page is no longer dirty
  for subpage cases
- Use subpage helper for subpage cases

Merge the tow ways into a shared one:

- Always calculate the to-be-queued range
  So that bio_add_folio() can use the same calculated resulted length
  and offset for both cases.

- Use btrfs_meta_folio_clear_dirty() and
  btrfs_meta_folio_set_writeback() helpers
  This will cover both cases.

- Only decrease wbc->nr_to_write if the folio is no longer dirty
  Since we have the folio locked, no one else can modify the folio dirty
  flags (set_extent_buffer_dirty() will also lock the folio for subpage
  cases).

  Thus after our btrfs_meta_folio_clear_dirty() call, if the whole folio
  is no longer dirty, we're submitting the last dirty eb of the folio,
  and can decrease wbc->nr_to_write properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
7895817b31 btrfs: simplify subpage handling of btrfs_clear_buffer_dirty()
The function btrfs_clear_buffer_dirty() is called on dirty extent buffer
that will not be written back.

The function will call btree_clear_folio_dirty() to clear the folio
dirty flag and also clear PAGECACHE_TAG_DIRTY flag.

And we split the subpage and regular handling, as for subpage cases we
should only clear PAGECACHE_TAG_DIRTY if the last dirty extent buffer in
the page is cleared.

So here we can simplify the function by:

- Use the newly introduced btrfs_meta_folio_clear_and_test_dirty() helper
  The helper will return true if we cleared the folio dirty flag.
  With that we can use the same helper for both subpage and regular
  cases.

- Rename btree_clear_folio_dirty() to btree_clear_folio_dirty_tag()
  As we move the folio dirty clearing in the btrfs_clear_buffer_dirty().

- Call btrfs_meta_folio_clear_and_test_dirty() to clear the dirty flags
  for both regular and subpage metadata cases

- Only call btree_clear_folio_dirty_tag() when the folio is no longer
  dirty

- Update the comment inside set_extent_buffer_dirty()
  As there is no separate clear_subpage_extent_buffer_dirty() anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
ee76e5a742 btrfs: use metadata specific helpers to simplify extent buffer helpers
The following functions are doing metadata specific checks:

- set_extent_buffer_uptodate()
- clear_extent_buffer_uptodate()

The reason why we do not use btrfs_folio_*() helpers for those helpers
is, btrfs_is_subpage() cannot handle dummy extent buffer if nodesize >=
PAGE_SIZE but block size < PAGE_SIZE.

In that case, we do not need to attach extra bitmaps to the extent
buffer folio. But since dummy extent buffer folios are not attached to
btree inode, btrfs_is_subpage() will return true, causing problems.

And the following are using btrfs_folio_*() helpers for metadata, but
in theory we should use metadata specific checks:

- set_extent_buffer_dirty()

This is not causing problems because a dummy extent buffer should never
be marked dirty.

To make code simpler, introduce btrfs_meta_folio_*() helpers, to do
the metadata specific handling, so that we do not to open-code such
checks in above involved functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
57a3212674 btrfs: make subpage attach and detach handle metadata properly
Currently subpage attach/detach is not doing proper dummy extent buffer
subpage check, as btrfs_is_subpage() is not reliable for dummy extent
buffer folios.

Since we have a metadata specific check now, use that for
btrfs_attach_subpage() first.

Then enhance btrfs_detach_subpage() to accept a type parameter, so that
we can do extra checks for dummy extent buffers properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
f64e818153 btrfs: factor out metadata subpage detection into a dedicated helper
Currently we have only one btrfs_is_subpage() to cover both data and
metadata.

But there is a special case for metadata:

- dummy extent buffer, sector size < PAGE_SIZE and node size >= PAGE_SIZE

In such case, btrfs_is_subpage() will return true for extent buffer
folio.

But that is not correct, and that's exactly why we have some open-coded
checks for functions like set_extent_buffer_uptodate() and
clear_extent_buffer_uptodate().

Just extract the metadata specific checks into a helper, and replace
those call sites.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
619611e87f btrfs: remove btrfs_fs_info::sectors_per_page
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.

To prepare for that future, here we do:

- Remove btrfs_fs_info::sectors_per_page

- Introduce a helper, btrfs_blocks_per_folio()
  Which uses the folio size to calculate the number of blocks for each
  folio.

- Migrate the existing btrfs_fs_info::sectors_per_page to use that
  helper
  There are some exceptions:

  * Metadata nodesize < page size support
    In the future, even if we support large folios, we will only
    allocate a folio that matches our nodesize.
    Thus we won't have a folio covering multiple metadata unless
    nodesize < page size.

  * Existing subpage bitmap dump
    We use a single unsigned long to store the bitmap.
    That means until we change the bitmap dumping code, our upper limit
    for folio size will only be 256K (4K block size, 64 bit unsigned
    long).

  * btrfs_is_subpage() check
    This will be migrated into a future patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Daniel Vacek
da798fa519 btrfs: zstd: enable negative compression levels mount option
Allow using the fast modes (negative compression levels) of zstd as a
mount option.

As per the results, the compression ratio is (expectedly) lower:

for level in {-15..-1} 1 2 3; \
do printf "level %3d\n" $level; \
  mount -o compress=zstd:$level /dev/sdb /mnt/test/; \
  grep sdb /proc/mounts; \
  cp -r /usr/bin       /mnt/test/; sync; compsize /mnt/test/bin; \
  cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \
  cp    enwik9         /mnt/test/; sync; compsize /mnt/test/enwik9; \
  cp    linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \
  rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \
  umount /mnt/test/; \
done |& tee results | \
awk '/^level/{print}/^TOTAL/{print$3"\t"$2"  |"}' | paste - - - - -

		266M	bin  |	45M	doc  |	953M	wiki |	1.4G	source
=============================+===============+===============+===============+
level -15	180M	67%  |	30M	68%  |	694M	72%  |	598M	40%  |
level -14	180M	67%  |	30M	67%  |	683M	71%  |	581M	39%  |
level -13	177M	66%  |	29M	66%  |	671M	70%  |	566M	38%  |
level -12	174M	65%  |	29M	65%  |	658M	69%  |	548M	37%  |
level -11	174M	65%  |	28M	64%  |	645M	67%  |	530M	35%  |
level -10	171M	64%  |	28M	62%  |	631M	66%  |	512M	34%  |
level  -9	165M	62%  |	27M	61%  |	615M	64%  |	493M	33%  |
level  -8	161M	60%  |	27M	59%  |	598M	62%  |	475M	32%  |
level  -7	155M	58%  |	26M	58%  |	582M	61%  |	457M	30%  |
level  -6	151M	56%  |	25M	56%  |	565M	59%  |	437M	29%  |
level  -5	145M	54%  |	24M	55%  |	545M	57%  |	417M	28%  |
level  -4	139M	52%  |	23M	52%  |	520M	54%  |	391M	26%  |
level  -3	135M	50%  |	22M	50%  |	495M	51%  |	369M	24%  |
level  -2	127M	47%  |	22M	48%  |	470M	49%  |	349M	23%  |
level  -1	120M	45%  |	21M	47%  |	452M	47%  |	332M	22%  |
level   1	110M	41%  |	17M	39%  |	362M	38%  |	290M	19%  |
level   2	106M	40%  |	17M	38%  |	349M	36%  |	288M	19%  |
level   3	104M	39%  |	16M	37%  |	340M	35%  |	276M	18%  |

The samples represent some data sets that can be commonly found and show
approximate compressibility. The fast levels trade off speed for ratio
and are best suitable for highly compressible data.

As can be seen above, comparing the results to the current default zstd
level 3, the negative levels are roughly 2x worse at -15 and the
ratio increases almost linearly with each level.

Signed-off-by: Daniel Vacek <neelx@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
94f6c5c17e btrfs: move ordered extent cleanup to where they are allocated
The ordered extent cleanup is hard to grasp because it doesn't follow
the common cleanup-asap pattern.

E.g. run_delalloc_nocow() and cow_file_range() allocate one or more
ordered extent, but if any error is hit, the cleanup is done later inside
btrfs_run_delalloc_range().

To change the existing delayed cleanup:

- Update the comment on error handling of run_delalloc_nocow()
  There are in fact 3 different cases other than 2 if we are doing
  ordered extents cleanup inside run_delalloc_nocow():

  1) @cow_start and @cow_end not set
     No fallback to COW at all.
     Before @cur_offset we need to cleanup the OE and page dirty.
     After @cur_offset just clear all involved page and extent flags.

  2) @cow_start set but @cow_end not set.
     This means we failed before even calling fallback_to_cow().
     It's just a variant of case 1), where it's @cow_start splitting
     the two parts (and we should just ignore @cur_offset since it's
     advanced without any new ordered extent).

  3) @cow_start and @cow_end both set
     This means fallback_to_cow() failed, meaning [start, cow_start)
     needs the regular OE and dirty folio cleanup, and skip range
     [cow_start, cow_end) as cow_file_range() has done the cleanup,
     and eventually cleanup [cow_end, end) range.

- Only reset @cow_start after fallback_to_cow() succeeded
  As above case 2) and 3) are both relying on @cow_start to determine
  the cleanup range.

- Move btrfs_cleanup_ordered_extents() into run_delalloc_nocow(),
  cow_file_range() and nocow_one_range()

  For cow_file_range() it's pretty straightforward and easy.

  For run_delalloc_nocow() refer to the above 3 different error cases.

  For nocow_one_range() if we hit an error, we need to cleanup the
  ordered extents by ourselves.
  And then it will fallback to case 1), since @cur_offset is not yet
  advanced, the existing cleanup will co-operate with nocow_one_range()
  well.

- Remove the btrfs_cleanup_ordered_extents() inside submit_uncompressed_range()
  As failed cow_file_range() will do all the proper cleanup now.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
10326fdcb3 btrfs: factor out nocow ordered extent and extent map generation into a helper
Currently we're doing all the ordered extent and extent map generation
inside a while() loop of run_delalloc_nocow().  This makes it pretty
hard to read, nor doing proper error handling.

So move that part of code into a helper, nocow_one_range().

This should not change anything, but there is a tiny timing change where
btrfs_dec_nocow_writers() is only called after nocow_one_range() helper
exits.

This timing change is small, and makes error handling easier, thus
should be fine.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
ecde48a1a6 btrfs: expose per-inode stable writes flag
The address space flag AS_STABLE_WRITES determine if FGP_STABLE for will
wait for the folio to finish its writeback.

For btrfs, due to the default data checksum behavior, if we modify the
folio while it's still under writeback, it will cause data checksum
mismatch.  Thus for quite some call sites we manually call
folio_wait_writeback() to prevent such problem from happening.

Currently there is only one call site inside btrfs really utilizing
FGP_STABLE, and in that case we also manually call folio_wait_writeback()
to do the waiting.

But it's better to properly expose the stable writes flag to a per-inode
basis, to allow call sites to fully benefit from FGP_STABLE flag.
E.g. for inodes with NODATASUM allowing beginning dirtying the page
without waiting for writeback.

This involves:

- Update the mapping's stable write flag when setting/clearing NODATASUM
  inode flag using ioctl
  This only works for empty files, so it should be fine.

- Update the mapping's stable write flag when reading an inode from disk

- Remove the explicit folio_wait_writeback() for FGP_BEGINWRITE call
  site

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
aa60fe12b4 btrfs: zlib: refactor S390x HW acceleration buffer preparation
Currently for s390x HW zlib compression, to get the best performance we
need a buffer size which is larger than a page.

This means we need to copy multiple pages into workspace->buf, then use
that buffer as zlib compression input.

Currently it's hardcoded using page sized folio, and all the handling
are deep inside a loop.

Refactor the code by:

- Introduce a dedicated helper to do the buffer copy
  The new helper will be called copy_data_into_buffer().

- Add extra ASSERT()s
  * Make sure we only go into the function for hardware acceleration
  * Make sure we still get page sized folio

- Prepare for future large folios
  This means we will rely on the folio size, other than PAGE_SIZE to do
  the copy.

- Handle the folio mapping and unmapping inside the helper function
  For S390x hardware acceleration case, it never utilize the @data_in
  pointer, thus we can do folio mapping/unmapping all inside the function.

Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Filipe Manana
87d6aaf79b btrfs: avoid assigning twice to block_start at btrfs_do_readpage()
At btrfs_do_readpage() if we get an extent map for a prealloc extent we
end up assigning twice to the 'block_start' variable, first the value
returned by extent_map_block_start() and then EXTENT_MAP_HOLE. This is
pointless so make it more clear by using an if-else statement and doing
only one assignment. Also, while at it, move the declaration of
'block_start' into the while loop's scope, since it's not used outside of
it and the related 'disk_bytenr' is also declared in this scope.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Qu Wenruo
968f19c5b1 btrfs: always fallback to buffered write if the inode requires checksum
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode 'none').

[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).

This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.

But the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).

So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
btrfs requires no modification, this leads to the false csum mismatch.

This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.

For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.

[FIX]
I have considered the following ideas to solve the problem:

- Make direct IO to always skip data checksum
  This not only requires a new incompatible flag, as it breaks the
  current per-inode NODATASUM flag.
  But also requires extra handling for no csum found cases.

  And this also reduces our checksum protection.

- Let hardware handle all the checksum
  AKA, just nodatasum mount option.
  That requires trust for hardware (which is not that trustful in a lot
  of cases), and it's not generic at all.

- Always fallback to buffered write if the inode requires checksum
  This was suggested by Christoph, and is the solution utilized by this
  patch.

  The cost is obvious, the extra buffer copying into page cache, thus it
  reduces the performance.
  But at least it's still user configurable, if the end user still wants
  the zero-copy performance, just set NODATASUM flag for the inode
  (which is a common practice for VM images on btrfs).

  Since we cannot trust user space programs to keep the buffer
  consistent during direct IO, we have no choice but always falling back
  to buffered IO.  At least by this, we avoid the more deadly false data
  checksum mismatch error.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:32 +01:00
Qu Wenruo
c5e8f2924a btrfs: remove duplicated metadata folio flag update in end_bbio_meta_read()
In that function we call set_extent_buffer_uptodate() or
clear_extent_buffer_uptodate(), which will already update the uptodate
flag for all the involved extent buffer folios.

Thus there is no need to update the folio uptodate flags again.

Just remove the open-coded part.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17 14:44:43 +01:00
Matthew Wilcox (Oracle)
8be4cb04cb btrfs: convert io_ctl_prepare_pages() to work on folios
Retrieve folios instead of pages and work on them throughout.  Removes
a few calls to compound_head() and a reference to page->mapping.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17 14:44:43 +01:00
Matthew Wilcox (Oracle)
b9967834ab btrfs: update some folio related comments
Remove references to the page lock and page->mapping.  Also btrfs folios
can never be swizzled into swap (mentioned in extent_write_cache_pages()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17 14:44:42 +01:00
Daniel Vacek
96b2854de8 btrfs: keep private struct on stack for sync reads in btrfs_encoded_read_regular_fill_pages()
Only allocate the btrfs_encoded_read_private structure for asynchronous
(io_uring) mode.

There's no need to allocate an object from slab in the synchronous mode. In
such a case stack can be happily used as it used to be before 68d3b27e05
("btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()")
which was a preparation for the async mode.

While at it, fix the comment to reflect the atomic => refcount change in
d29662695e ("btrfs: fix use-after-free waiting for encoded read endios").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17 14:44:42 +01:00
Easwar Hariharan
8ef9019ed2 btrfs: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)

Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-5-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 23:24:16 -07:00
Linus Torvalds
6ceb6346b0 for-6.14-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmfLM5wACgkQxWXV+ddt
 WDsK3A/7BEIUzin4CpmhBkFQamPCLjLL+Zz2etmoYWCrKnNPRMutVbsgeRM43cBt
 NXMD4RSoeXO/aYzrPhe4KMP4a5PkI02v2CEpPJqMRPmbADGyExx5Vnh68ioZWQbi
 N54Sd5LqhMT9FcViG46VJXr+MOBKIzO8++TxswIrCDO+6X/Y39+xZGxj4DXrnF38
 zgvxbILbiH+7vC1m9NV8K7Vl0jp36hQKcCjJYCfohbVoFQiyvmuh2x0LDL2HnIfH
 VpREP+eo/a3ZO8vPo7+4HZ5DVf5AolulbEC6myxsvFScLhWlh218plVyuv4QyACW
 RYDm9MqLqfqOkEDgj+Tb0C4s6uyVon5xbRL3aNbSE73KnUVeb/bB77qAejjzAkIr
 MvEEeEJp0H34OZm2fnUyFIu3ShDcSif1qH0rCOm1rBeqYZZsX7ny7TvKIqkgrsKk
 JbzgpYLyzzqTHs9QERw3OUhIBuefFCs4HlUeukLbUCdqI+ruPp5s76jfHQnT3dzG
 ad5CUW8eHf6mkU93dUlQIeDJSVPdaanf0Whomk3eOKgBeu8+gNp9R41kKJ7UtoA9
 GG504bqNjSe8t0sVmSyuE30BWAQWYnyCSY/9u46JrB6MtfWv+wikU/Nox4qZjM4d
 UhhWkDTELaTngcYkbm5+MD0DkkglTeqEbR9gCM21c9xiJrojhcw=
 =v6KI
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix leaked extent map after error when reading chunks

 - replace use of deprecated strncpy

 - in zoned mode, fixed range when ulocking extent range, causing a hang

* tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix a leaked chunk map issue in read_one_chunk()
  btrfs: replace deprecated strncpy() with strscpy()
  btrfs: zoned: fix extent range end unlock in cow_file_range()
2025-03-07 11:17:30 -10:00
Haoxiang Li
35d99c68af btrfs: fix a leaked chunk map issue in read_one_chunk()
Add btrfs_free_chunk_map() to free the memory allocated
by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails.

Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-06 14:40:09 +01:00
NeilBrown
88d5baf690
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Thorsten Blum
2df2c6ed89 btrfs: replace deprecated strncpy() with strscpy()
strncpy() is deprecated for NUL-terminated destination buffers. Use
strscpy() instead and don't zero-initialize the param array.

Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-26 09:24:01 +01:00
Naohiro Aota
5a4041f2c4 btrfs: zoned: fix extent range end unlock in cow_file_range()
Running generic/751 on the for-next branch often results in a hang like
below. They are both stack by locking an extent. This suggests someone
forget to unlock an extent.

  INFO: task kworker/u128:1:12 blocked for more than 323 seconds.
        Not tainted 6.13.0-BTRFS-ZNS+ #503
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u128:1  state:D stack:0     pid:12    tgid:12    ppid:2      flags:0x00004000
  Workqueue: btrfs-fixup btrfs_work_helper [btrfs]
  Call Trace:
   <TASK>
   __schedule+0x534/0xdd0
   schedule+0x39/0x140
   __lock_extent+0x31b/0x380 [btrfs]
   ? __pfx_autoremove_wake_function+0x10/0x10
   btrfs_writepage_fixup_worker+0xf1/0x3a0 [btrfs]
   btrfs_work_helper+0xff/0x480 [btrfs]
   ? lock_release+0x178/0x2c0
   process_one_work+0x1ee/0x570
   ? srso_return_thunk+0x5/0x5f
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10b/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/u134:0:184 blocked for more than 323 seconds.
        Not tainted 6.13.0-BTRFS-ZNS+ #503
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u134:0  state:D stack:0     pid:184   tgid:184   ppid:2      flags:0x00004000
  Workqueue: writeback wb_workfn (flush-btrfs-4)
  Call Trace:
   <TASK>
   __schedule+0x534/0xdd0
   schedule+0x39/0x140
   __lock_extent+0x31b/0x380 [btrfs]
   ? __pfx_autoremove_wake_function+0x10/0x10
   find_lock_delalloc_range+0xdb/0x260 [btrfs]
   writepage_delalloc+0x12f/0x500 [btrfs]
   ? srso_return_thunk+0x5/0x5f
   extent_write_cache_pages+0x232/0x840 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xe7/0x260
   ? srso_return_thunk+0x5/0x5f
   ? lock_acquire+0xd2/0x300
   ? srso_return_thunk+0x5/0x5f
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode.part.0+0x102/0x250
   ? wbc_attach_and_unlock_inode.part.0+0x102/0x250
   __writeback_single_inode+0x5c/0x4b0
   writeback_sb_inodes+0x22d/0x550
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x2f6/0x3f0
   wb_workfn+0x32a/0x510
   process_one_work+0x1ee/0x570
   ? srso_return_thunk+0x5/0x5f
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10b/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

This happens because we have another success path for the zoned mode. When
there is no active zone available, btrfs_reserve_extent() returns
-EAGAIN. In this case, we have two reactions.

(1) If the given range is never allocated, we can only wait for someone
    to finish a zone, so wait on BTRFS_FS_NEED_ZONE_FINISH bit and retry
    afterward.

(2) Or, if some allocations are already done, we must bail out and let
    the caller to send IOs for the allocation. This is because these IOs
    may be necessary to finish a zone.

The commit 06f3642847 ("btrfs: do proper folio cleanup when
cow_file_range() failed") moved the unlock code from the inside of the
loop to the outside. So, previously, the allocated extents are unlocked
just after the allocation and so before returning from the function.
However, they are no longer unlocked on the case (2) above. That caused
the hang issue.

Fix the issue by modifying the 'end' to the end of the allocated
range. Then, we can exit the loop and the same unlock code can properly
handle the case.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 06f3642847 ("btrfs: do proper folio cleanup when cow_file_range() failed")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-26 09:23:57 +01:00
Linus Torvalds
cc8a0934d0 for-6.14-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAme95g8ACgkQxWXV+ddt
 WDvi3g//V55iBXnPv0Jrs7b95GRskYv8A4vJsZhGtub4PlcEh8S6Q1IoU3qwiKHv
 E2THDA/A14qetxh3tSo73+RdS3JHpIH4QKjO54k74gOh45OEUs4Lq8NBAujmpz4b
 BMZZnM5iyZipNfbebUa/XxlPLvHg8D2rUqwycS/A0c5BE56HTvVzmKL3RdUfkAvA
 uZaJa6FOKfr6ge3ikl/dm+Rl7f+ZymIK4T9XsW3Lt223siYvcLJvWEIL0tk9B1y/
 ZUQNqPOCHY0mX/zPC0425LoeH3LWDPyZPCakaY8tiwI20p/sP+hPLBC8WDrJvoam
 losu6v8EqkYK9zND/ETVq3d1Y9mzub/soKuM+aDQ/UM0JXz1vI3RYQcpskECR0Gf
 ZPq5tv+dSBbMmscvkxnkuNBaTr3IbOhkxaKwOvdoRN9F4HbmhgxTscshaQHklmiG
 4qRx2HtW9Zw8ufyLUFUYaRAj45eFDZMQStQMCNSECD8X+fS6CPGUqGFcuXrm+kLL
 v6k0cbvh1NOLSchqtfR4rochJFUp5veUNHoYQ7YRy3CqV1yrF7iM1e0G1rvyOQYQ
 9tpN93IYwLItRdUjtqyS/q8WOddRTo0LTqh5HDXPnLd3jc/kO7KjHv9dJna7wyhO
 MUJmLlpy1dRDHCvTl70oF0Nxe4Ve20n7U2QayF5bMGtCmQnzGL0=
 =4+6s
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - extent map shrinker fixes:
     - fix potential use after free accessing an inode to reach fs_info,
       the shrinker could do iput() in the meantime
     - skip unnecessary scanning of inodes without extent maps
     - do direct iput(), no need for indirection via workqueue

 - in block < page mode, fix race when extending i_size in buffered mode

 - fix minor memory leak in selftests

 - print descriptive error message when seeding device is not found

* tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix data overwriting bug during buffered write when block size < page size
  btrfs: output an error message if btrfs failed to find the seed fsid
  btrfs: do regular iput instead of delayed iput during extent map shrinking
  btrfs: skip inodes without loaded extent maps when shrinking extent maps
  btrfs: fix use-after-free on inode when scanning root during em shrinking
  btrfs: selftests: fix btrfs_test_delayed_refs() leak of transaction
2025-02-25 09:42:15 -08:00
Qu Wenruo
efa11fd269 btrfs: fix data overwriting bug during buffered write when block size < page size
[BUG]
When running generic/418 with a btrfs whose block size < page size
(subpage cases), it always fails.

And the following minimal reproducer is more than enough to trigger it
reliably:

workload()
{
        mkfs.btrfs -s 4k -f $dev > /dev/null
        dmesg -C
        mount $dev $mnt
        $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
        ret=$?
        umount $mnt
        stop_trace
        if [ $ret -ne 0 ]; then
                fail
        fi
}

for (( i = 0; i < 1024; i++)); do
        echo "=== $i/$runtime ==="
        workload
done

[CAUSE]
With extra trace printk added to the following functions:
- btrfs_buffered_write()
  * Which folio is touched
  * The file offset (start) where the buffered write is at
  * How many bytes are copied
  * The content of the write (the first 2 bytes)

- submit_one_sector()
  * Which folio is touched
  * The position inside the folio
  * The content of the page cache (the first 2 bytes)

- pagecache_isize_extended()
  * The parameters of the function itself
  * The parameters of the folio_zero_range()

Which are enough to show the problem:

  22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
  22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
  22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
  22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
  22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
  22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
  22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101

The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write with 0x01 at offset 0, 4096 and 8192, do a fsync, then do a direct read,
and compare the read buffer with the write buffer.

Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.

But at submit_one_sector(), at file offset 4096, the content is zeroed
out, by pagecache_isize_extended().

The race happens like this:
 Thread A is writing into range [4K, 8K).
 Thread B is writing into range [8K, 12k).

               Thread A              |         Thread B
-------------------------------------+------------------------------------
btrfs_buffered_write()               | btrfs_buffered_write()
|- old_isize = 4K;                   | |- old_isize = 4096;
|- btrfs_inode_lock()                | |
|- write into folio range [4K, 8K)   | |
|- pagecache_isize_extended()        | |
|  extend isize from 4096 to 8192    | |
|  no folio_zero_range() called      | |
|- btrfs_inode_lock()                | |
                                     | |- btrfs_inode_lock()
				     | |- write into folio range [8K, 12K)
				     | |- pagecache_isize_extended()
				     | |  calling folio_zero_range(4K, 8K)
				     | |  This is caused by the old_isize is
				     | |  grabbed too early, without any
				     | |  inode lock.
				     | |- btrfs_inode_unlock()

The @old_isize is grabbed without inode lock, causing race between two
buffered write threads and making pagecache_isize_extended() to zero
range which is still containing cached data.

And this is only affecting subpage btrfs, because for regular blocksize
== page size case, the function pagecache_isize_extended() will do
nothing if the block size >= page size.

[FIX]
Grab the old i_size while holding the inode lock.
This means each buffered write thread will have a stable view of the
old inode size, thus avoid the above race.

CC: stable@vger.kernel.org # 5.15+
Fixes: 5e8b9ef303 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-21 09:32:24 +01:00
Qu Wenruo
b1bf18223a btrfs: output an error message if btrfs failed to find the seed fsid
[BUG]
If btrfs failed to locate the seed device for whatever reason, mounting
the sprouted device will fail without any meaning error message:

  # mkfs.btrfs -f /dev/test/scratch1
  # btrfstune -S1 /dev/test/scratch1
  # mount /dev/test/scratch1 /mnt/btrfs
  # btrfs dev add -f /dev/test/scratch2 /mnt/btrfs
  # umount /mnt/btrfs
  # btrfs dev scan -u
  # btrfs mount /dev/test/scratch2 /mnt/btrfs
  mount: /mnt/btrfs: fsconfig system call failed: No such file or directory.
        dmesg(1) may have more information after failed mount system call.
  # dmesg -t | tail -n6
  BTRFS info (device dm-5): first mount of filesystem 64252ded-5953-4868-b962-cea48f7ac4ea
  BTRFS info (device dm-5): using crc32c (crc32c-generic) checksum algorithm
  BTRFS info (device dm-5): using free-space-tree
  BTRFS error (device dm-5): failed to read chunk tree: -2
  BTRFS error (device dm-5): open_ctree failed: -2

[CAUSE]
The failure to mount is pretty straight forward, just unable to find the
seed device and its fsid, caused by `btrfs dev scan -u`.

But the lack of any useful info is a problem.

[FIX]
Just add an extra error message in open_seed_devices() to indicate the
error.

Now the error message would look like this:

 BTRFS info (device dm-4): first mount of filesystem 7769223d-4db1-4e4c-ac29-0a96f53576ab
 BTRFS info (device dm-4): using crc32c (crc32c-generic) checksum algorithm
 BTRFS info (device dm-4): using free-space-tree
 BTRFS error (device dm-4): failed to find fsid e87c12e6-584b-4e98-8b88-962c33a619ff when attempting to open seed devices
 BTRFS error (device dm-4): failed to read chunk tree: -2
 BTRFS error (device dm-4): open_ctree failed: -2

Link: https://github.com/kdave/btrfs-progs/issues/959
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-21 09:32:16 +01:00
Filipe Manana
15b3b3254d btrfs: do regular iput instead of delayed iput during extent map shrinking
The extent map shrinker now runs in the system unbound workqueue and no
longer in kswapd context so it can directly do an iput() on inodes even
if that blocks or needs to acquire any lock (we aren't holding any locks
when requesting the delayed iput from the shrinker). So we don't need to
add a delayed iput, wake up the cleaner and delegate the iput() to the
cleaner, which also adds extra contention on the spinlock that protects
the delayed iputs list.

Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-21 09:32:11 +01:00
Filipe Manana
c6c9c4d564 btrfs: skip inodes without loaded extent maps when shrinking extent maps
If there are inodes that don't have any loaded extent maps, we end up
grabbing a reference on them and later adding a delayed iput, which wakes
up the cleaner and makes it do unnecessary work. This is common when for
example the inodes were open only to run stat(2) or all their extent maps
were already released through the folio release callback
(btrfs_release_folio()) or released by a previous run of the shrinker, or
directories which never have extent maps.

Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-21 09:32:07 +01:00
Filipe Manana
59f37036bb btrfs: fix use-after-free on inode when scanning root during em shrinking
At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a
call to btrfs_fs_closing() after we have scheduled the inode for a delayed
iput, and that can result in a use-after-free on the inode in case the
cleaner kthread does the iput before we dereference the inode in the call
to btrfs_fs_closing().

Fix this by using the fs_info stored already in a local variable instead
of doing inode->root->fs_info.

Fixes: 1020443840 ("btrfs: make the extent map shrinker run asynchronously as a work queue job")
CC: stable@vger.kernel.org # 6.13+
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-21 09:31:48 +01:00
David Disseldorp
290237fde9 btrfs: selftests: fix btrfs_test_delayed_refs() leak of transaction
The btrfs_transaction struct leaks, which can cause sporadic fstests
failures when kmemleak checking is enabled:

kmemleak: 5 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810fdc6c00 (size 512):
  comm "modprobe", pid 203, jiffies 4294892552
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 6736050f):
    __kmalloc_cache_noprof+0x133/0x2c0
    btrfs_test_delayed_refs+0x6f/0xbb0 [btrfs]
    btrfs_run_sanity_tests.cold+0x91/0xf9 [btrfs]
    0xffffffffa02fd055
    do_one_initcall+0x49/0x1c0
    do_init_module+0x5b/0x1f0
    init_module_from_file+0x70/0x90
    idempotent_init_module+0xe8/0x2c0
    __x64_sys_finit_module+0x6b/0xd0
    do_syscall_64+0x54/0x110
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

The transaction struct was initially stack-allocated but switched to
heap following frame size compiler warnings.

Fixes: 2b34879d97 ("btrfs: selftests: add delayed ref self test cases")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-17 17:24:14 +01:00
Linus Torvalds
945ce413ac for-6.14-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmeuSwwACgkQxWXV+ddt
 WDtQ8Q//fsTAu1DLjeVrzMhVNswGwr3PgWzLk3PqWTDEG9UeR/jJYntIPglVyhhP
 Mp2E3CYe2rlWwK0K3PITDu179tLrnCvbfKEPwWvyVZw5D0EDjPQYs9/H5ztSE8O4
 4i3kv2LjlXHE3h62tjNoeHL4NK1SRJcFeH69XhhIe0ELvTQVarvfJupZwdQQivWg
 sDlQXklXxl1kEtHVGnmz6jd09a0vti7xw8MAG6QiIP83Hvt6Ie+NLfTfTCkRIWSK
 95mPM+1YhmLQe15sD8xjHyYmH5E0cEXQh1Pvlz6xqQWRvZERG8Pmj+iwFTLaw4iA
 JR6sN2/KFgXE9OIGbFqQ+dvm++2hWcnPwW+h6EdOSj0DQkupbJm4VeBK0WQ4YZ+x
 Q0OQXPTfGpcjp7KyJrT6EZFq5VxeEfOz4hozhiCSTs+Xpx7Oh/2THL01N/dUMn0C
 SNR9E4/Rlq7rWV7euGwicwo/tZZIdCr4ihUGk4jpamlUbIXj+2SrOc4cpQdypmsO
 DeYvwzIXnPe8/Eo3rZ5ej0DK7GxfEFyd6v6l0oS6HepvMJ6y6/eiOYteVbGpvhXv
 J2M6PLstiZc152VHPApN9+ZlXBeGjyMfxLcsweblpSBBt/57otY6cMhqNuIp0j9B
 0zP3KKOwrIJ8tzcwjMSH+2OZsDQ7oc7eiJI08r0IcpCbCBTIeyE=
 =0hI+
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix stale page cache after race between readahead and direct IO write

 - fix hole expansion when writing at an offset beyond EOF, the range
   will not be zeroed

 - use proper way to calculate offsets in folio ranges

* tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix hole expansion when writing at an offset beyond EOF
  btrfs: fix stale page cache after race between readahead and direct IO write
  btrfs: fix two misuses of folio_shift()
2025-02-13 12:06:29 -08:00
Filipe Manana
da2dccd745 btrfs: fix hole expansion when writing at an offset beyond EOF
At btrfs_write_check() if our file's i_size is not sector size aligned and
we have a write that starts at an offset larger than the i_size that falls
within the same page of the i_size, then we end up not zeroing the file
range [i_size, write_offset).

The code is this:

    start_pos = round_down(pos, fs_info->sectorsize);
    oldsize = i_size_read(inode);
    if (start_pos > oldsize) {
        /* Expand hole size to cover write data, preventing empty gap */
        loff_t end_pos = round_up(pos + count, fs_info->sectorsize);

        ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
        if (ret)
            return ret;
    }

So if our file's i_size is 90269 bytes and a write at offset 90365 bytes
comes in, we get 'start_pos' set to 90112 bytes, which is less than the
i_size and therefore we don't zero out the range [90269, 90365) by
calling btrfs_cont_expand().

This is an old bug introduced in commit 9036c10208 ("Btrfs: update hole
handling v2"), from 2008, and the buggy code got moved around over the
years.

Fix this by discarding 'start_pos' and comparing against the write offset
('pos') without any alignment.

This bug was recently exposed by test case generic/363 which tests this
scenario by polluting ranges beyond EOF with an mmap write and than verify
that after a file increases we get zeroes for the range which is supposed
to be a hole and not what we wrote with the previous mmaped write.

We're only seeing this exposed now because generic/363 used to run only
on xfs until last Sunday's fstests update.

The test was failing like this:

   $ ./check generic/363
   FSTYP         -- btrfs
   PLATFORM      -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb  3 12:28:46 WET 2025
   MKFS_OPTIONS  -- /dev/sdc
   MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

   generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
       --- tests/generic/363.out	2025-02-05 15:31:14.013646509 +0000
       +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad	2025-02-05 17:25:33.112630781 +0000
       @@ -1 +1,46 @@
        QA output created by 363
       +READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
       +OFFSET      GOOD    BAD     RANGE
       +0x1609d     0x0000  0x3104  0x0
       +operation# (mod 256) for the bad data may be 4
       +0x1609e     0x0000  0x0472  0x1
       +operation# (mod 256) for the bad data may be 4
       ...
       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad'  to see the entire diff)
   Ran: generic/363
   Failures: generic/363
   Failed 1 of 1 tests

Fixes: 9036c10208 ("Btrfs: update hole handling v2")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-11 23:09:03 +01:00
Filipe Manana
acc18e1c1d btrfs: fix stale page cache after race between readahead and direct IO write
After commit ac325fc2aa ("btrfs: do not hold the extent lock for entire
read") we can now trigger a race between a task doing a direct IO write
and readahead. When this race is triggered it results in tasks getting
stale data when they attempt do a buffered read (including the task that
did the direct IO write).

This race can be sporadically triggered with test case generic/418, failing
like this:

   $ ./check generic/418
   FSTYP         -- btrfs
   PLATFORM      -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb  3 12:28:46 WET 2025
   MKFS_OPTIONS  -- /dev/sdc
   MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

   generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
       --- tests/generic/418.out	2020-06-10 19:29:03.850519863 +0100
       +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad	2025-02-03 15:42:36.974609476 +0000
       @@ -1,2 +1,5 @@
        QA output created by 418
       +cmpbuf: offset 0: Expected: 0x1, got 0x0
       +[6:0] FAIL - comparison failed, offset 24576
       +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
        Silence is golden
       ...
       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad'  to see the entire diff)
   Ran: generic/418
   Failures: generic/418
   Failed 1 of 1 tests

The race happens like this:

1) A file has a prealloc extent for the range [16K, 28K);

2) Task A starts a direct IO write against file range [24K, 28K).
   At the start of the direct IO write it invalidates the page cache at
   __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
   offset 24K;

3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
   [24K, 28K);

4) Task B starts a readahead for file range [16K, 28K), entering
   btrfs_readahead().

   First it attempts to read the page at offset 16K by entering
   btrfs_do_readpage(), where it calls get_extent_map(), locks the range
   [16K, 20K) and gets the extent map for the range [16K, 28K), caching
   it into the 'em_cached' variable declared in the local stack of
   btrfs_readahead(), and then unlocks the range [16K, 20K).

   Since the extent map has the prealloc flag, at btrfs_do_readpage() we
   zero out the page's content and don't submit any bio to read the page
   from the extent.

   Then it attempts to read the page at offset 20K entering
   btrfs_do_readpage() where we reuse the previously cached extent map
   (decided by get_extent_map()) since it spans the page's range and
   it's still in the inode's extent map tree.

   Just like for the previous page, we zero out the page's content since
   the extent map has the prealloc flag set.

   Then it attempts to read the page at offset 24K entering
   btrfs_do_readpage() where we reuse the previously cached extent map
   (decided by get_extent_map()) since it spans the page's range and
   it's still in the inode's extent map tree.

   Just like for the previous pages, we zero out the page's content since
   the extent map has the prealloc flag set. Note that we didn't lock the
   extent range [24K, 28K), so we didn't synchronize with the ongoing
   direct IO write being performed by task A;

5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
   for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
   BTRFS_ORDERED_PREALLOC set;

6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();

7) The ordered extent enters btrfs_finish_one_ordered() and locks the
   range [24K, 28K);

8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
   to invalidate the page at offset 24K by calling
   kiocb_invalidate_post_direct_write(), resulting in a call chain that
   ends up at btrfs_release_folio().

   The btrfs_release_folio() call ends up returning false because the range
   for the page at file offset 24K is currently locked by the task doing
   the ordered extent completion in the previous step (7), so we have:

   btrfs_release_folio() ->
      __btrfs_release_folio() ->
         try_release_extent_mapping() ->
	     try_release_extent_state()

   This last function checking that the range is locked and returning false
   and propagating it up to btrfs_release_folio().

   So this results in a failure to invalidate the page and
   kiocb_invalidate_post_direct_write() triggers this message logged in
   dmesg:

     Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!

   After this we leave the page cache with stale data for the file range
   [24K, 28K), filled with zeroes instead of the data written by direct IO
   write (all bytes with a 0x01 value), so any task attempting to read with
   buffered IO, including the task that did the direct IO write, will get
   all bytes in the range with a 0x00 value instead of the written data.

Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
at the two callers of btrfs_do_readpage() instead of doing it at
get_extent_map(), just like we did before commit ac325fc2aa ("btrfs: do
not hold the extent lock for entire read"), and unlocking the range after
all the calls to btrfs_do_readpage(). This way we never reuse a cached
extent map without flushing any pending ordered extents from a concurrent
direct IO write.

Fixes: ac325fc2aa ("btrfs: do not hold the extent lock for entire read")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-11 23:09:03 +01:00
Matthew Wilcox (Oracle)
01af106a07 btrfs: fix two misuses of folio_shift()
It is meaningless to shift a byte count by folio_shift().  The folio index
is in units of PAGE_SIZE, not folio_size().  We can use folio_contains()
to make this work for arbitrary-order folios, so remove the assertion
that the folios are of order 0.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-07 20:51:18 +01:00
Linus Torvalds
92514ef226 for-6.14-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmejbmkACgkQxWXV+ddt
 WDuXuQ/8DETuhqww7hwrDgHyDHRgj/783Oy1+V/jJPQZ1hpjAbSJBU7aKxneAMMc
 Gj4ExbDfjd8kRAvsE71/1McJqVBd6CiuBG/k65vjqVS4lM8mEasVefCa4OsOR3iB
 gGzQeNbrDzIs3IOg8hM1l1iPDkI/AyOyeysD/qNMdWO4mcKEMwrBYkQJhL6+DzfE
 BKVP+NAdK4iv/W/EngAqvcd7mq2RdxeR9nBesHnKTzPCVFf6bBT3b3Qem/ovY6Td
 ZNQLjx9GXumDj0jSiyRI5rxjSYOrqSU4JV+1C7ghOwBj2uD2SZxAvghuRSuUTKnN
 9/9x1RNO7i+FE3GHzYWShhHkNuuLZspmi2J1neSttQ5Jy/PFJR/tUAED5zY6Nl0X
 73GWSzLlmaJ+E77DkTJksBYmRrnwtA6plMKD4umDCHw0lNu/0GCvxtZY1T1ZiKy2
 yK37Ja559+k7RPIdKoyHo81A7num4gLeBZTqd9F/XPjU26b57Qnqk1LetgGeT8xk
 IZFQtz9HdmvLwwbxWwNvp1ttRf+1dj1lpVnb5n6r0d8Uyta9tgQpDvwhNjluBsEx
 AQxK9yUZ5kAXEEbEyDwoOsZ5yjwkaMzqRpQWWappb0jCLm5dADI87odFRaOSlgWS
 WoXL6Vbod8G5vaEVwPDl6yuSS+609c7M8ftBlgvcx3XZ5/N6y2M=
 =YjCE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - add lockdep annotation for relocation root to fix a splat warning
   while merging roots

 - fix assertion failure when splitting ordered extent after transaction
   abort

 - don't print 'qgroup inconsistent' message when rescan process updates
   qgroup data sooner than the subvolume deletion process

 - fix use-after-free (accessing the error number) when attempting to
   join an aborted transaction

 - avoid starting new transaction if not necessary when cleaning qgroup
   during subvolume drop

* tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: avoid starting new transaction when cleaning qgroup during subvolume drop
  btrfs: fix use-after-free when attempting to join an aborted transaction
  btrfs: do not output error message if a qgroup has been already cleaned up
  btrfs: fix assertion failure when splitting ordered extent after transaction abort
  btrfs: fix lockdep splat while merging a relocation root
2025-02-05 08:13:07 -08:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleaups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permite userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Kairui Song
27701521be mm, swap: clean up device availability check
Remove highest_bit and lowest_bit.  After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.

Link: https://lkml.kernel.org/r/20250113175732.48099-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chis Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:36 -08:00
Luiz Capitulino
6bf9b5b40a mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from
alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function.

Now that only the *_array() flavour of the API remains, we can do the
following renaming (along with the _noprof() ones):

  alloc_pages_bulk_array -> alloc_pages_bulk
  alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy
  alloc_pages_bulk_array_node -> alloc_pages_bulk_node

Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:31 -08:00
Linus Torvalds
8883957b3c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmePs7oACgkQnJ2qBz9k
 QNmHuAf9GkLnY5u1/81xP5V9ukZ4N2yeMW0dydLS5cjWj/St5ELeMAza3jeqtJtD
 j36vbnmy2c5pPaGLAK8BJpMXT/R2TkmmKD004zcfqF2S3SgbGzdgO1zMZzq9KJpM
 woRKZtLuglDajedsDEBBcKotBhlN2+C/sQlFuL1mX4zitk9ajr0qYUB1+JqOeg5f
 qwPsDLT077ADpxd7lVIMcm+OqbduP5KWkBKYHpn7lJcLe1eqVMMzceJroW42zhVG
 Dq8Iln26bbU9Wx6FSPFCUcHEzHRHUfXmu07HN9U0X++0QgWjrmBQQLooGFB/bR4a
 edBrPpVas6xE4/brjgFX3gOKtv8xYg==
 =ewDV
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify pre-content notification support from Jan Kara:
 "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
  generated before a file contents is accessed.

  The event is synchronous so if there is listener for this event, the
  kernel waits for reply. On success the execution continues as usual,
  on failure we propagate the error to userspace. This allows userspace
  to fill in file content on demand from slow storage. The context in
  which the events are generated has been picked so that we don't hold
  any locks and thus there's no risk of a deadlock for the userspace
  handler.

  The new pre-content event is available only for users with global
  CAP_SYS_ADMIN capability (similarly to other parts of fanotify
  functionality) and it is an administrator responsibility to make sure
  the userspace event handler doesn't do stupid stuff that can DoS the
  system.

  Based on your feedback from the last submission, fsnotify code has
  been improved and now file->f_mode encodes whether pre-content event
  needs to be generated for the file so the fast path when nobody wants
  pre-content event for the file just grows the additional file->f_mode
  check. As a bonus this also removes the checks whether the old
  FS_ACCESS event needs to be generated from the fast path. Also the
  place where the event is generated during page fault has been moved so
  now filemap_fault() generates the event if and only if there is no
  uptodate folio in the page cache.

  Also we have dropped FS_PRE_MODIFY event as current real-world users
  of the pre-content functionality don't really use it so let's start
  with the minimal useful feature set"

* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
  fanotify: Fix crash in fanotify_init(2)
  fs: don't block write during exec on pre-content watched files
  fs: enable pre-content events on supported file systems
  ext4: add pre-content fsnotify hook for DAX faults
  btrfs: disable defrag on pre-content watched files
  xfs: add pre-content fsnotify hook for DAX faults
  fsnotify: generate pre-content permission event on page fault
  mm: don't allow huge faults for files with pre content watches
  fanotify: disable readahead if we have pre-content watches
  fanotify: allow to set errno in FAN_DENY permission response
  fanotify: report file range info with pre-content events
  fanotify: introduce FAN_PRE_ACCESS permission event
  fsnotify: generate pre-content permission event on truncate
  fsnotify: pass optional file access range in pre-content event
  fsnotify: introduce pre-content permission events
  fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
  fanotify: rename a misnamed constant
  fanotify: don't skip extra event info if no info_mode is set
  fsnotify: check if file is actually being watched for pre-content events on open
  fsnotify: opt-in for permission events at file open time
  ...
2025-01-23 13:36:06 -08:00
Filipe Manana
fdef89ce6f btrfs: avoid starting new transaction when cleaning qgroup during subvolume drop
At btrfs_qgroup_cleanup_dropped_subvolume() all we want to commit the
current transaction in order to have all the qgroup rfer/excl numbers up
to date. However we are using btrfs_start_transaction(), which joins the
current transaction if there is one that is not yet committing, but also
starts a new one if there is none or if the current one is already
committing (its state is >= TRANS_STATE_COMMIT_START). This later case
results in unnecessary IO, wasting time and a pointless rotation of the
backup roots in the super block.

So instead of using btrfs_start_transaction() followed by a
btrfs_commit_transaction(), use btrfs_commit_current_transaction() which
achieves our purpose and avoids starting and committing new transactions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-23 22:34:17 +01:00
Filipe Manana
e2f0943cf3 btrfs: fix use-after-free when attempting to join an aborted transaction
When we are trying to join the current transaction and if it's aborted,
we read its 'aborted' field after unlocking fs_info->trans_lock and
without holding any extra reference count on it. This means that a
concurrent task that is aborting the transaction may free the transaction
before we read its 'aborted' field, leading to a use-after-free.

Fix this by reading the 'aborted' field while holding fs_info->trans_lock
since any freeing task must first acquire that lock and set
fs_info->running_transaction to NULL before freeing the transaction.

This was reported by syzbot and Dmitry with the following stack traces
from KASAN:

   ==================================================================
   BUG: KASAN: slab-use-after-free in join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
   Read of size 4 at addr ffff888011839024 by task kworker/u4:9/1128

   CPU: 0 UID: 0 PID: 1128 Comm: kworker/u4:9 Not tainted 6.13.0-rc7-syzkaller-00019-gc45323b7560e #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Workqueue: events_unbound btrfs_async_reclaim_data_space
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:378 [inline]
    print_report+0x169/0x550 mm/kasan/report.c:489
    kasan_report+0x143/0x180 mm/kasan/report.c:602
    join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
    start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
    flush_space+0x448/0xcf0 fs/btrfs/space-info.c:803
    btrfs_async_reclaim_data_space+0x159/0x510 fs/btrfs/space-info.c:1321
    process_one_work kernel/workqueue.c:3236 [inline]
    process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
    worker_thread+0x870/0xd30 kernel/workqueue.c:3398
    kthread+0x2f0/0x390 kernel/kthread.c:389
    ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    </TASK>

   Allocated by task 5315:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
    __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
    kasan_kmalloc include/linux/kasan.h:260 [inline]
    __kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4329
    kmalloc_noprof include/linux/slab.h:901 [inline]
    join_transaction+0x144/0xda0 fs/btrfs/transaction.c:308
    start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
    btrfs_create_common+0x1b2/0x2e0 fs/btrfs/inode.c:6572
    lookup_open fs/namei.c:3649 [inline]
    open_last_lookups fs/namei.c:3748 [inline]
    path_openat+0x1c03/0x3590 fs/namei.c:3984
    do_filp_open+0x27f/0x4e0 fs/namei.c:4014
    do_sys_openat2+0x13e/0x1d0 fs/open.c:1402
    do_sys_open fs/open.c:1417 [inline]
    __do_sys_creat fs/open.c:1495 [inline]
    __se_sys_creat fs/open.c:1489 [inline]
    __x64_sys_creat+0x123/0x170 fs/open.c:1489
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Freed by task 5336:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
    poison_slab_object mm/kasan/common.c:247 [inline]
    __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
    kasan_slab_free include/linux/kasan.h:233 [inline]
    slab_free_hook mm/slub.c:2353 [inline]
    slab_free mm/slub.c:4613 [inline]
    kfree+0x196/0x430 mm/slub.c:4761
    cleanup_transaction fs/btrfs/transaction.c:2063 [inline]
    btrfs_commit_transaction+0x2c97/0x3720 fs/btrfs/transaction.c:2598
    insert_balance_item+0x1284/0x20b0 fs/btrfs/volumes.c:3757
    btrfs_balance+0x992/0x10c0 fs/btrfs/volumes.c:4633
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:906 [inline]
    __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   The buggy address belongs to the object at ffff888011839000
    which belongs to the cache kmalloc-2k of size 2048
   The buggy address is located 36 bytes inside of
    freed 2048-byte region [ffff888011839000, ffff888011839800)

   The buggy address belongs to the physical page:
   page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11838
   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
   page_type: f5(slab)
   raw: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
   raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
   head: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
   head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
   head: 00fff00000000003 ffffea0000460e01 ffffffffffffffff 0000000000000000
   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
   page dumped because: kasan: bad access detected
   page_owner tracks the page as allocated
   page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 57, tgid 57 (kworker/0:2), ts 67248182943, free_ts 67229742023
    set_page_owner include/linux/page_owner.h:32 [inline]
    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1558
    prep_new_page mm/page_alloc.c:1566 [inline]
    get_page_from_freelist+0x365c/0x37a0 mm/page_alloc.c:3476
    __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4753
    alloc_pages_mpol_noprof+0x3e1/0x780 mm/mempolicy.c:2269
    alloc_slab_page+0x6a/0x110 mm/slub.c:2423
    allocate_slab+0x5a/0x2b0 mm/slub.c:2589
    new_slab mm/slub.c:2642 [inline]
    ___slab_alloc+0xc27/0x14a0 mm/slub.c:3830
    __slab_alloc+0x58/0xa0 mm/slub.c:3920
    __slab_alloc_node mm/slub.c:3995 [inline]
    slab_alloc_node mm/slub.c:4156 [inline]
    __do_kmalloc_node mm/slub.c:4297 [inline]
    __kmalloc_node_track_caller_noprof+0x2e9/0x4c0 mm/slub.c:4317
    kmalloc_reserve+0x111/0x2a0 net/core/skbuff.c:609
    __alloc_skb+0x1f3/0x440 net/core/skbuff.c:678
    alloc_skb include/linux/skbuff.h:1323 [inline]
    alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
    sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2884
    sock_alloc_send_skb include/net/sock.h:1803 [inline]
    mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747
    add_grhead net/ipv6/mcast.c:1850 [inline]
    add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988
    mld_send_cr net/ipv6/mcast.c:2114 [inline]
    mld_ifc_work+0x691/0xd90 net/ipv6/mcast.c:2651
   page last free pid 5300 tgid 5300 stack trace:
    reset_page_owner include/linux/page_owner.h:25 [inline]
    free_pages_prepare mm/page_alloc.c:1127 [inline]
    free_unref_page+0xd3f/0x1010 mm/page_alloc.c:2659
    __slab_free+0x2c2/0x380 mm/slub.c:4524
    qlink_free mm/kasan/quarantine.c:163 [inline]
    qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
    kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
    __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
    kasan_slab_alloc include/linux/kasan.h:250 [inline]
    slab_post_alloc_hook mm/slub.c:4119 [inline]
    slab_alloc_node mm/slub.c:4168 [inline]
    __do_kmalloc_node mm/slub.c:4297 [inline]
    __kmalloc_noprof+0x236/0x4c0 mm/slub.c:4310
    kmalloc_noprof include/linux/slab.h:905 [inline]
    kzalloc_noprof include/linux/slab.h:1037 [inline]
    fib_create_info+0xc14/0x25b0 net/ipv4/fib_semantics.c:1435
    fib_table_insert+0x1f6/0x1f20 net/ipv4/fib_trie.c:1231
    fib_magic+0x3d8/0x620 net/ipv4/fib_frontend.c:1112
    fib_add_ifaddr+0x40c/0x5e0 net/ipv4/fib_frontend.c:1156
    fib_netdev_event+0x375/0x490 net/ipv4/fib_frontend.c:1494
    notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
    __dev_notify_flags+0x207/0x400
    dev_change_flags+0xf0/0x1a0 net/core/dev.c:9045
    do_setlink+0xc90/0x4210 net/core/rtnetlink.c:3109
    rtnl_changelink net/core/rtnetlink.c:3723 [inline]
    __rtnl_newlink net/core/rtnetlink.c:3875 [inline]
    rtnl_newlink+0x1bb6/0x2210 net/core/rtnetlink.c:4012

   Memory state around the buggy address:
    ffff888011838f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff888011838f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   >ffff888011839000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
    ffff888011839080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888011839100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ==================================================================

Reported-by: syzbot+45212e9d87a98c3f5b42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/678e7da5.050a0220.303755.007c.GAE@google.com/
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/linux-btrfs/CACT4Y+ZFBdo7pT8L2AzM=vegZwjp-wNkVJZQf0Ta3vZqtExaSw@mail.gmail.com/
Fixes: 871383be59 ("btrfs: add missing unlocks to transaction abort paths")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-23 22:34:14 +01:00
Qu Wenruo
c9c8637933 btrfs: do not output error message if a qgroup has been already cleaned up
[BUG]
There is a bug report that btrfs outputs the following error message:

  BTRFS info (device nvme0n1p2): qgroup scan completed (inconsistency flag cleared)
  BTRFS warning (device nvme0n1p2): failed to cleanup qgroup 0/1179: -2

[CAUSE]
The error itself is pretty harmless, and the end user should ignore it.

When a subvolume is fully dropped, btrfs will call
btrfs_qgroup_cleanup_dropped_subvolume() to delete the qgroup.

However if a qgroup rescan happened before a subvolume fully dropped,
qgroup for that subvolume will not be re-created, as rescan will only
create new qgroup if there is a BTRFS_ROOT_REF_KEY found.

But before we drop a subvolume, the subvolume is unlinked thus there is no
BTRFS_ROOT_REF_KEY.

In that case, btrfs_remove_qgroup() will fail with -ENOENT and trigger
the above error message.

[FIX]
Just ignore -ENOENT error from btrfs_remove_qgroup() inside
btrfs_qgroup_cleanup_dropped_subvolume().

Reported-by: John Shand <jshand2013@gmail.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1236056
Fixes: 839d6ea4f8 ("btrfs: automatically remove the subvolume qgroup")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-23 22:34:11 +01:00
Filipe Manana
0d85f5c2dd btrfs: fix assertion failure when splitting ordered extent after transaction abort
If while we are doing a direct IO write a transaction abort happens, we
mark all existing ordered extents with the BTRFS_ORDERED_IOERR flag (done
at btrfs_destroy_ordered_extents()), and then after that if we enter
btrfs_split_ordered_extent() and the ordered extent has bytes left
(meaning we have a bio that doesn't cover the whole ordered extent, see
details at btrfs_extract_ordered_extent()), we will fail on the following
assertion at btrfs_split_ordered_extent():

   ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS));

because the BTRFS_ORDERED_IOERR flag is set and the definition of
BTRFS_ORDERED_TYPE_FLAGS is just the union of all flags that identify the
type of write (regular, nocow, prealloc, compressed, direct IO, encoded).

Fix this by returning an error from btrfs_extract_ordered_extent() if we
find the BTRFS_ORDERED_IOERR flag in the ordered extent. The error will
be the error that resulted in the transaction abort or -EIO if no
transaction abort happened.

This was recently reported by syzbot with the following trace:

   FAULT_INJECTION: forcing a failure.
   name failslab, interval 1, probability 0, space 0, times 1
   CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    fail_dump lib/fault-inject.c:53 [inline]
    should_fail_ex+0x3b0/0x4e0 lib/fault-inject.c:154
    should_failslab+0xac/0x100 mm/failslab.c:46
    slab_pre_alloc_hook mm/slub.c:4072 [inline]
    slab_alloc_node mm/slub.c:4148 [inline]
    __do_kmalloc_node mm/slub.c:4297 [inline]
    __kmalloc_noprof+0xdd/0x4c0 mm/slub.c:4310
    kmalloc_noprof include/linux/slab.h:905 [inline]
    kzalloc_noprof include/linux/slab.h:1037 [inline]
    btrfs_chunk_alloc_add_chunk_item+0x244/0x1100 fs/btrfs/volumes.c:5742
    reserve_chunk_space+0x1ca/0x2c0 fs/btrfs/block-group.c:4292
    check_system_chunk fs/btrfs/block-group.c:4319 [inline]
    do_chunk_alloc fs/btrfs/block-group.c:3891 [inline]
    btrfs_chunk_alloc+0x77b/0xf80 fs/btrfs/block-group.c:4187
    find_free_extent_update_loop fs/btrfs/extent-tree.c:4166 [inline]
    find_free_extent+0x42d1/0x5810 fs/btrfs/extent-tree.c:4579
    btrfs_reserve_extent+0x422/0x810 fs/btrfs/extent-tree.c:4672
    btrfs_new_extent_direct fs/btrfs/direct-io.c:186 [inline]
    btrfs_get_blocks_direct_write+0x706/0xfa0 fs/btrfs/direct-io.c:321
    btrfs_dio_iomap_begin+0xbb7/0x1180 fs/btrfs/direct-io.c:525
    iomap_iter+0x697/0xf60 fs/iomap/iter.c:90
    __iomap_dio_rw+0xeb9/0x25b0 fs/iomap/direct-io.c:702
    btrfs_dio_write fs/btrfs/direct-io.c:775 [inline]
    btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880
    btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397
    do_iter_readv_writev+0x600/0x880
    vfs_writev+0x376/0xba0 fs/read_write.c:1050
    do_pwritev fs/read_write.c:1146 [inline]
    __do_sys_pwritev2 fs/read_write.c:1204 [inline]
    __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f1281f85d29
   RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
   RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29
   RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005
   RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003
   R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002
   R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328
    </TASK>
   BTRFS error (device loop0 state A): Transaction aborted (error -12)
   BTRFS: error (device loop0 state A) in btrfs_chunk_alloc_add_chunk_item:5745: errno=-12 Out of memory
   BTRFS info (device loop0 state EA): forced readonly
   assertion failed: !(flags & ~BTRFS_ORDERED_TYPE_FLAGS), in fs/btrfs/ordered-data.c:1234
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:1234!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
   CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234
   RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246
   RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300
   RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
   RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4
   R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401
   R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000
   FS:  00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
    <TASK>
    btrfs_extract_ordered_extent fs/btrfs/direct-io.c:702 [inline]
    btrfs_dio_submit_io+0x4be/0x6d0 fs/btrfs/direct-io.c:737
    iomap_dio_submit_bio fs/iomap/direct-io.c:85 [inline]
    iomap_dio_bio_iter+0x1022/0x1740 fs/iomap/direct-io.c:447
    __iomap_dio_rw+0x13b7/0x25b0 fs/iomap/direct-io.c:703
    btrfs_dio_write fs/btrfs/direct-io.c:775 [inline]
    btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880
    btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397
    do_iter_readv_writev+0x600/0x880
    vfs_writev+0x376/0xba0 fs/read_write.c:1050
    do_pwritev fs/read_write.c:1146 [inline]
    __do_sys_pwritev2 fs/read_write.c:1204 [inline]
    __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f1281f85d29
   RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
   RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29
   RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005
   RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003
   R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002
   R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328
    </TASK>
   Modules linked in:
   ---[ end trace 0000000000000000 ]---
   RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234
   RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246
   RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300
   RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
   RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4
   R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401
   R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000
   FS:  00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

In this case the transaction abort was due to (an injected) memory
allocation failure when attempting to allocate a new chunk.

Reported-by: syzbot+f60d8337a5c8e8d92a77@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6777f2dd.050a0220.178762.0045.GAE@google.com/
Fixes: 52b1fdca23 ("btrfs: handle completed ordered extents in btrfs_split_ordered_extent")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-23 22:34:09 +01:00
Filipe Manana
a216542027 btrfs: fix lockdep splat while merging a relocation root
When COWing a relocation tree path, at relocation.c:replace_path(), we
can trigger a lockdep splat while we are in the btrfs_search_slot() call
against the relocation root. This happens in that callchain at
ctree.c:read_block_for_search() when we happen to find a child extent
buffer already loaded through the fs tree with a lockdep class set to
the fs tree. So when we attempt to lock that extent buffer through a
relocation tree we have to reset the lockdep class to the class for a
relocation tree, since a relocation tree has extent buffers that used
to belong to a fs tree and may currently be already loaded (we swap
extent buffers between the two trees at the end of replace_path()).

However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset
the lockdep class at ctree.c:read_block_for_search() before we read lock
an extent buffer, just like we did for btrfs_search_slot() in commit
b40130b23c ("btrfs: fix lockdep splat with reloc root extent buffers").

So add the missing btrfs_maybe_reset_lockdep_class() calls before the
attempts to read lock an extent buffer at ctree.c:read_block_for_search().

The lockdep splat was reported by syzbot and it looks like this:

   ======================================================
   WARNING: possible circular locking dependency detected
   6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted
   ------------------------------------------------------
   syz.0.0/5335 is trying to acquire lock:
   ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146

   but task is already holding lock:
   ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (btrfs-treloc-02/1){+.+.}-{4:4}:
          reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374
          __lock_release kernel/locking/lockdep.c:5563 [inline]
          lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870
          up_write+0x79/0x590 kernel/locking/rwsem.c:1629
          btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660
          btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
          btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
          replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
          merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
          merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
          relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:906 [inline]
          __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f

   -> #1 (btrfs-tree-01/1){+.+.}-{4:4}:
          lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
          down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693
          btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
          btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline]
          btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132
          btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573
          btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
          btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
          btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351
          btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline]
          btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330
          btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline]
          btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374
          vfs_rename+0xbdb/0xf00 fs/namei.c:5067
          do_renameat2+0xd94/0x13f0 fs/namei.c:5224
          __do_sys_renameat2 fs/namei.c:5258 [inline]
          __se_sys_renameat2 fs/namei.c:5255 [inline]
          __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f

   -> #0 (btrfs-tree-01){++++}-{4:4}:
          check_prev_add kernel/locking/lockdep.c:3161 [inline]
          check_prevs_add kernel/locking/lockdep.c:3280 [inline]
          validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
          __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
          lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
          down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
          btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
          btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
          read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
          btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
          replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
          merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
          merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
          relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:906 [inline]
          __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f

   other info that might help us debug this:

   Chain exists of:
     btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(btrfs-treloc-02/1);
                                  lock(btrfs-tree-01/1);
                                  lock(btrfs-treloc-02/1);
     rlock(btrfs-tree-01);

    *** DEADLOCK ***

   8 locks held by syz.0.0/5335:
    #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559
    #1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183
    #2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086
    #3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659
    #4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
    #5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
    #6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
    #7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189

   stack backtrace:
   CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
    check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
    check_prev_add kernel/locking/lockdep.c:3161 [inline]
    check_prevs_add kernel/locking/lockdep.c:3280 [inline]
    validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
    __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
    lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
    down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
    btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
    btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
    read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
    btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
    replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
    merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
    merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
    relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:906 [inline]
    __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f1ac6985d29
   Code: ff ff c3 (...)
   RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29
   RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007
   RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88
    </TASK>

Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/
Fixes: 99785998ed ("btrfs: reduce lock contention when eb cache miss for btree search")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-23 22:34:05 +01:00
Linus Torvalds
0eb4aaa230 for-6.14-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmeHvVQACgkQxWXV+ddt
 WDsJ6w//cPqI8tf3kMxurZcG7clJRIIISotPrC6hm3UDNpJLa7HDaVJ50FAoIhMV
 sB4RQNZky4mfB6ypXxmETzV3ZHvP0+oFgRs72Ommi0ZbdnBgxhaUTrDXLKl52o4r
 UoeqvRKReEYOesN09rPXYPwytUOkxHU/GjNzv7bC/Tzvq/xKaIN5qMYZwkHtJ8PK
 JtCFypfbmDPNDJz37l0BhRya2oMtpcUtxM9uP8RWVuQtaELgjcy56W/+osoyJTy9
 FSKaoWUPsDVDufnILlGR8Kub2Z5mcISVqyARUdr/q3j5CDfyTdQvahmUy7sHgUAe
 HGh5QBdRJu1QTvdZw+nK4YCaYpK6Nj4liDtO1cwVitde5RXsJrt6kYBLlY/kU2Qr
 KODOloM/zVKxULR0ARl11NULZquUsczP6Wxfn+dtyDJ3JGlY9OcuESmorHoUtkMX
 75Tj1AtRMNcfZAE2HquL1Oz3bIMcg4btDJsC+9Yp5K11SP12XpOwC42k/9Bx3iBe
 Iki0BSuppFqX5MMY3OEWzD1pz2vOGYR8ISD6EIsjpjl2vBeRwydaCCZfuszSC7gl
 Y4goSdwFMPVlqllL1h27XUjKVXvttCqqdB6P28MbvZKnFAPlm189BJQZC5cbHAJU
 ceBww5PvI9QxnJnFG5iOLcnko6liUWPP9l2c5LLtUsJIi8B5Hu0=
 =SXLv
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "User visible changes, features:

   - rebuilding of the free space tree at mount time is done in more
     transactions, fix potential hangs when the transaction thread is
     blocked due to large amount of block groups

   - more read IO balancing strategies (experimental config), add two
     new ways how to select a device for read if the profiles allow that
     (all RAID1*), the current default selects the device by pid which
     is good on average but less performant for single reader workloads

       - select preferred device for all reads (namely for testing)
       - round-robin, balance reads across devices relevant for the
         requested IO range

   - add encoded write ioctl support to io_uring (read was added in
     6.12), basis for writing send stream using that instead of
     syscalls, non-blocking mode is not yet implemented

   - support FS_IOC_READ_VERITY_METADATA, applications can use the
     metadata to do their own verification

   - pass inode's i_write_hint to bios, for parity with other
     filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT

  Core:

   - in zoned mode: allow to directly reclaim a block group by simply
     resetting it, then it can be reused and another block group does
     not need to be allocated

   - super block validation now also does more comprehensive sys array
     validation, adding it to the points where superblock is validated
     (post-read, pre-write)

   - subpage mode fixes:
      - fix double accounting of blocks due to some races
      - improved or fixed error handling in a few cases (compression,
        delalloc)

   - raid stripe tree:
      - fix various cases with extent range splitting or deleting
      - implement hole punching to extent range
      - reduce number of stripe tree lookups during bio submission
      - more self-tests

   - updated self-tests (delayed refs)

   - error handling improvements

   - cleanups, refactoring
      - remove rest of backref caching infrastructure from relocation,
        not needed anymore
      - error message updates
      - remove unnecessary calls when extent buffer was marked dirty
      - unused parameter removal
      - code moved to new files

  Other code changes: add rb_find_add_cached() to the rb-tree API"

* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
  btrfs: selftests: add a selftest for deleting two out of three extents
  btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
  btrfs: selftests: add selftest for punching holes into the RAID stripe extents
  btrfs: selftests: test RAID stripe-tree deletion spanning two items
  btrfs: selftests: don't split RAID extents in half
  btrfs: selftests: check for correct return value of failed lookup
  btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
  btrfs: implement hole punching for RAID stripe extents
  btrfs: fix deletion of a range spanning parts two RAID stripe extents
  btrfs: fix tail delete of RAID stripe-extents
  btrfs: fix front delete range calculation for RAID stripe extents
  btrfs: assert RAID stripe-extent length is always greater than 0
  btrfs: don't try to delete RAID stripe-extents if we don't need to
  btrfs: selftests: correct RAID stripe-tree feature flag setting
  btrfs: add io_uring interface for encoded writes
  btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
  btrfs: add extra error messages for delalloc range related errors
  btrfs: subpage: dump the involved bitmap when ASSERT() failed
  btrfs: subpage: fix the bitmap dump of the locked flags
  btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  ...
2025-01-20 13:09:30 -08:00
Linus Torvalds
ed8fd8d5dd for-6.13-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmeI5MAACgkQxWXV+ddt
 WDtMIg//WwV2p/gQmFMfFu7txZQQrt6m6nMuq/2y0qAA+d2e9oXFr3jNLlho0RwL
 hvSM5IEsxzL1ujh67tqxUBojvRjq+CRwiI6fxFO13dLIMrlKoiINtuOEiIhgZr/Q
 K+weXC4yHCFSB4q82dcr5i++/GVdY8wBoLHGe4zdJ0TVtM0xHrYVQ/4PXAGMX97V
 Bmv038aAMpRJTxJ6rcObloNxOb+WoyM0DIRbJ2UWVmy+bOLHMEa/jSkkQDWwcuPV
 UJjQgXz21FXiJqhtcndM58mEsKfe1WV9gG3lihedyb/xNeCPggqkDbfSzZU55Lkm
 vRXSgklbkifK6RTqrozg7e38MV+rZMwwNzqysMxNtlvrC5R+0A6of4t2Ny1vaeGG
 flcVncbLi/ip9J4fMLUpVS0fbAZjQgIJca6fbgM7Ajdzvm9rIDGpR078GW1zDlg2
 20f7isQWqSkj3i9kPElGbDT65RvcxR5EpJQfwr78+bsgSsyBGwghhAJ5F7vqZh5L
 wdmJOYxkwq5r1YxeuEQPVv404WPrwZE2Gyk6TGXkzpA+l6isCViY3NTyQPBOMH9y
 2GEelW61QSKtc5BPp1h85Aw74wtSCpGkCY+t/v03oQ6PBssWIsVcXZ7NnjJh9+72
 U+JY0H7UPyyczEzicEGlwWYBWs2/KeBuCrladNCkv3S5TWgB5yQ=
 =sF6v
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:

 - handle d_path() errors when canonicalizing device mapper paths during
   device scan

* tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add the missing error handling inside get_canonical_dev_path
2025-01-16 08:54:33 -08:00
Johannes Thumshirn
9d0c23db26 btrfs: selftests: add a selftest for deleting two out of three extents
Add a selftest creating three extents and then deleting two out of the
three extents.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:57:55 +01:00
Johannes Thumshirn
cfda28fb70 btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
Test creating a range of three RAID stripe-extents and then punch a hole
in the middle, deleting all of the middle extents and partially deleting
the "book ends".

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:56:40 +01:00
Johannes Thumshirn
27ae15b25b btrfs: selftests: add selftest for punching holes into the RAID stripe extents
Add a selftest for punching a hole into a RAID stripe extent. The test
create an 1M extent and punches a 64k bytes long hole at offset of 32k from
the start of the extent.

Afterwards it verifies the start and length of both resulting new extents
"left" and "right" as well as the absence of the hole.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:54:45 +01:00
Johannes Thumshirn
1d395c3926 btrfs: selftests: test RAID stripe-tree deletion spanning two items
Add a selftest for RAID stripe-tree deletion with a delete range spanning
two items, so that we're punching a hole into two adjacent RAID stripe
extents truncating the first and "moving" the second to the right.

The following diagram illustrates the operation:

 |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
 |-----  keep  -----|--- drop ---|-----  keep  ----|

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:54:44 +01:00
Johannes Thumshirn
a0afdec255 btrfs: selftests: don't split RAID extents in half
The selftests for partially deleting the start or tail of RAID
stripe-extents split these extents in half.

This can hide errors in the calculation, so don't split the RAID
stripe-extents in half but delete the first or last 16K of the 64K
extents.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:52:30 +01:00
Johannes Thumshirn
d44d3d724b btrfs: selftests: check for correct return value of failed lookup
Commit 5e72aabc1f ("btrfs: return ENODATA in case RST lookup fails")
changed btrfs_get_raid_extent_offset()'s return value to ENODATA in case
the RAID stripe-tree lookup failed.

Adjust the test cases which check for absence of a given range to check
for ENODATA as return value in this case.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:52:30 +01:00
Johannes Thumshirn
dc14ba1078 btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
Don't use btrfs_set_item_key_safe() to modify the keys in the RAID
stripe-tree, as this can lead to corruption of the tree, which is caught
by the checks in btrfs_set_item_key_safe():

 BTRFS info (device nvme1n1): leaf 49168384 gen 15 total ptrs 194 free space 8329 owner 12
 BTRFS info (device nvme1n1): refs 2 lock_owner 1030 current 1030
  [ snip ]
  item 105 key (354549760 230 20480) itemoff 14587 itemsize 16
                  stride 0 devid 5 physical 67502080
  item 106 key (354631680 230 4096) itemoff 14571 itemsize 16
                  stride 0 devid 1 physical 88559616
  item 107 key (354631680 230 32768) itemoff 14555 itemsize 16
                  stride 0 devid 1 physical 88555520
  item 108 key (354717696 230 28672) itemoff 14539 itemsize 16
                  stride 0 devid 2 physical 67604480
  [ snip ]
 BTRFS critical (device nvme1n1): slot 106 key (354631680 230 32768) new key (354635776 230 4096)
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.c:2602!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
 CPU: 1 UID: 0 PID: 1055 Comm: fsstress Not tainted 6.13.0-rc1+ #1464
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
 RIP: 0010:btrfs_set_item_key_safe+0xf7/0x270
 Code: <snip>
 RSP: 0018:ffffc90001337ab0 EFLAGS: 00010287
 RAX: 0000000000000000 RBX: ffff8881115fd000 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000ffffffff
 RBP: ffff888110ed6f50 R08: 00000000ffffefff R09: ffffffff8244c500
 R10: 00000000ffffefff R11: 00000000ffffffff R12: ffff888100586000
 R13: 00000000000000c9 R14: ffffc90001337b1f R15: ffff888110f23b58
 FS:  00007f7d75c72740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa811652c60 CR3: 0000000111398001 CR4: 0000000000370eb0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x14/0x1a
  ? die+0x2e/0x50
  ? do_trap+0xca/0x110
  ? do_error_trap+0x65/0x80
  ? btrfs_set_item_key_safe+0xf7/0x270
  ? exc_invalid_op+0x50/0x70
  ? btrfs_set_item_key_safe+0xf7/0x270
  ? asm_exc_invalid_op+0x1a/0x20
  ? btrfs_set_item_key_safe+0xf7/0x270
  btrfs_partially_delete_raid_extent+0xc4/0xe0
  btrfs_delete_raid_extent+0x227/0x240
  __btrfs_free_extent.isra.0+0x57f/0x9c0
  ? exc_coproc_segment_overrun+0x40/0x40
  __btrfs_run_delayed_refs+0x2fa/0xe80
  btrfs_run_delayed_refs+0x81/0xe0
  btrfs_commit_transaction+0x2dd/0xbe0
  ? preempt_count_add+0x52/0xb0
  btrfs_sync_file+0x375/0x4c0
  do_fsync+0x39/0x70
  __x64_sys_fsync+0x13/0x20
  do_syscall_64+0x54/0x110
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f7d7550ef90
 Code: <snip>
 RSP: 002b:00007ffd70237248 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f7d7550ef90
 RDX: 000000000000013a RSI: 000000000040eb28 RDI: 0000000000000004
 RBP: 000000000000001b R08: 0000000000000078 R09: 00007ffd7023725c
 R10: 00007f7d75400390 R11: 0000000000000202 R12: 028f5c28f5c28f5c
 R13: 8f5c28f5c28f5c29 R14: 000000000040b520 R15: 00007f7d75c726c8
  </TASK>

While the root cause of the tree order corruption isn't clear, using
btrfs_duplicate_item() to copy the item and then adjusting both the key
and the per-device physical addresses is a safe way to counter this
problem.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:52:22 +01:00
Johannes Thumshirn
6aa0e7cc56 btrfs: implement hole punching for RAID stripe extents
If the stripe extent we want to delete starts before the range we want to
delete and ends after the range we want to delete we're punching a
hole in the stripe extent:

  |--- RAID Stripe Extent ---|
  | keep |--- drop ---| keep |

This means we need to a) truncate the existing item and b)
create a second item for the remaining range.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:52:13 +01:00
Johannes Thumshirn
7664311904 btrfs: fix deletion of a range spanning parts two RAID stripe extents
When a user requests the deletion of a range that spans multiple stripe
extents and btrfs_search_slot() returns us the second RAID stripe extent,
we need to pick the previous item and truncate it, if there's still a
range to delete left, move on to the next item.

The following diagram illustrates the operation:

 |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
        |--- keep  ---|--- drop ---|

While at it, comment the trivial case of a whole item delete as well.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:51:24 +01:00
Johannes Thumshirn
50cae2ca69 btrfs: fix tail delete of RAID stripe-extents
Fix tail delete of RAID stripe-extents, if there is a range to be deleted
as well after the tail delete of the extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:49:10 +01:00
Johannes Thumshirn
a678543e60 btrfs: fix front delete range calculation for RAID stripe extents
When deleting the front of a RAID stripe-extent the delete code
miscalculates the size on how much to pad the remaining extent part in the
front.

Fix the calculation so we're always having the sizes we expect.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:49:04 +01:00
Johannes Thumshirn
5a0e38eab7 btrfs: assert RAID stripe-extent length is always greater than 0
When modifying a RAID stripe-extent, ASSERT() that the length of the new
RAID stripe-extent is always greater than 0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:48:56 +01:00
Johannes Thumshirn
9257d8632a btrfs: don't try to delete RAID stripe-extents if we don't need to
Even if the RAID stripe-tree is not enabled in the filesystem,
do_free_extent_accounting() still calls into btrfs_delete_raid_extent().

Check if the extent in question is on a block-group that has a profile
which is used by RAID stripe-tree before attempting to delete a stripe
extent. Return early if it doesn't, otherwise we're doing a unnecessary
search.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:48:46 +01:00
Johannes Thumshirn
c221a9a29d btrfs: selftests: correct RAID stripe-tree feature flag setting
RAID stripe-tree is an incompatible feature not a read-only compatible, so
set the incompat flag not a compat_ro one in the selftest code.

Subsequent changes in btrfs_delete_raid_extent() will start checking for
this flag.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 15:47:54 +01:00
Qu Wenruo
fe4de594f7 btrfs: add the missing error handling inside get_canonical_dev_path
Inside function get_canonical_dev_path(), we call d_path() to get the
final device path.

But d_path() can return error, and in that case the next strscpy() call
will trigger an invalid memory access.

Add back the missing error handling for d_path().

Reported-by: Boris Burkov <boris@bur.io>
Fixes: 7e06de7c83 ("btrfs: canonicalize the device path before adding it")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 21:39:52 +01:00
Mark Harmstone
e32dcdb0af btrfs: add io_uring interface for encoded writes
Add an io_uring interface for encoded writes, with the same parameters
as the BTRFS_IOC_ENCODED_WRITE ioctl.

As with the encoded reads code, there's a test program for this at
https://github.com/maharmstone/io_uring-encoded, and I'll get this
worked into an fstest.

How io_uring works is that it initially calls btrfs_uring_cmd with the
IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in
a kthread with the flag cleared.

Ideally we'd honour this and call try_lock etc., but there's still a lot
of work to be done to create non-blocking versions of all the functions
in our write path. Instead, just validate the input in
btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a
view to properly optimizing the optimistic path later on.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 21:06:31 +01:00
Qu Wenruo
bf50aca633 btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
The function btrfs_cleanup_ordered_extents() is only called in error
handling path, and the last caller with a @locked_folio parameter was
removed to fix a bug in the btrfs_run_delalloc_range() error handling.

There is no need to pass @locked_folio parameter anymore.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 16:00:50 +01:00
Qu Wenruo
975a6a8855 btrfs: add extra error messages for delalloc range related errors
All the error handling bugs I hit so far are all -ENOSPC from either:

- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()

Previously when those functions failed, there was no error message at
all, making the debugging much harder.

So here we introduce extra error messages for:

- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed

One example of the new debug error messages is the following one:

  run fstests generic/750 at 2024-12-08 12:41:41
  BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
  BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
  BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
  BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
  BTRFS info (device dm-4): using free-space-tree
  BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
  BTRFS info (device dm-4): checking UUID tree
  BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
  BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
  BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28

Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:59:35 +01:00
Qu Wenruo
61d730731b btrfs: subpage: dump the involved bitmap when ASSERT() failed
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
bitmap_test_range_all_zero() to ensure the involved range has no
dirty/lock bit already set.

However with my recent enhanced delalloc range error handling, I was
hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out
that some error handling path is not properly updating the folio flags.

So add some extra dumping for the ASSERTs to dump the involved bitmap
to help debug.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:57:51 +01:00
Qu Wenruo
396294d1af btrfs: subpage: fix the bitmap dump of the locked flags
We're dumping the locked bitmap into the @checked_bitmap variable,
printing incorrect values during debug.

Thankfully even during my development I haven't hit a case where I need
to dump the locked bitmap.  But for the sake of consistency, fix it by
dupping the locked bitmap into @locked_bitmap variable for output.

Fixes: 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for debug")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:53:47 +01:00
Qu Wenruo
c2b47df81c btrfs: do proper folio cleanup when run_delalloc_nocow() failed
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():

  BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
  BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
  page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
  aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
  flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
  page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
  ------------[ cut here ]------------
  kernel BUG at mm/page-writeback.c:2992!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G           OE      6.12.0-rc7-custom+ #87
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : folio_clear_dirty_for_io+0x128/0x258
  lr : folio_clear_dirty_for_io+0x128/0x258
  Call trace:
   folio_clear_dirty_for_io+0x128/0x258
   btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
   __process_folios_contig+0x154/0x268 [btrfs]
   extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
   run_delalloc_nocow+0x5f8/0x760 [btrfs]
   btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
   writepage_delalloc+0x230/0x4c8 [btrfs]
   extent_writepage+0xb8/0x358 [btrfs]
   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
   btrfs_writepages+0x94/0x150 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x88/0xc8
   start_delalloc_inodes+0x178/0x3a8 [btrfs]
   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
   shrink_delalloc+0x114/0x280 [btrfs]
   flush_space+0x250/0x2f8 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x164/0x408
   worker_thread+0x25c/0x388
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
  ---[ end trace 0000000000000000 ]---

[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().

E.g. we have the following dirtied range (4K blocksize 4K page size):

    0                 16K                  32K
    |//////////////////////////////////////|
    |  Pre-allocated  |

And the range [0, 16K) has a preallocated extent.

- Enter run_delalloc_nocow() for range [0, 16K)
  Which found range [0, 16K) is preallocated, can do the proper NOCOW
  write.

- Enter fallback_to_fow() for range [16K, 32K)
  Since the range [16K, 32K) is not backed by preallocated extent, we
  have to go COW.

- cow_file_range() failed for range [16K, 32K)
  So cow_file_range() will do the clean up by clearing folio dirty,
  unlock the folios.

  Now the folios in range [16K, 32K) is unlocked.

- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
  Which is called with PAGE_START_WRITEBACK to start page writeback.
  But folios can only be marked writeback when it's properly locked,
  thus this triggered the VM_BUG_ON_FOLIO().

Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().

[FIX]
- Clear folio dirty for range [@start, @cur_offset)
  Introduce a helper, cleanup_dirty_folios(), which
  will find and lock the folio in the range, clear the dirty flag and
  start/end the writeback, with the extra handling for the
  @locked_folio.

- Introduce a helper to clear folio dirty, start and end writeback

- Introduce a helper to record the last failed COW range end
  This is to trace which range we should skip, to avoid double
  unlocking.

- Skip the failed COW range for the error handling

CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:52:17 +01:00
Qu Wenruo
06f3642847 btrfs: do proper folio cleanup when cow_file_range() failed
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.

This BUG_ON() happens just after a failed btrfs_run_delalloc_range():

  BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/extent_io.c:1444!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G           OE      6.12.0-rc7-custom+ #86
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : extent_writepage_io+0x2d4/0x308 [btrfs]
  lr : extent_writepage_io+0x2d4/0x308 [btrfs]
  Call trace:
   extent_writepage_io+0x2d4/0x308 [btrfs]
   extent_writepage+0x218/0x330 [btrfs]
   extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
   btrfs_writepages+0x94/0x150 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x88/0xc8
   start_delalloc_inodes+0x180/0x3b0 [btrfs]
   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
   shrink_delalloc+0x114/0x280 [btrfs]
   flush_space+0x250/0x2f8 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x164/0x408
   worker_thread+0x25c/0x388
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
  ---[ end trace 0000000000000000 ]---

[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.

Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.

For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:

   0        16K        32K       48K       64K
   |///////////////////////////////////////|
   |#######################################|

Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.

- Enter extent_writepage() for page 0

- Enter btrfs_run_delalloc_range() for range [0, 64K)

- Enter cow_file_range() for range [0, 64K)

- Function btrfs_reserve_extent() only reserved one 16K extent
  So we created extent map and ordered extent for range [0, 16K)

   0        16K        32K       48K       64K
   |////////|//////////////////////////////|
   |<- OE ->|##############################|

   And range [0, 16K) has its delalloc flag cleared.
   But since we haven't yet submit any bio, involved 4 pages are still
   dirty.

- Function btrfs_reserve_extent() returns with -ENOSPC
  Now we have to run error cleanup, which will clear all
  EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
  ranges:

   0        16K        32K       48K       64K
   |////////|                              |
   |        |                              |

  Note that range [0, 16K) still has its pages dirty.

- Some time later, writeback is triggered again for the range [0, 16K)
  since the page range still has dirty flags.

- btrfs_run_delalloc_range() will do nothing because there is no
  EXTENT_DELALLOC flag.

- extent_writepage_io() finds page 0 has no ordered flag
  Which falls into the COW fixup path, triggering the BUG_ON().

Unfortunately this error handling bug dates back to the introduction of
btrfs.  Thankfully with the abuse of COW fixup, at least it won't crash
the kernel.

[FIX]
Instead of immediately unlocking the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.

When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And the involved folios will be properly submitted, with their dirty
flags cleared during submission.

For the error path, it will be a little more complex:

- The range with ordered extent allocated (range (1))
  We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
  flags are cleaned up by
  btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().

  For folios we finish the IO (clear dirty, start writeback and
  immediately finish the writeback) and unlock the folios.

- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
  For both range (2) and range(3) the behavior is not changed.

Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.

CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:50:12 +01:00
Qu Wenruo
a7858d5c36 btrfs: fix error handling of submit_uncompressed_range()
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().

But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.

In that case there are 3 bugs in the error handling:

1) Double freeing for the same ordered extent
   This can lead to crash due to ordered extent double accounting

2) Start/end writeback without updating the subpage writeback bitmap

3) Unlock the folio without clear the subpage lock bitmap

Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already have the range set,
triggering an ASSERT().

[CAUSE]
Bug 1) happens in the following call chain:

  submit_uncompressed_range()
  |- run_delalloc_cow()
  |  |- cow_file_range()
  |     |- btrfs_reserve_extent()
  |        Failed with -ENOSPC or whatever error
  |
  |- btrfs_clean_up_ordered_extents()
  |  |- btrfs_mark_ordered_io_finished()
  |     Which cleans all the ordered extents in the async_extent range.
  |
  |- btrfs_mark_ordered_io_finished()
     Which cleans the folio range.

The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.

So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.

Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.

Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), other than the helpers
which handle subpage cases.

[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.

And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().

So just let the btrfs_clean_up_ordered_extents() handle every folio
equally.

For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As the error handling protocol, cow_file_range() should clear
dirty flag and start/finish the writeback for the whole range passed in.

For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:36:06 +01:00
Qu Wenruo
8bf334beb3 btrfs: fix double accounting race when extent_writepage_io() failed
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit double ordered extent accounting error.

This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and has been pinned.

[CAUSE]
For example we have the following folio layout:

    0  4K          32K    48K   60K 64K
    |//|           |//////|     |///|

Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.

Now we hit the following sequence:

- submit_one_sector() returned 0 for [0, 4K)

- submit_one_sector() returned 0 for [32K, 48K)

- submit_one_sector() returned error for [60K, 64K)

- btrfs_mark_ordered_io_finished() called for the whole folio
  This will mark the following ranges as finished:
  * [0, 4K)
  * [32K, 48K)
    Both ranges have their IO already submitted, this cleanup will
    lead to double accounting.

  * [60K, 64K)
    That's the correct cleanup.

The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, other than reading it from the disk.

[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, instead
move the error handling inside extent_writepage_io().

So that we can cleanup exact sectors that ought to be submitted but failed.

This provides much more accurate cleanup, avoiding the double accounting.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:33:43 +01:00
Qu Wenruo
72dad8e377 btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)

  BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
  BTRFS info (device dm-3): checking UUID tree
  hrtimer: interrupt took 5451385 ns
  BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
  BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
  BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
  CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G           OE      6.13.0-rc1-custom+ #89
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
  lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
  Call trace:
   can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
   can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
   btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
   extent_writepage+0x10c/0x3b8 [btrfs]
   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
   btrfs_writepages+0x94/0x160 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x74/0xa0
   start_delalloc_inodes+0x17c/0x3b0 [btrfs]
   btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
   shrink_delalloc+0x11c/0x280 [btrfs]
   flush_space+0x288/0x328 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x228/0x680
   worker_thread+0x1bc/0x360
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  ---[ end trace 0000000000000000 ]---
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
  CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W  OE      6.13.0-rc1-custom+ #89
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
  pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : process_one_work+0x110/0x680
  lr : worker_thread+0x1bc/0x360
  Call trace:
   process_one_work+0x110/0x680 (P)
   worker_thread+0x1bc/0x360 (L)
   worker_thread+0x1bc/0x360
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Oops: Fatal exception
  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 2-3
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Kernel Offset: 0x275bb9540000 from 0xffff800080000000
  PHYS_OFFSET: 0xffff8fbba0000000
  CPU features: 0x100,00000070,00801250,8201720b

[CAUSE]
The above warning is triggered immediately after the delalloc range
failure, this happens in the following sequence:

- Range [1568K, 1636K) is dirty

   1536K  1568K     1600K    1636K  1664K
   |      |/////////|////////|      |

  Where 1536K, 1600K and 1664K are page boundaries (64K page size)

- Enter extent_writepage() for page 1536K

- Enter run_delalloc_nocow() with locked page 1536K and range
  [1568K, 1636K)
  This is due to the inode having preallocated extents.

- Enter cow_file_range() with locked page 1536K and range
  [1568K, 1636K)

- btrfs_reserve_extent() only reserved two extents
  The main loop of cow_file_range() only reserved two data extents,

  Now we have:

   1536K  1568K        1600K    1636K  1664K
   |      |<-->|<--->|/|///////|      |
               1584K  1596K
  Range [1568K, 1596K) has an ordered extent reserved.

- btrfs_reserve_extent() failed inside cow_file_range() for file offset
  1596K
  This is already a bug in our space reservation code, but for now let's
  focus on the error handling path.

  Now cow_file_range() returned -ENOSPC.

- btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE
  Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
  [1568K, 1636K)

  Function btrfs_cleanup_ordered_extents() normally needs to skip the
  ranges inside the folio, as it will normally be cleaned up by
  extent_writepage().

  Such split error handling is already problematic in the first place.

  What's worse is the folio range skipping itself, which is not taking
  subpage cases into consideration at all, it will only skip the range
  if the page start >= the range start.
  In our case, the page start < the range start, since for subpage cases
  we can have delalloc ranges inside the folio but not covering the
  folio.

  So it doesn't skip the page range at all.
  This means all the ordered extents, both [1568K, 1584K) and
  [1584K, 1596K) will be marked as IOERR.

  And these two ordered extents have no more pending ios, they are marked
  finished, and *QUEUED* to be deleted from the io tree.

- extent_writepage() do error cleanup
  Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).

  Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
  deletion from io tree is async, it may or may not happen at this
  time.

  If the ranges have not yet been removed, we will do double cleaning on
  those ranges, triggering the above ordered extent warnings.

In theory there are other bugs, like the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).

But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.

[FIX]
The folio range split is already buggy and not subpage compatible, it
was introduced a long time ago where subpage support was not even considered.

So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().

- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
  btrfs_run_delalloc_range()

- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
  failed

  So all ordered extents are only cleaned up by
  btrfs_run_delalloc_range().

- Handle the ranges that already have ordered extents allocated
  If part of the folio already has ordered extent allocated, and
  btrfs_run_delalloc_range() failed, we also need to cleanup that range.

Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().

Fixes: d1051d6ebf ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:26:23 +01:00
David Sterba
311473984c btrfs: async-thread: rename DFT_THRESHOLD to DEFAULT_THRESHOLD
Rename the macro so it's obvious what it means.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
ef8c0047aa btrfs: remove redundant variables from __process_folios_contig() and lock_delalloc_folios()
Same pattern in both functions, we really only use index, start_index is
redundant.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
248c4ff393 btrfs: split waiting from read_extent_buffer_pages(), drop parameter wait
There are only 2 WAIT_* values left for wait parameter, we can encode
this to the function name if the waiting functionality is split.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
db9eef2ea8 btrfs: remove unused define WAIT_PAGE_LOCK for extent io
Last use was in the readahead code that got removed by f26c923860
("btrfs: remove reada infrastructure").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
f8e0b8f9c2 btrfs: unwrap folio locking helpers
Another conversion to folio API, use the folio locking directly instead
of back and forth page <-> folio conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
549a88acbe btrfs: change return type to bool type of check_eb_alignment()
The check function pattern is supposed to return true/false, currently
there's only one error code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
a43caf82a1 btrfs: switch grab_extent_buffer() to folios
Use the folio API, remove page_folio/folio_page conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
cc8f51a355 btrfs: rename btrfs_release_extent_buffer_pages() to mention folios
Continue page to folio updates, sync what the function does with it's
name.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
a722c72bef btrfs: open code __free_extent_buffer()
Using the kmem cache freeing directly is clear enough, we don't need to
wrap it.  All the users are in the same file.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
b6160baed3 btrfs: drop one time used local variable in end_bbio_meta_write()
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
075adeeb92 btrfs: make wait_on_extent_buffer_writeback() static inline
The simple helper can be inlined, no need for the separate function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
011a9a1f24 btrfs: use btrfs_inode in extent_writepage()
As extent_writepage() is internal helper we should use our inode type,
so change it from struct inode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
06de96faf7 btrfs: rename __get_extent_map() and pass btrfs_inode
The double underscore naming scheme does not apply here, there's only
only get_extent_map(). As the definition is changed also pass the struct
btrfs_inode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
3a1c46dbc9 btrfs: open code set_page_extent_mapped()
The function set_page_extent_mapped() is now a simple wrapper so use the
folio helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
2b41599bff btrfs: rename __unlock_for_delalloc() and drop underscores
Drop the leading underscores in '__unlock_for_delalloc()' and rename it
to 'unlock_delalloc_folio()'. This also ensures naming parity with
'lock_delalloc_folios()'.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
2a1e8378dc btrfs: use SECTOR_SIZE defines in btrfs_issue_discard()
Use the existing define for single sector size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
6d67ff1c0b btrfs: remove stray comment about SRCU
The subvol_srcu was removed in c75e839414 ("btrfs: kill the
subvol_srcu") years ago.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
David Sterba
5f14eb12a3 btrfs: drop unused parameter fs_info to btrfs_delete_delayed_insertion_item()
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Jing Xia
2fa07d7a0f btrfs: pass write-hint for buffered IO
Commit 449813515d ("block, fs: Restore the per-bio/request data
lifetime fields") restored write-hint support in btrfs. But that is
applicable only for direct IO. This patch supports passing
write-hint for buffered IO from btrfs file system to block layer
by filling bi_write_hint of struct bio in alloc_new_bio().

There's an ongoing discussion which devices can use that,
https://lore.kernel.org/all/20240910150200.6589-6-joshi.k@samsung.com,
in SCSI there's support using sd_group_number() and
sd_setup_rw32_cmnd().

The hint goes from the application directly to the block device so it's
up to the application to set up everything properly to utilize the
different hint classes.

Link: https://lore.kernel.org/all/20240910150200.6589-6-joshi.k@samsung.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jing Xia <j.xia@samsung.com>
[ Add more context and use case. ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
3681dbe0af btrfs: print read policy on module load
Print the read read policy if set as module parameter (with
CONFIG_BTRFS_EXPERIMENTAL).

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
e426286cfa btrfs: configure read policy via module parameter
For testing purposes allow to configure the read policy via module
parameter from the beginning. Available only with CONFIG_BTRFS_EXPERIMENTAL

Examples:

- Set the RAID1 balancing method to round-robin with a custom
  min_contig_read of 4k:
  $ modprobe btrfs read_policy=round-robin:4096

- Set the round-robin balancing method with the default
  min_contiguous_read:
  $ modprobe btrfs read_policy=round-robin

- Set the "devid" balancing method, defaulting to the latest device:
  $ modprobe btrfs read_policy=devid

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
bb4715e967 btrfs: print status of experimental mode when loading module
Commit c9c49e8f157e ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from
CONFIG_BTRFS_DEBUG") introduces a way to enable or disable experimental
features, print its status during module load, like:

  Btrfs loaded, experimental=on, debug=on, assert=on, zoned=yes, fsverity=yes

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
c86aae73bd btrfs: add read policy to set a preferred device
Add read policy that will force all reads to go to the given device
(specified by devid) on the RAID1 profiles.

This will be used for testing, e.g. to read from stale device. Users may
find other use cases.

Can be set in sysfs, the value format is "devid:<devid>" to the file

  /sys/fs/btrfs/FSID/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
6d7a915495 btrfs: introduce RAID1 round-robin read balancing
Add round-robin read policy that balances reads over available devices
(all RAID1 block group profiles). Switch to the next devices is done
after a number of blocks is read, which is 256K by default and is
configurable in sysfs.

The format is "round-robin:<min-contig-read>" and can be set in file

  /sys/fs/btrfs/FSID/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
22fb0d99c9 btrfs: add tracking of read blocks for read policy
Track number of read blocks in the whole filesystem. The counter is
initialized when devices are opened. The counter is increased at
btrfs_submit_dev_bio() if the stats tracking is enabled (depends on the
read policy).  Stats tracking is disabled by default and is enabled
through fs_devices::collect_fs_stats when required.

The code is not under the EXPERIMENTAL define, as stats can be expanded
to include write counts and other performance counters, with the user
interface independent of its internal use.

This is an in-memory-only feature, not related to the dev error stats.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
b6bed20ed3 btrfs: sysfs: handle value associated with read balancing policy
Enable specifying additional configuration values along the RAID1
balancing read policy in a single input string.

Update btrfs_read_policy_to_enum() to parse and handle a value
associated with the policy in the format "policy:value", the value part
if present is converted to 64-bit integer.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
38cae63137 btrfs: sysfs: add btrfs_read_policy_to_enum() helper and refactor read policy store
Introduce btrfs_read_policy_to_enum() helper to simplify the conversion
of a string read policy to its corresponding enum value. This reduces
duplication and improves code clarity in btrfs_read_policy_store().

The parameter is copied locally to allow modification, enabling the
separation of the method and its value. This prepares for the addition of
more functionality in subsequent patches.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
83be7f8b9c btrfs: sysfs: refactor output formatting in btrfs_read_policy_show()
Refactor the logic in btrfs_read_policy_show() for easier extension with
more balancing methods.  Streamline the space and bracket handling
around the active policy without altering the functional output.  This
is in preparation to add more methods.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Anand Jain
a5019b7070 btrfs: initialize fs_devices->fs_info earlier in btrfs_init_devices_late()
Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(),
but this occurs too late for find_live_mirror(), which is invoked by
load_super_root() much earlier than btrfs_init_devices_late().

Fix this by moving the initialization to open_ctree(), before load_super_root().

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
74973b45a6 btrfs: xattr: remove unnecessary call to btrfs_mark_buffer_dirty()
The call to btrfs_mark_buffer_dirty() at btrfs_setxattr() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.

This just makes the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
1ca4e15f41 btrfs: volumes: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
c9a4390707 btrfs: uuid-tree: remove unnecessary call to btrfs_mark_buffer_dirty()
The call to btrfs_mark_buffer_dirty() at btrfs_uuid_tree_add() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1 (through btrfs_insert_empty_item()).

This just makes the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
65733e8d6c btrfs: root-tree: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
5a8293a1cc btrfs: relocation: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
bdf1660b22 btrfs: raid-stripe-tree: remove unnecessary call to btrfs_mark_buffer_dirty()
The call to btrfs_mark_buffer_dirty() at update_raid_extent_item() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.

This just makes the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
d74a36f37e btrfs: qgroup: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
bd25bf9dcd btrfs: ioctl: remove unnecessary call to btrfs_mark_buffer_dirty()
The call to btrfs_mark_buffer_dirty() at btrfs_ioctl_default_subvol() is
not necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.

This just makes the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Filipe Manana
212e5f5cb8 btrfs: inode-item: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
5c7763312c btrfs: inode: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
038d6999ec btrfs: free-space-cache: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
5e887b5071 btrfs: file-item: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
49c318e4f7 btrfs: file: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
4866812020 btrfs: dir-item: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
7caa86c44b btrfs: dev-replace: remove unnecessary call to btrfs_mark_buffer_dirty()
The call to btrfs_mark_buffer_dirty() at btrfs_run_dev_replace() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.

This just makes the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
a81ae6c31d btrfs: delayed-inode: remove unnecessary call to btrfs_mark_buffer_dirty()
The call to btrfs_mark_buffer_dirty() at __btrfs_update_delayed_inode() is
not necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.

This just makes the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
ca9d907645 btrfs: block-group: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
1440fd2757 btrfs: extent-tree: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
63eb222387 btrfs: free-space-tree: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
ike btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
8787c36c63 btrfs: tree-log: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Filipe Manana
097a7eef61 btrfs: uncollapse transaction aborts during renames
During renames we are grouping transaction aborts that can be due to a
failure of one of several function calls. While this makes the code less
verbose, it makes it harder to debug as we end up not knowing from which
function call we got an error.

So change this to trigger a transaction abort after each function call
failure, so that when we get a transaction abort message we know exactly
which function call failed, helping us to debug issues.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Qu Wenruo
2a9bb78cfd btrfs: validate system chunk array at btrfs_validate_super()
Currently btrfs_validate_super() only does a very basic check on the
array chunk size (not too large than the available space, but not too
small to contain no chunk).

The more comprehensive checks (the regular chunk checks and size check
inside the system chunk array) are all done inside btrfs_read_sys_array().

It's not a big deal, but it also means we do not do any validation on
the system chunk array at super block writeback time either.

Do the following modification to centralize the system chunk array
checks into btrfs_validate_super():

- Make chunk_err() helper accept stack chunk pointer
  If @leaf parameter is NULL, then the @chunk pointer will be a pointer
  to the chunk item, other than the offset inside the leaf.

  And since @leaf can be NULL, add a new @fs_info parameter for that
  case.

- Make btrfs_check_chunk_valid() handle stack chunk pointer
  The same as chunk_err(), a new @fs_info parameter, and if @leaf is
  NULL, then @chunk will be a pointer to a stack chunk.

  If @chunk is NULL, then all needed btrfs_chunk members will be read
  using the stack helper instead of the leaf helper.
  This means we need to read out all the needed member at the beginning
  of the function.

  Furthermore, at super block read time, fs_info->sectorsize is not yet
  initialized, we need one extra @sectorsize parameter to grab the
  correct sectorsize.

- Introduce a helper validate_sys_chunk_array()
  * Validate the disk key.
  * Validate the size before we access the full chunk items.
  * Do the full chunk item validation.

- Call validate_sys_chunk_array() at btrfs_validate_super()

- Simplify the checks inside btrfs_read_sys_array()
  Now the checks will be converted to an ASSERT().

- Simplify the checks inside read_one_chunk()
  Now that all chunk items inside system chunk array and chunk tree are
  verified, there is no need to verify them again inside read_one_chunk().

This change has the following advantages:

- More comprehensive checks at write time
  And unlike the sys_chunk_array read routine, this time we do not need
  to allocate a dummy extent buffer to do the check.
  All the checks done here require no new memory allocation.

- Slightly improved readability when iterating the system chunk array

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Roger L. Beckermeyer III
4e4d058e21 btrfs: update tree_insert() to use rb helpers
Update tree_insert() to use rb_find_add_cached().
add cmp_refs_node in rb_find_add_cached() to compare.

Since we're here, also make comp_data_refs() and comp_refs() accept
both parameters as const.

Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Roger L. Beckermeyer III
287373c701 btrfs: update btrfs_add_chunk_map() to use rb helpers
Update btrfs_add_chunk_map() to use rb_find_add_cached().

Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Roger L. Beckermeyer III
0877597dc3 btrfs: update __btrfs_add_delayed_item() to use rb helper
Update __btrfs_add_delayed_item() to use rb_find_add_cached().

Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Roger L. Beckermeyer III
14ae60c712 btrfs: update prelim_ref_insert() to use rb helpers
Update prelim_ref_insert() to use rb_find_add_cached().

There is a special change that the existing prelim_ref_compare() is
called with the first parameter as the existing ref in the rbtree.

But the newer rb_find_add_cached() expects the cmp() function to have
the first parameter as the to-be-added node, thus the new helper
prelim_ref_rb_add_cmp() need to adapt this new order.

Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Roger L. Beckermeyer III
372484f2c2 btrfs: update btrfs_add_block_group_cache() to use rb helper
Update fs/btrfs/block-group.c to use rb_find_add_cached().

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Wolfram Sang
57e421867b btrfs: don't include linux/rwlock_types.h directly
The header clearly states that it does not want to be included directly,
only via linux/spinlock_types.h. Drop this as we can simply use the
spinlock.h which is already included.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Qu Wenruo
882af9f13e btrfs: handle free space tree rebuild in multiple transactions
During free space tree rebuild, we're holding a transaction handle for
the whole rebuild process.

This can lead to blocked task warning, e.g. btrfs-transaction kthread
(which is already created before btrfs_start_pre_rw_mount()) can be
waked up to join and commit the current transaction.

But the free space tree rebuild process may need to go through thousands
block groups, this will block btrfs-transaction kthread for a long time.

Fix the problem by calling btrfs_should_end_transaction() after each
block group, so that we won't hold the transaction handle too long.

And since the free-space-tree rebuild can be split into
multiple transactions, we need to consider the safety when the rebuild
process is interrupted.

Thankfully since we only set the FREE_SPACE_TREE compat_ro flag without
FREE_SPACE_TREE_VALID flag, even if the rebuild is interrupted, on the
next RW mount, we will still go rebuild the free space tree, by deleting
any items we have and re-starting the rebuild from scratch.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Filipe Manana
6a2b3d7a36 btrfs: use uuid_is_null() to verify if an uuid is empty
At btrfs_is_empty_uuid() we have our custom code to check if an uuid is
empty, however there a kernel uuid library that has a function named
uuid_is_null() which does the same and probably more efficient.

So change btrfs_is_empty_uuid() to use uuid_is_null(), which is almost
a directly replacement, it just wraps the necessary casting since our
uuid types are u8 arrays while the uuid kernel library uses the uuid_t
type, which is just a typedef of an u8 array of 16 elements as well.

Also since the function is now to trivial, make it a static inline
function in fs.h.

Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
de9c8265b7 btrfs: remove pointless comment from ctree.h
It's pointless to have a comment above the prototype declarations of
btrfs_ctree_init() and btrfs_ctree_exit() mentioning that they are
declared in ctree.c. This is from the old days when ctree.h was used
to place anything that didn't fit in any other file. So remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
07174a3429 btrfs: move extent-tree function declarations out of ctree.h
We have 3 functions that have their prototypes declared in ctree.h but
they are defined at extent-tree.c and they are unrelated to the btree
data structure. Move the prototypes out of ctree.h and into extent-tree.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
378f25d3fc btrfs: move btrfs_alloc_write_mask() into fs.h
Currently btrfs_alloc_write_mask() is defined in ctree.h but it's not
related at all to the btree data structure, so move it into fs.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
a4545b74e2 btrfs: move BTRFS_BYTES_TO_BLKS() into fs.h
Currently BTRFS_BYTES_TO_BLKS() is defined in ctree.h but it's not related
at all to the btree data structure, so move it into fs.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
2205302298 btrfs: move the folio ordered helpers from ctree.h into fs.h
The folio ordered helper macros are defined at ctree.h but this is not
the best place since ctree.{h,c} is all about the btree data structure
implementation and not a generic module. So move these macros into the
fs.h header.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
a5b3f117da btrfs: move btrfs_is_empty_uuid() from ioctl.c into fs.c
It's a generic helper not specific to ioctls and used in several places,
so move it out from ioctl.c and into fs.c. While at it change its return
type from int to bool and declare the loop variable in the loop itself.

This also slightly reduces the module's size.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1781492	 161037	  16920	1959449	 1de619	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1781340	 161037	  16920	1959297	 1de581	fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
0b93369104 btrfs: move the exclusive operation functions into fs.c
The declarations for the exclusive operation functions are located at fs.h
but their definitions are in ioctl.c, which doesn't make much sense since
(most of them) are used in several files other than ioctl.c. Since they
are used in several files and they are generic enough, move them out of
ioctl.c and into fs.c, even the ones that are currently only used at
ioctl.c, for the sake of having them all in the same C file.

This also reduces the module's size.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1782094	 161045	  16920	1960059	 1de87b	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1781492	 161037	  16920	1959449	 1de619	fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
a6f0bcf9b1 btrfs: move csum related functions from ctree.c into fs.c
The ctree module is about the implementation of the btree data structure
and not a place holder for generic filesystem things like the csum
algorithm details. Move the functions related to the csum algorithm
details away from ctree.c and into fs.c, which is a far better place for
them. Also fix missing punctuation in comments and change one multiline
comment to a single line comment since everything fits in under 80
characters.

For some reason this also slightly reduces the module's size.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1782126	 161045	  16920	1960091	 1de89b	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1782094	 161045	  16920	1960059	 1de87b	fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Filipe Manana
b815a78e17 btrfs: move abort_should_print_stack() to transaction.h
The function abort_should_print_stack() is declared in transaction.h but
its definition is in ctree.c, which doesn't make sense since ctree.c is
the btree implementation and the function is related to the transaction
code. Move its definition into transaction.h as an inline function since
it's a very short and trivial function, and also add the 'btrfs_' prefix
into its name.

This change also reduces the module size.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1783148	 161137	  16920	1961205	 1decf5	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1782126	 161045	  16920	1960091	 1de89b	fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:17 +01:00
Johannes Thumshirn
63e5f9df7c btrfs: pass btrfs_io_geometry to is_single_device_io
Now that we have the stripe tree decision saved in struct
btrfs_io_geometry we can pass it into is_single_device_io() and get rid of
another call to btrfs_need_raid_stripe_tree_update().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Johannes Thumshirn
9c48bcec47 btrfs: cache RAID stripe tree decision in btrfs_io_context
Cache the decision if a particular I/O needs to update RAID stripe tree
entries in struct btrfs_io_context.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Johannes Thumshirn
68ab9825a6 btrfs: cache stripe tree usage in struct btrfs_io_geometry
Cache the return of btrfs_need_stripe_tree_update() in struct
btrfs_io_geometry starting from btrfs_map_block().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
88694f74f4 btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()
We should always call check_delayed_ref() with a path having a locked leaf
from the extent tree where either the extent item is located or where it
should be located in case it doesn't exist yet (when there's a pending
unflushed delayed ref to do it), as we need to lock any existing delayed
ref head while holding such leaf locked in order to avoid races with
flushing delayed references, which could make us think an extent is not
shared when it really is.

So add some assertions and a comment about such expectations to
btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
2747c55595 btrfs: add function comment for check_committed_ref()
There are some not immediately obvious details about the operation of
check_committed_ref(), namely that when it returns 0 it must return with
the path having a locked leaf from the extent tree that contains the
extent's extent item, so that we can later check for delayed refs when
calling check_delayed_ref() in a way that doesn't race with a task running
delayed references. For similar reasons, it must also return with a locked
leaf when the extent item is not found, and that leaf is where the extent
item should be located, because we may have delayed references that are
going to create the extent item. Also document that the function can
return false positives in order to not be too slow, and that the most
important is to not return false negatives.

So add a function comment to check_committed_ref().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
9e0d43ea4e btrfs: simplify arguments for btrfs_cross_ref_exist()
Instead of passing a root and an objectid which matches an inode number,
pass the inode instead, since the root is always the root associated to
the inode and the objectid is the number of that inode.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
adf7da3f26 btrfs: simplify return logic at check_committed_ref()
Instead of setting the value to return in a local variable 'ret' and then
jumping into a label named 'out' that does nothing but return that value,
simplify everything by getting rid of the label and directly returning a
value.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
78cdfba85d btrfs: avoid redundant call to get inline ref type at check_committed_ref()
At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type()
twice, once before we check if have an inline extent owner ref (for simple
qgroups) and then once again sometime after that check. This second call
is redundant when we have simple quotas disabled or we found an inline ref
that is not of the owner ref type. So avoid this second call unless we
have simple quotas enabled and found an owner ref, saving a function call
that does inline ref validation again.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
4f000a87fb btrfs: remove the snapshot check from check_committed_ref()
At check_committed_ref() we have this check to see if the data extent was
created in a generation lower than or equals to the generation where the
last snapshot for the root was created, and if so we return immediately
with 1, since it's very likely the extent is shared, referenced by other
root.

The only call chain for check_committed_ref() is the following:

   can_nocow_file_extent()
      btrfs_cross_ref_exist()
         check_committed_ref()

And we already do that snapshot check at can_nocow_file_extent(), before
we call btrfs_cross_ref_exist(). This makes the check done at
check_committed_ref() redundant, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
6c44075524 btrfs: remove no longer needed strict argument from can_nocow_extent()
All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Johannes Thumshirn
4016358e85 btrfs: remove unused variable length in btrfs_insert_one_raid_extent()
Remove the variable length in btrfs_insert_one_raid_extent() as it is
unused.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Qu Wenruo
d0f038104f btrfs: output the reason for open_ctree() failure
There is a recent ML report that mounting a large fs backed by hardware
RAID56 controller (with one device missing) took too much time, and
systemd seems to kill the mount attempt.

In that case, the only error message is:

  BTRFS error (device sdj): open_ctree failed

There is no reason on why the failure happened, making it very hard to
understand the reason.

At least output the error number (in the particular case it should be
-EINTR) to provide some clue.

Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Qu Wenruo
a883120b2d btrfs: open-code btrfs_copy_from_user()
The function btrfs_copy_from_user() handles the folio dirtying for
buffered write. The original design is to allow that function to handle
multiple folios, but since commit c87c299776 ("btrfs: make buffered
write to copy one page a time") there is no need to support multiple
folios.

So here open-code btrfs_copy_from_user() to
copy_folio_from_iter_atomic() and flush_dcache_folio() calls.

The short-copy check and revert are still kept as-is.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Qu Wenruo
c0def46dec btrfs: improve the warning and error message for btrfs_remove_qgroup()
[WARNING]
There are several warnings about the recently introduced qgroup
auto-removal that it triggers WARN_ON() for the non-zero rfer/excl
numbers, e.g:

 ------------[ cut here ]------------
 WARNING: CPU: 67 PID: 2882 at fs/btrfs/qgroup.c:1854 btrfs_remove_qgroup+0x3df/0x450
 CPU: 67 UID: 0 PID: 2882 Comm: btrfs-cleaner Kdump: loaded Not tainted 6.11.6-300.fc41.x86_64 #1
 RIP: 0010:btrfs_remove_qgroup+0x3df/0x450
 Call Trace:
  <TASK>
  btrfs_qgroup_cleanup_dropped_subvolume+0x97/0xc0
  btrfs_drop_snapshot+0x44e/0xa80
  btrfs_clean_one_deleted_snapshot+0xc3/0x110
  cleaner_kthread+0xd8/0x130
  kthread+0xd2/0x100
  ret_from_fork+0x34/0x50
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---
 BTRFS warning (device sda): to be deleted qgroup 0/319 has non-zero numbers, rfer 258478080 rfer_cmpr 258478080 excl 0 excl_cmpr 0

[CAUSE]
Although the root cause is still unclear, as if qgroup is consistent a
fully dropped subvolume (with extra transaction committed) should lead
to all zero numbers for the qgroup.

My current guess is the subvolume drop triggered the new subtree drop
threshold thus marked qgroup inconsistent, then rescan cleared it but
some corner case is not properly handled during subvolume dropping.

But at least for this particular case, since it's only the rfer/excl not
properly reset to 0, and qgroup is already marked inconsistent, there is
nothing to be worried for the end users.

The user space tool utilizing qgroup would queue a rescan to handle
everything, so the kernel wanring is a little overkilled.

[ENHANCEMENT]
Enhance the warning inside btrfs_remove_qgroup() by:

- Only do WARN() if CONFIG_BTRFS_DEBUG is enabled
  As explained the kernel can handle inconsistent qgroups by simply do a
  rescan, there is nothing to bother the end users.

- Treat the reserved space leak the same as non-zero numbers
  By outputting the values and trigger a WARN() if it's a debug build.
  So far I haven't experienced any case related to reserved space so I
  hope we will never need to bother them.

Fixes: 839d6ea4f8 ("btrfs: automatically remove the subvolume qgroup")
Link: https://github.com/kdave/btrfs-progs/issues/922
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
f974bc3c9a btrfs: remove detached list from struct btrfs_backref_cache
We don't ever look at this list, remove it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
b61e0eb037 btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_node
Before we were keeping all of our nodes on various lists in order to
make sure everything got cleaned up correctly.  We used node->lowest to
indicate that node->lower was linked into the cache->leaves list.  Now
that we do cleanup based on the rb-tree both the list and the flag are
useless, so delete them both.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
29e74a12a3 btrfs: simplify btrfs_backref_release_cache()
We rely on finding all our nodes on the various lists in the backref
cache, when they are all also in the rbtree.  Instead just search
through the rbtree and free everything.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
4eb8064dc9 btrfs: do not handle non-shareable roots in backref cache
Now that we handle relocation for non-shareable roots without using the
backref cache, remove the ->cowonly field from the backref nodes and
update the handling to throw an error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
46bb6765d3 btrfs: don't build backref tree for COW-only blocks
We already determine the owner for any blocks we find when we're
relocating, and for COW-only blocks (and the data reloc tree) we COW
down to the block and call it good enough.  However we still build a
whole backref tree for them, even though we're not going to use it, and
then just don't put these blocks in the cache.

Rework the code to check if the block belongs to a COW-only root or the
data reloc root, and then just cow down to the block, skipping the
backref cache generation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
0097422c0d btrfs: remove clone_backref_node() from relocation
Since we no longer maintain backref cache across transactions, and this
is only called when we're creating the reloc root for a newly created
snapshot in the transaction critical section, we will end up doing a
bunch of work that will just get thrown away when we start the
transaction in the relocation loop.  Delete this code as it no longer
does anything for us.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
551d04a32a btrfs: simplify loop in select_reloc_root()
We have this setup as a loop, but in reality we will never walk back up
the backref tree, if we do then it's a bug.  Get rid of the loop and
handle the case where we have node->new_bytenr set at all.  Previous
check was only if node->new_bytenr != root->node->start, but if it did
then we would hit the WARN_ON() and walk back up the tree.

Instead we want to just return error if ->new_bytenr is set, and then do
the normal updating of the node for the reloc root and carry on.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
cb7de8ee9c btrfs: add a comment for new_bytenr in backref_cache_node
Add a comment for this field so we know what it is used for.  Previously
we used it to update the backref cache, so people may mistakenly think
it is useless, but in fact exists to make sure the backref cache makes
sense.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:15 +01:00
Josef Bacik
b1d4d5d1d8 btrfs: remove the changed list for backref cache
Now that we're not updating the backref cache when we switch transids we
can remove the changed list.

We're going to keep the new_bytenr field because it serves as a good
sanity check for the backref cache and relocation, and can prevent us
from making extent tree corruption worse.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Josef Bacik
6a4730b325 btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handling
This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree.  Fix it to be a proper error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Hao-ran Zheng
5324c4e10e btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()
A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` is not
locked when reading inode->disk_i_size, causing
`btrfs_inode_safe_disk_i_size_write()` to cause data competition when
writing inode->disk_i_size, thus affecting the value of `modify_tree`.

The specific call stack that appears during testing is as follows:

  ============DATA_RACE============
   btrfs_drop_extents+0x89a/0xa060 [btrfs]
   insert_reserved_file_extent+0xb54/0x2960 [btrfs]
   insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
   btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
   btrfs_finish_ordered_io+0x37/0x60 [btrfs]
   finish_ordered_fn+0x3e/0x50 [btrfs]
   btrfs_work_helper+0x9c9/0x27a0 [btrfs]
   process_scheduled_works+0x716/0xf10
   worker_thread+0xb6a/0x1190
   kthread+0x292/0x330
   ret_from_fork+0x4d/0x80
   ret_from_fork_asm+0x1a/0x30
  ============OTHER_INFO============
   btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
   btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
   btrfs_finish_ordered_io+0x37/0x60 [btrfs]
   finish_ordered_fn+0x3e/0x50 [btrfs]
   btrfs_work_helper+0x9c9/0x27a0 [btrfs]
   process_scheduled_works+0x716/0xf10
   worker_thread+0xb6a/0x1190
   kthread+0x292/0x330
   ret_from_fork+0x4d/0x80
   ret_from_fork_asm+0x1a/0x30
  =================================

The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
EOF, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks.

Since this data race does not affect the correctness of the function,
it is a harmless data race, use data_race() to check inode->disk_i_size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Johannes Thumshirn
f6f0da564c btrfs: don't BUG_ON() in btrfs_drop_extents()
btrfs_drop_extents() calls BUG_ON() in case the counter of to be deleted
extents is greater than 0. But all of these code paths can handle errors,
so there's no need to crash the kernel. Instead WARN() that the condition
has been met and gracefully bail out.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Naohiro Aota
453a73c306 btrfs: zoned: reclaim unused zone by zone resetting
On the zoned mode, once used and freed region is still not reusable after the
freeing. The underlying zone needs to be reset before reusing. Btrfs resets a
zone when it removes a block group, and then new block group is allocated on
the zones to reuse the zones. But, it is sometime too late to catch up with a
write side.

This commit introduces a new space-info reclaim method ZONE_RESET. That will
pick a block group from the unused list and reset its zone to reuse the
zone_unusable space. It is faster than removing the block group and re-creating
a new block group on the same zones.

For the first implementation, the ZONE_RESET is only applied to a block group
whose region is fully zone_unusable. Reclaiming partial zone_unusable block
group could be implemented later.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Naohiro Aota
7de9ca1f30 btrfs: drop fs_info argument from btrfs_update_space_info_*()
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have
the fs_info in a space_info. So, we can drop fs_info argument from
btrfs_update_space_info_*. There is no behavior change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Naohiro Aota
3704db1013 btrfs: factor out btrfs_return_free_space()
Factor out a part of unpin_extent_range() that returns space back to the
space info, prioritizing global block reserve.  Also, move the "len"
variable into the loop to clarify we don't need to carry it beyond an
iteration.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00