Commit graph

272 commits

Author SHA1 Message Date
Linus Torvalds
f3827213ab for-6.18-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjTk4MACgkQxWXV+ddt
 WDvOHA//ajYvH7DIoFgQ09Q+UCfdawhWs/b4aW2ePpNK061tF6hvGgmGVe/Ugy8W
 297kSBVxpnaLfedHkm3m91SAft6VKSfdV3oV2DNn9sxUXQoa9hC6n9qIaqeOpfd8
 Nk+OvgSWpqonAHHMbsNev4C+vKZO534VRg09eFfIV7ATpQO7wxc1DKXFT5hgYP3m
 nosRc0f/4gx0EGHjiXyfuG5una1A/vry4+EP7jrvzvKHY9VzYMLRXH+glNUi5X5E
 GOwFXd6ADUpKDKN9Ove/Bm4DSz9jrTNu81qm/1i1mTpxS80sxBFIrD4KOil+hQDX
 B82n01KS8yJkBYH32Qnpg+9Cij/ZR/0OOg88wBLGeQiDoDw7J8D9mJe1/RHWHHTC
 rQ1C50CDlVGIPpnB1BftbvvdYlAPKgpnnznaaKg9Mdy3T5FtFQ3MqwZYOW/jubtY
 Zo7shxrDjSvPb7MHG6GlLBNxZ8JXXGyc+seEfjZ8iiEeMGsE9vIQ1L18c0GZSmgc
 /m/nQV/akycoNg/9J84HqClGLUWUApdMPaXrvOwC5CjpgOgJZ+rdUqhexqcNwmsl
 O+s9fwQidtAr5fAgl6SjwqaPauqBd4VSybs7IkGbz+zyaZeRdWo5gsg5t5Hjuyd5
 gJiIAztzI8bOPI1T/EheGVwSkmJTEkhnJDQvMRQcpEpo5D5K3YY=
 =9wY+
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "There are no new features, the changes are in the core code, notably
  tree-log error handling and reporting improvements, and initial
  support for block size > page size.

  Performance improvements:

   - search data checksums in the commit root (previous transaction) to
     avoid locking contention, this improves parallelism of read
     heavy/low write workloads, and also reduces transaction commit
     time; on real and reproducer workload the sync time went from
     minutes to tens of seconds (workload and numbers are in the
     changelog)

  Core:

   - tree-log updates:
      - error handling improvements, transaction aborts
      - add new error state 'O' (printed in status messages) when log
        replay fails and is aborted
      - reduced number of btrfs_path allocations when traversing the
        tree

   - 'block size > page size' support
      - basic implementation with limitations, under experimental build
      - limitations: no direct io, raid56, encoded read (standalone and
        in send ioctl), encoded write
      - preparatory work for compression, removing implicit assumptions
        of page and block sizes
      - compression workspaces are now per-filesystem, we cannot assume
        common block size for work memory among different filesystems

   - tree-checker now verifies the INODE_EXTREF item (which implements
     hardlinks)

   - tree leaf pretty printer updates, some data was missing from the
     printed items and keys

   - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
     it's a debugging feature and does not need to be enabled separately

   - more struct btrfs_path auto free updates

   - use ref_tracker API for tracking delayed inodes, enabled by mount
     option 'ref_verify', allowing leaking references to be pinpointed
     more easily

   - in zoned mode, avoid selecting the data relocation zone for
     ordinary data block groups

   - updated and enhanced error messages

   - lots of cleanups and refactoring"

* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
  btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
  btrfs: add unlikely annotations to branches leading to transaction abort
  btrfs: add unlikely annotations to branches leading to EIO
  btrfs: add unlikely annotations to branches leading to EUCLEAN
  btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
  btrfs: zoned: don't fail mount needlessly due to too many active zones
  btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
  btrfs: enable experimental bs > ps support
  btrfs: add extra ASSERT()s to catch unaligned bios
  btrfs: fix symbolic link reading when bs > ps
  btrfs: prepare scrub to support bs > ps cases
  btrfs: prepare zlib to support bs > ps cases
  btrfs: prepare lzo to support bs > ps cases
  btrfs: prepare zstd to support bs > ps cases
  btrfs: prepare compression folio alloc/free for bs > ps cases
  btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
  btrfs: remove pointless key offset setup in create_pending_snapshot()
  btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
  btrfs: make the rule checking more readable for should_cow_block()
  btrfs: simplify inline extent end calculation at replay_one_extent()
  ...
2025-09-30 08:14:49 -07:00
Linus Torvalds
56e7b31071 vfs-6.18-rc1.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQQgAKCRCRxhvAZXjc
 oud9AQD5IG4sNnzCjsvcTDpQkbX5eZW+LFIiAiiN+nztZ+OcRQEAvC2N7YovfqM3
 TWpVoNDKvEPdtDc9ttFMUKqBZYvxvgE=
 =sEaL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
 "This contains a series I originally wrote and that Eric brought over
  the finish line. It moves out the i_crypt_info and i_verity_info
  pointers out of 'struct inode' and into the fs-specific part of the
  inode.

  So now the few filesystems that actually make use of this pay the price
  in their own private inode storage instead of forcing it upon every
  user of struct inode.

  The crypt and verity info pointers are simply found by storing an
  offset to their address in struct fsverity_operations and struct
  fscrypt_operations. This shrinks struct inode by 16 bytes.

  I hope to move a lot more out of it in the future so that struct inode
  becomes really just about very core stuff that we need, much like
  struct dentry and struct file, instead of the dumping ground it has
  become over the years.

  On top of this are various changes associated with the ongoing inode
  lifetime handling rework that multiple people are pushing forward:

   - Stop accessing inode->i_count directly in f2fs and gfs2. They
     simply should use the __iget() and iput() helpers

   - Make the i_state flags an enum

   - Rework the iput() logic

     Currently, if we are the last iput, and we have the I_DIRTY_TIME
     bit set, we will grab a reference on the inode again and then mark
     it dirty and then redo the put. This is to make sure we delay the
     time update for as long as possible

     We can rework this logic to simply decrement i_count if it is not 1,
     and if it is, do the time update while still holding the i_count
     reference

     Then we can replace the atomic_dec_and_lock() with locking the
     ->i_lock and doing atomic_dec_and_test(), since we did the
     atomic_add_unless() above (a sketch of this follows the list below)

   - Add an icount_read() helper and convert everyone that accesses
     inode->i_count directly for this purpose to use the helper

   - Expand dump_inode() to dump more information about an inode helping
     in debugging

   - Add some might_sleep() annotations to iput() and associated
     helpers"
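
A minimal sketch, based only on the description above, of how the reworked
final iput() logic could look (illustrative; the real fs/inode.c code has
more checks around I_DIRTY_TIME and the final-put paths):

	void iput(struct inode *inode)
	{
		if (!inode)
			return;
		might_sleep();

		/* Drop the reference unless we would be dropping the last one. */
		if (atomic_add_unless(&inode->i_count, -1, 1))
			return;

		/* Last reference: do the time update while still holding it. */
		if (inode->i_state & I_DIRTY_TIME)
			mark_inode_dirty_sync(inode);

		/* Replaces the old atomic_dec_and_lock(): i_count was 1 above,
		 * but it may have been raised again in the meantime. */
		spin_lock(&inode->i_lock);
		if (atomic_dec_and_test(&inode->i_count))
			iput_final(inode);	/* called with ->i_lock held */
		else
			spin_unlock(&inode->i_lock);
	}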

* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: add might_sleep() annotation to iput() and more
  fs: expand dump_inode()
  inode: fix whitespace issues
  fs: add an icount_read helper
  fs: rework iput logic
  fs: make the i_state flags an enum
  fs: stop accessing ->i_count directly in f2fs and gfs2
  fsverity: check IS_VERITY() in fsverity_cleanup_inode()
  fs: remove inode::i_verity_info
  btrfs: move verity info pointer to fs-specific part of inode
  f2fs: move verity info pointer to fs-specific part of inode
  ext4: move verity info pointer to fs-specific part of inode
  fsverity: add support for info in fs-specific part of inode
  fs: remove inode::i_crypt_info
  ceph: move crypt info pointer to fs-specific part of inode
  ubifs: move crypt info pointer to fs-specific part of inode
  f2fs: move crypt info pointer to fs-specific part of inode
  ext4: move crypt info pointer to fs-specific part of inode
  fscrypt: add support for info in fs-specific part of inode
  fscrypt: replace raw loads of info pointer with helper function
2025-09-29 09:42:30 -07:00
Qu Wenruo
ea77a1c1c7 btrfs: cache max and min order inside btrfs_fs_info
Inside btrfs_fs_info we cache several bit shifts, like sectorsize_bits.

Apply this to the max and min folio orders so that every time the mapping
order needs to be applied we can skip the calculation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:17 +02:00
Qu Wenruo
9afc617265 btrfs: introduce btrfs_bio_for_each_block() helper
Currently if we want to iterate a bio in block unit, we do something
like this:

	while (iter->bi_size) {
		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);

		/* Do something using the bv */

		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
	}

That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.

This means the code using that bv can only handle a block if bs <= ps.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block()
  Instead of bio_vec, which has single-page and multi-page versions (and
  the multi-page version has quite some limits), use my favorite way to
  represent a block: phys_addr_t.

  For bs <= ps cases, nothing is changed, except for a very small
  overhead to convert the phys_addr_t to a folio, then use the proper
  folio helpers to handle the possible highmem cases.

  For bs > ps cases, all blocks will be backed by large folios, meaning
  every folio will cover at least one block. And still use proper folio
  helpers to handle highmem cases.

  With phys_addr_t, we will handle both large folios and highmem
  properly. So there is no better single variable to represent a btrfs
  block than phys_addr_t.

- Extract the data block csum calculation into a helper
  The new helper, btrfs_calculate_block_csum() will be utilized by
  btrfs_csum_one_bio().

- Use btrfs_bio_for_each_block() to replace existing call sites
  Including:

  * index_one_bio() from raid56.c
    Very straight-forward.

  * btrfs_check_read_bio()
    Also update repair_one_sector() to grab the folio using phys_addr_t,
    and do extra checks to make sure the folio covers at least one
    block.
    We do not need to bother bv_len at all now.

  * btrfs_csum_one_bio()
    Now we can move the highmem handling into a dedicated helper,
    calculate_block_csum(), and use btrfs_bio_for_each_block() helper.

There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:17 +02:00
Qu Wenruo
35aff706dc btrfs: concentrate highmem handling for data verification
Currently for btrfs checksum verification, we do it in the following
pattern:

	kaddr = kmap_local_*();
	ret = btrfs_check_sector_csum(kaddr);
	kunmap_local(kaddr);

It's OK for now, but it's still not following the patterns of the helpers
inside linux/highmem.h, which never require a virtual memory address.

Those highmem helpers mostly accept a folio and some offset/length inside
the folio, and in the implementation they check if the folio needs a
partial kmap, and do the handling.

Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:

- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
  To follow the more common term "block" used in all other major
  filesystems.

- Pass a physical address into btrfs_check_block_csum() and
  btrfs_data_csum_ok()
  The physical address is always available even for a highmem page, since
  it's (page frame number << PAGE_SHIFT) + offset in page (see the sketch
  after this list).

  And with that physical address, we can grab the folio covering the
  page, and do extra checks to ensure it covers at least one block.

  This also allows us to do the kmap inside btrfs_check_block_csum().
  This means all the extra HIGHMEM handling will be concentrated into
  btrfs_check_block_csum(), and no callers will need to bother highmem
  by themselves.

- Properly zero out the block on csum mismatch
  Since btrfs_data_csum_ok() only gets a paddr, we cannot and should not
  use memzero_bvec(), which only accepts a single-page bvec.
  Instead use the paddr to grab the folio and call folio_zero_range().
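
A minimal sketch of the paddr-to-folio mapping described above, assuming
local variables paddr and blocksize (illustrative, not the exact btrfs
code):

	/* Map one block given only its physical address. */
	struct folio *folio = page_folio(pfn_to_page(PHYS_PFN(paddr)));
	size_t off = offset_in_folio(folio, paddr);
	void *kaddr;

	/* The folio must cover at least one full block. */
	ASSERT(off + blocksize <= folio_size(folio));

	kaddr = kmap_local_folio(folio, off);
	/* ... verify the data checksum of the block at kaddr ... */
	kunmap_local(kaddr);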

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
David Sterba
67e78f983e btrfs: convert several int parameters to bool
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping.  Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:32 +02:00
Filipe Manana
59a0dd4ab9 btrfs: fix race between setting last_dir_index_offset and inode logging
At inode_logged() if we find that the inode was not logged before we
update its ->last_dir_index_offset to (u64)-1 with the goal that the
next directory log operation will see the (u64)-1 and then figure out
it must check what was the index of the last logged dir index key and
update ->last_dir_index_offset to that key's offset (this is done in
update_last_dir_index_offset()).

This however has a possibility for a time window where a race can happen
and lead to directory logging skipping dir index keys that should be
logged. The race happens like this:

1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks
   that the inode item was logged before, but before it sets the inode's
   ->last_dir_index_offset to (u64)-1...

2) Task B is at btrfs_log_inode() which calls inode_logged() early, and
   that has set ->last_dir_index_offset to (u64)-1;

3) Task B then enters log_directory_changes() which calls
   update_last_dir_index_offset(). There it sees ->last_dir_index_offset
   is (u64)-1 and that the inode was logged before (ctx->logged_before is
   true), and so it searches for the last logged dir index key in the log
   tree and it finds that it has an offset (index) value of N, so it sets
   ->last_dir_index_offset to N, so that we can skip index keys that are
   less than or equal to N (later at process_dir_items_leaf());

4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update
   that task B just did;

5) Task B will now skip every index key when it enters
   process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1.

Fix this by making inode_logged() not touch ->last_dir_index_offset and
initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and
then having update_last_dir_index_offset() treat a value of 0 as meaning
we must check the log tree and update with the index of the last logged
index key. This is fine since the minimum possible value for
->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1).
This also simplifies the management of ->last_dir_index_offset and now
all accesses to it are done under the inode's log_mutex.

Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22 00:58:47 +02:00
Eric Biggers
fcafdd4210 btrfs: move verity info pointer to fs-specific part of inode
Move the fsverity_info pointer into the filesystem-specific part of the
inode by adding the field btrfs_inode::i_verity_info and configuring
fsverity_operations::inode_info_offs accordingly.

This is a prerequisite for a later commit that removes
inode::i_verity_info, saving memory and improving cache efficiency on
filesystems that don't support fsverity.

Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/20250810075706.172910-12-ebiggers@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-21 13:58:08 +02:00
Qu Wenruo
041c39da53 btrfs: enable large data folios for data reloc inode
Data reloc inodes are a special type of inode that is not exposed to user
space, and they are only utilized during data block group relocation.

They do not go through regular read-write operations, but have their file
extents manually created to have the same layout as a block group; then
the content is read from the original block group and written back to
the new location, which is in a new block group.

Previously all the handling was done in page units, and commit
c283289812 ("btrfs: make relocate_one_page() handle subpage case")
changed the handling to subpage blocks.

On the other hand, data reloc inodes are a perfect match for large data
folios, as each relocation cluster represents one or more data extents
that are contiguous in their logical addresses.

This patch enables large folios for data reloc inodes by:

- Remove the special handling of data reloc inodes when setting folio
  order

- Change relocate_one_folio() to return the file offset of the next
  folio
  Originally it was designed to handle fixed page-sized blocks, but with
  large folios we can handle a whole large folio, thus we have to return
  the end of the current folio.

- Remove the warning on folio_order()

- Use folio_size() to replace fixed PAGE_SIZE usage

- Use file_offset as iterator inside relocate_file_extent_cluster

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 01:13:03 +02:00
Qu Wenruo
cc38d178ff btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL
With all the preparation patches already merged, it's pretty easy to
enable large data folios:

- Remove the ASSERT() on folio size in btrfs_end_repair_bio()

- Add a helper to properly set the max folio order
  Currently due to several call sites that are fetching the bitmap
  content directly into an unsigned long, we can only support
  BITS_PER_LONG blocks for each bitmap.

- Call the helper when reading/creating an inode

The support has the following limitations:

- No large folios for data reloc inode
  The relocation code still requires page sized folios.
  But it's not that hot nor common compared to regular buffered IO.

  Will be improved in the future.

- Requires CONFIG_BTRFS_EXPERIMENTAL

- Will require all folio related operations to check whether they need
  the extra btrfs_subpage structure
  Now any folio larger than the block size will need btrfs_subpage
  structure handling.

Unfortunately I do not have a physical machine for performance tests, but
if everything goes like XFS/EXT4, it should mostly bring a single-digit
percentage performance improvement in the real world.

Although I believe there are still quite some optimizations to be done,
let's focus on testing the current large data folio support first.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Qu Wenruo
8e4f21f2b1 btrfs: handle unaligned EOF truncation correctly for subpage cases
[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:

  #fsx -d -e 1 -N 4 $mnt/junk -S 36386
  READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
  OFFSET      GOOD    BAD     RANGE
  0xe9ba      0x0000  0x03ac  0x0
  operation# (mod 256) for the bad data may be 3
  ...
  LOG DUMP (4 total operations):
  1(  1 mod 256): WRITE    0x6c62 thru 0x1147d	(0xa81c bytes) HOLE	***WWWW
  2(  2 mod 256): TRUNCATE DOWN	from 0x1147e to 0x5448	******WWWW
  3(  3 mod 256): ZERO     0x1c7aa thru 0x28fe2	(0xc839 bytes)
  4(  4 mod 256): MAPREAD  0xe9ba thru 0x1578e	(0x6dd5 bytes)	***RRRR***

[CAUSE]
Only 2 operations are really involved in this case:

  3 pollute_eof	0x5448 thru	0xffff	(0xabb8 bytes)
  3 zero	from 0x1c7aa to 0x28fe3, (0xc839 bytes)
  4 mapread	0xe9ba thru	0x1578e	(0x6dd5 bytes)

At operation 3, fsx pollutes data beyond EOF; that is done by mmap()ing
the file and writing into the mmap()ed range beyond EOF.

Such a write will fill the range beyond EOF, but it will never reach disk
as ranges beyond EOF are not marked dirty nor uptodate.

Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond
EOF (0x5448).

During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block.
But that function only really zeroes out the block at [0x5000, 0x5fff];
it doesn't bother with any range other than that block, since those
ranges will not be marked dirty nor written back.

So the range [0x6000, 0xffff] is still polluted, and later mapread()
will return the poisoned value.

[FIX]
Enhance btrfs_truncate_block() by:

- Pass a @start/@end pair to indicate the full truncation range
  This is to handle the following truncation case:

    Page size is 64K, fs block size is 4K, truncate range is
    [6K, 60K]

    0                      32K                    64K
    |   |///////////////////////////////////|     |
        6K                                  60K

    The range is not aligned for its head block, so we need to call
    btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.

    But with that information we only know to zero the range [6K, 8K),
    if we zero out the range [6K, 64K), the last block will also be
    zeroed, causing data loss.

  So here we need the full range we're truncating, so that we can avoid
  over-truncation.

- Rename @from to @offset
  As now the parameter is only utilized to locate a block, it's not
  really carrying the old @from meaning well.

- Remove @front parameter
  With the full truncate range passed in, we can determine if the
  @offset is at the head or tail block.

- Skip truncation if @offset is not in the head nor tail blocks
  The call site in hole punching unconditionally calls
  btrfs_truncate_block() without even checking whether the range is
  aligned.
  If the @offset is neither in the head nor in the tail block, it means
  we can safely ignore it.

- Skip truncate if the range inside the target block is already aligned

- Make btrfs_truncate_block() zero all blocks beyond EOF
  Since we have the original range, we know exactly if we're doing
  truncation beyond EOF (the @end will be (u64)-1).

  If we're doing truncation beyond EOF, then enlarge the truncation
  range to the folio end, to address the possibly polluted ranges.

  Otherwise still keep the zeroed range inside the block, as we can have
  large data folios soon, and always truncating every block inside the
  same folio can be costly for large folios.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:55 +02:00
Christoph Hellwig
959ddf2839 btrfs: move kmapping out of btrfs_check_sector_csum()
Move kmapping the page out of btrfs_check_sector_csum().

This allows using bvec_kmap_local() where suitable and reduces the number
of kmap*() calls in the raid56 code.

This also means btrfs_check_sector_csum() will only accept a properly
kmapped address.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Filipe Manana
92be661a57 btrfs: make btrfs_iget_path() return a btrfs inode instead
It's an internal function and btrfs_iget() is now returning a btrfs inode,
so change btrfs_iget_path() to also return a btrfs inode instead of a VFS
inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
b204e5c7d4 btrfs: make btrfs_iget() return a btrfs inode instead
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Daniel Vacek
fc5c0c5825 btrfs: defrag: extend ioctl to accept compression levels
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
David Sterba
44dddd493e btrfs: pass struct btrfs_inode to can_nocow_extent()
Pass a struct btrfs_inode to can_nocow_extent() as it's an internal
interface, allowing us to remove some uses of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
Qu Wenruo
ecde48a1a6 btrfs: expose per-inode stable writes flag
The address space flag AS_STABLE_WRITES determines whether FGP_STABLE
will wait for the folio to finish its writeback.

For btrfs, due to the default data checksum behavior, modifying a folio
while it's still under writeback will cause a data checksum mismatch.
Thus for quite some call sites we manually call folio_wait_writeback()
to prevent such problems from happening.

Currently there is only one call site inside btrfs really utilizing
FGP_STABLE, and in that case we also manually call folio_wait_writeback()
to do the waiting.

But it's better to properly expose the stable writes flag on a per-inode
basis, to allow call sites to fully benefit from the FGP_STABLE flag,
e.g. for inodes with NODATASUM, allowing them to begin dirtying the page
without waiting for writeback.

This involves:

- Update the mapping's stable write flag when setting/clearing the
  NODATASUM inode flag using the ioctl (see the sketch after this list)
  This only works for empty files, so it should be fine.

- Update the mapping's stable write flag when reading an inode from disk

- Remove the explicit folio_wait_writeback() for FGP_BEGINWRITE call
  site
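
A minimal sketch of the per-inode flag update described above, assuming a
check of the NODATASUM inode flag (not the exact mainline helper):

	/* Keep the address space flag in sync with the inode's checksum state. */
	if (inode->flags & BTRFS_INODE_NODATASUM)
		mapping_clear_stable_writes(inode->vfs_inode.i_mapping);
	else
		mapping_set_stable_writes(inode->vfs_inode.i_mapping);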

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Filipe Manana
6c44075524 btrfs: remove no longer needed strict argument from can_nocow_extent()
All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Mark Harmstone
34310c442e btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
Add an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.

btrfs_uring_encoded_read() is an io_uring version of
btrfs_ioctl_encoded_read(), which validates the user input and calls
btrfs_encoded_read() to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages() through
btrfs_uring_read_extent() to prepare the bio.

The existing btrfs_encoded_read_regular_fill_pages() is changed so that
if it is passed a valid uring_ctx, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio(). This in turn copies
the read data back to userspace, and calls io_uring_cmd_done() to
complete the io_uring command.

Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent() doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to the btrfs_uring_read_finished() that gets called when
the bio completes.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Mark Harmstone
26efd44796 btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
Change the behaviour of btrfs_encoded_read() so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular() and unlocking the extent and inode.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
David Sterba
590168edbe btrfs: drop unused parameter file_offset from btrfs_encoded_read_regular_fill_pages()
The file_offset parameter used to be passed to encoded read struct but
was removed in commit b665affe93 ("btrfs: remove unused members from
struct btrfs_encoded_read_private").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:16 +01:00
Qu Wenruo
0fcaf926ad btrfs: remove btrfs_set_range_writeback()
The function btrfs_set_range_writeback() was originally a callback for
metadata and data, to mark a range with writeback flag.

Then it was converted into a common function call for both metadata and
data.

From the very beginning, the function had only been called on a full
page; later it was converted to handle a range inside a page.

But it never needed to handle multiple pages, and since commit
8189197425 ("btrfs: refactor __extent_writepage_io() to do
sector-by-sector submission") the function was only called on a
sector-by-sector basis.

This makes the function unnecessary, and can be converted to a simple
btrfs_folio_set_writeback() call instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:15 +01:00
Filipe Manana
7ee85f5515 btrfs: fix race setting file private on concurrent lseek using same fd
When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.

The race happens like this:

1) A program opens a file descriptor for a file and then spawns two
   threads (with the pthreads library for example), lets call them
   task A and task B;

2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
   file.c:find_desired_extent() while holding a read lock on the inode;

3) At the start of find_desired_extent(), it extracts the file's
   private_data pointer into a local variable named 'private', which has
   a value of NULL;

4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
   in shared mode and enters file.c:find_desired_extent(), where it also
   extracts file->private_data into its local variable 'private', which
   has a NULL value;

5) Because it saw a NULL file private, task A allocates a private
   structure and assigns it to the file structure;

6) Task B also saw a NULL file private so it also allocates its own file
   private and then assigns it to the same file structure, since both
   tasks are using the same file descriptor.

   At this point we leak the private structure allocated by task A.

Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
ends up being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.

Fix this by protecting the check and assignment of a file's private while
holding the inode's spinlock (sketched below), and keep track of the task
that allocated the private, so that it's used only by that task in order
to prevent use-after-free issues with the cached state record as well as
potentially using it incorrectly in the future.
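
A minimal sketch of the protected check-and-assign (the owner-task
tracking is omitted; lock and member names here are assumptions):

	struct btrfs_file_private *private;

	private = kzalloc(sizeof(*private), GFP_KERNEL);
	if (!private)
		return -ENOMEM;

	spin_lock(&inode->lock);
	if (!file->private_data) {
		file->private_data = private;	/* we won the race */
		private = NULL;
	}
	spin_unlock(&inode->lock);
	kfree(private);		/* either NULL or the loser's allocation */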

Fixes: 3c32c7212f ("btrfs: use cached state when looking for delalloc ranges with lseek")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-17 17:31:48 +02:00
David Sterba
070969f17d btrfs: rework BTRFS_I as macro to preserve parameter const
Currently BTRFS_I is a static inline function that takes a const inode
and returns a btrfs inode, dropping the 'const' qualifier. This can break
assumptions of the compiler, though it seems there's no real case.

To make the parameter and return type consistent regarding const we can
use container_of_const(), which preserves it. However this would not
check the parameter type. To fix that, use the same _Generic construct
but implement only the two expected types.
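
A minimal sketch of the container_of_const() based variant mentioned
above; the mainline macro instead open-codes the _Generic with only the
two expected struct inode pointer types, so the parameter type is checked
as well:

	#define BTRFS_I(_inode) container_of_const(_inode, struct btrfs_inode, vfs_inode)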

Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:22 +02:00
Filipe Manana
1b6e068a0c btrfs: add and use helper to verify the calling task has locked the inode
We have a few places that check if we have the inode locked by doing:

    ASSERT(inode_is_locked(vfs_inode));

This actually proved to be useful several times as if assertions are
enabled (and by default they are in many distros) it immediately triggers
a crash which is impossible for users to miss.

However that doesn't check if the lock is held by the calling task, so
the check passes if some other task locked the inode.

Using one of the lockdep functions to check the lock is held, like
lockdep_assert_held() for example, does check that the calling task
holds the lock, and if that's not the case it produces a warning and
stack trace in dmesg. However, despite the misleading "assert" in the
name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
a warning and splat in dmesg, which can easily go unnoticed by users
who may have lockdep enabled.

So add a helper that does the ASSERT() and calls lockdep_assert_held()
immediately after, and use it everywhere we check the inode is locked.
This way, if the lock is held by some other task we get the warning in
dmesg, which is caught by fstests, very helpful during development, and
may also be occasionally noticed by users with lockdep enabled.
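
A sketch of such a helper along the lines described above (the name and
exact placement are assumptions):

	static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
	{
		/* Crashes if assertions are enabled and the lock is not held at all. */
		ASSERT(inode_is_locked(&inode->vfs_inode));
		/* Warns with a stack trace if the calling task does not hold it. */
		lockdep_assert_held(&inode->vfs_inode.i_rwsem);
	}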

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:22 +02:00
Josef Bacik
dce9ef9412 btrfs: convert btrfs_get_extent() to take a folio
We only pass this into read_inline_extent, change it to take a folio and
update the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
d71b53c3cb btrfs: convert btrfs_writepage_cow_fixup() to use folio
Instead of a page, use a folio for btrfs_writepage_cow_fixup.  We
already have a folio at the only caller, and the fixup worker uses
folios.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
2609c9289f btrfs: convert btrfs_run_delalloc_range() to take a folio
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Filipe Manana
9aa29a20b7 btrfs: move the direct IO code into its own file
The direct IO code is over a thousand lines and it's currently spread
between file.c and inode.c, which makes it not easy to locate some parts
of it sometimes. Also inode.c is about 11 thousand lines and file.c about
4 thousand lines, both too big. So move all the direct IO code into a
dedicated file, so that it's easy to locate all its code and reduce the
sizes of inode.c and file.c.

This is a pure move of code without any other changes, except exporting
a couple of functions from inode.c (get_extent_allocation_hint() and
create_io_em()) because they are used in inode.c and the new direct-io.c
file, and a couple of functions from file.c (btrfs_buffered_write() and
btrfs_write_check()) because they are used both in file.c and in the new
direct-io.c file.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
David Sterba
8610ba7eab btrfs: pass a btrfs_inode to is_data_inode()
Pass a struct btrfs_inode to is_data_inode() as it's an
internal interface, allowing us to remove some uses of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
Filipe Manana
d383eb69eb btrfs: remove super block argument from btrfs_iget_path()
It's pointless to pass a super block argument to btrfs_iget_path() because
we always pass a root and from it we can get the super block through:

   root->fs_info->sb

So remove the super block argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:25 +02:00
Filipe Manana
d13240dd0a btrfs: remove super block argument from btrfs_iget()
It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:

   root->fs_info->sb

So remove the super block argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:25 +02:00
David Sterba
2917f74102 btrfs: constify pointer parameters where applicable
We can add const to many parameters, this is for clarity and a minor
addition to safety. There are some minor effects in the assembly
code and the .ko, measured on a release config. This patch does not cover
all possible conversions.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:22 +02:00
Qu Wenruo
3b8dbf3425 btrfs: cleanup recursive include of the same header
We have several headers that are including themselves, triggering clangd
warnings.
Such includes are caused by commit 602035d7fe ("btrfs: add forward
declarations and headers, part 2").

Just remove such unnecessary includes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:22 +02:00
Qu Wenruo
e9ea31fb5c btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent
All parameters after @filepos of btrfs_alloc_ordered_extent() can be
replaced with btrfs_file_extent structure.

This patch does the cleanup, meanwhile some points to note:

- Move btrfs_file_extent structure to ordered-data.h
  The structure is needed by both btrfs_alloc_ordered_extent() and
  can_nocow_extent(), and since btrfs_inode.h includes
  ordered-data.h, we need to move the structure to ordered-data.h.

- Move the special handling of NOCOW/PREALLOC into
  btrfs_alloc_ordered_extent()
  This is to allow btrfs_split_ordered_extent() to properly split them
  for DIO.
  For now just move the handling into btrfs_alloc_ordered_extent() to
  simplify the callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
cdc627e65c btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args
The following functions and structures can be simplified using the
btrfs_file_extent structure:

- can_nocow_extent()
  No need to return ram_bytes/orig_block_len through the parameter list,
  the @file_extent parameter contains all the needed info.

- can_nocow_file_extent_args
  The following members are no longer needed:

  * disk_bytenr
    This one is confusing as it's not really the
    btrfs_file_extent_item::disk_bytenr, but where the IO would be,
    thus it's file_extent::disk_bytenr + file_extent::offset now.

  * num_bytes
    Now file_extent::num_bytes.

  * extent_offset
    Now file_extent::offset.

  * disk_num_bytes
    Now file_extent::disk_num_bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
4aa7b5d178 btrfs: remove extent_map::orig_start member
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.

And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Qu Wenruo
87a6962f73 btrfs: export the expected file extent through can_nocow_extent()
Currently the function can_nocow_extent() only returns the members needed
for extent_map.

However since we will soon change the extent_map structure to be more
like btrfs_file_extent_item, we want to expose the expected file extent
caused by the NOCOW write for future usage.

This introduces a new structure, btrfs_file_extent, to be a more
memory access friendly representation of btrfs_file_extent_item.
And use that structure to expose the expected file extent caused by the
NOCOW write.

For now there is no user of the new structure yet.
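
A sketch of the structure with the fields implied by this series (field
list and types are assumptions, not the exact definition):

	struct btrfs_file_extent {
		u64 disk_bytenr;
		u64 disk_num_bytes;
		u64 num_bytes;
		u64 ram_bytes;
		u64 offset;
	};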

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Filipe Manana
7a7bc21449 btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number since we store it in the VFS
inode's i_ino member, which is an unsigned long and this type is 64 bits
wide on 64 bits platforms. We only need that field in case we are on a
32 bits platform because the unsigned long type is 32 bits wide on such
platforms. See commit 33345d0152 ("Btrfs: Always use 64bit inode number")
regarding this 64/32 bits detail.

The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.

So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the VFS inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
inode is a directory and reflinks can't be done against directories.

So remove the objectid field for 64 bits platforms and alias the
last_reflink_trans field with the name ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1040 bytes down to 1032 bytes.
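
A sketch of the aliasing described above, for 64 bits platforms
(illustrative only):

	union {
		/* For regular use: transaction id of the last reflink. */
		u64 last_reflink_trans;
		/* For BTRFS_INODE_ROOT_STUB directories: ID of the unreferenced root. */
		u64 ref_root_id;
	};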

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
068fc8f914 btrfs: remove location key from struct btrfs_inode
Currently struct btrfs_inode has a key member, named "location", that is
either:

1) The key of the inode's item. In this case the objectid is the number
   of the inode;

2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
   the case where we have a root that is a snapshot of a subvolume that
   points to other subvolumes. In this case the objectid is the ID of
   a subvolume inside the snapshotted parent subvolume.

The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() for 32 bits platforms and its objectid is
only needed for unlink.

Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.

So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
d9891ae28b btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
The index_cnt field of struct btrfs_inode is used only for two purposes:

1) To store the index for the next entry added to a directory;

2) For the data relocation inode to track the logical start address of the
   block group currently being relocated.

For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).

Since the csum_bytes field is not used for directories, do the following
changes:

1) Put index_cnt and csum_bytes in a union, and index_cnt is only
   initialized when the inode is a directory. The csum_bytes is only
   accessed in IO paths for regular files, so we're fine here;

2) Use the defrag_bytes field for relocation, since the data relocation
   inode is never used for defrag purposes. And to make the naming better,
   alias it to reloc_block_group_start by using a union.

This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
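
A sketch of the two unions described above (illustrative only):

	union {
		/* For directories: index for the next added entry. */
		u64 index_cnt;
		/* For regular files: only accessed in the IO paths. */
		u64 csum_bytes;
	};
	union {
		u64 defrag_bytes;
		/* For the data relocation inode: logical start of the block
		 * group being relocated. */
		u64 reloc_block_group_start;
	};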

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
310b2f5d5a btrfs: use an xarray to track open inodes in a root
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.

We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
to reduce the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3 inodes.

This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
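
A minimal sketch of the insertion, assuming an xarray member named
'inodes' in struct btrfs_root and an illustrative GFP flag:

	int ret = xa_insert(&root->inodes, btrfs_ino(inode), inode, GFP_NOFS);

	if (ret == -ENOMEM)
		return ret;	/* xarray node allocation failed */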

Running the following fs_mark test showed some improvements:

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/nullb0
    MNT=/mnt/nullb0
    MOUNT_OPTIONS="-o ssd"
    FILES=100000
    THREADS=$(nproc --all)

    echo "performance" | \
        tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    mkfs.btrfs -f $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
    for ((i = 1; i <= $THREADS; i++)); do
        OPTS="$OPTS -d $MNT/d$i"
    done

    fs_mark $OPTS

    umount $MNT

Before this patch:

    FSUse%        Count         Size    Files/sec     App Overhead
        10      1200000            0      92081.6         12505547
        16      2400000            0     138222.6         13067072
        23      3600000            0     148833.1         13290336
        43      4800000            0      97864.7         13931248
        53      6000000            0      85597.3         14384313

After this patch:

    FSUse%        Count         Size    Files/sec     App Overhead
        10      1200000            0      93225.1         12571078
        16      2400000            0     146720.3         12805007
        23      3600000            0     160626.4         13073835
        46      4800000            0     116286.2         13802927
        53      6000000            0      90087.9         14754892

The test was run with a release kernel config (Debian's default config).

Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):

Before this patch, inode_tree_add() execution times:

   Count: 5000000
   Range:  0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
   Percentiles:  90th: 1015.000; 95th: 1139.000; 99th: 1397.000
         0.000 -       7.816:      40 |
         7.816 -      37.858:     209 |
        37.858 -     170.278:    6059 |
       170.278 -     753.961: 2754890 #####################################################
       753.961 -    3326.728: 2232312 ###########################################
      3326.728 -   14667.018:    4366 |
     14667.018 -   64652.943:     852 |
     64652.943 -  284981.761:     550 |
    284981.761 - 1256150.914:     221 |
   1256150.914 - 5536887.000:       7 |

After this patch, btrfs_add_inode_to_root() execution times:

   Count: 5000000
   Range:  0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
   Percentiles:  90th: 342.000; 95th: 432.000; 99th: 572.000
        0.000 -       7.264:     104 |
        7.264 -      33.145:     352 |
       33.145 -     140.081:  109606 #
      140.081 -     581.930: 4840090 #####################################################
      581.930 -    2407.590:   43532 |
     2407.590 -    9950.979:    2245 |
     9950.979 -   41119.278:     514 |
    41119.278 -  169902.616:     155 |
   169902.616 -  702018.539:      47 |
   702018.539 - 2900652.000:       9 |

Average, percentiles, standard deviation, etc, are all much better.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
f13e01b89d btrfs: ensure fast fsync waits for ordered extents after a write failure
If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because dropping the extent maps happens when completing ordered
extents, at btrfs_finish_one_ordered(), and the completion of an ordered
extent is executed in a work queue.

This can result in a fast fsync to start logging file extent items based
on existing extent maps before the ordered extents complete, therefore
resulting in a log that has file extent items that point to unwritten
extents, resulting in a corrupt file if a crash happens after and the log
tree is replayed the next time the fs is mounted.

This can happen for both direct IO writes and buffered writes.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extent maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete, which is an intentional behaviour in order to reduce
   latency.

For the buffered write case, here's one example:

1) A fast fsync begins, and it starts by flushing delalloc and waiting for
   the writeback to complete by calling filemap_fdatawait_range();

2) Flushing the delalloc created a new extent map X;

3) During the writeback some IO error happened, and at the end io callback
   (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
   sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
   completion;

4) After queuing the ordered extent completion, the end io callback clears
   the writeback flag from all pages (or folios), and from that moment the
   fast fsync can proceed;

5) The fast fsync proceeds sees extent map X and logs a file extent item
   based on extent map X, resulting in a log that points to an unwritten
   data extent - because the ordered extent completion hasn't run yet, it
   happens only after the logging.

To fix this make btrfs_finish_ordered_extent() set the inode flag
BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
so that a fast fsync will wait for ordered extent completion.

Note that this issues of using extent maps that point to unwritten
locations can not happen for reads, because in read paths we start by
locking the extent range and wait for any ordered extents in the range
to complete before looking for extent maps.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-28 16:35:12 +02:00
Filipe Manana
65bb9fb00b btrfs: update comment for btrfs_set_inode_full_sync() about locking
Nowadays we have a lock used to synchronize mmap writes with reflink and
fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment
for btrfs_set_inode_full_sync() to mention that it can also be called
while holding that mmap lock. Besides being a valid alternative to the
inode's VFS lock, we already have the extent map shrinker using that mmap
lock instead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
5e485ac6f0 btrfs: export find_next_inode() as btrfs_find_first_inode()
Export the relocation private helper find_next_inode() to inode.c, as this
same logic is also used at btrfs_prune_dentries() and will be used by an
upcoming change that adds an extent map shrinker. The next patch will
change btrfs_prune_dentries() to use this helper.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
0ddefc2a7c btrfs: move btrfs_page_mkwrite() from inode.c into file.c
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we
define that structure in file.c. Currently the function is in inode.c and
has to be exported to be used in file.c, which makes no sense because it's
not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into
file.c.

While at it do a few minor style changes:

1) Capitalize the first word of every comment and end each sentence with
   punctuation;

2) Avoid splitting some statements into two lines when everything fits in
   85 characters or less.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
David Sterba
5a8a57f9a4 btrfs: merge btrfs_del_delalloc_inode() helpers
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode()
don't follow the usual pattern where the "__" helper handles a special
case, and are in fact reversed regarding the naming. We can merge them
into one as there's only one place that needs to be open coded.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:54 +01:00
Filipe Manana
bdc0f89e06 btrfs: reduce inode lock critical section when setting and clearing delalloc
When setting and clearing a delalloc range, at btrfs_set_delalloc_extent()
and btrfs_clear_delalloc_extent(), we are adding/removing the inode
to/from the root's list of delalloc inodes while under the protection of
the inode's lock. This however is not needed, we can add and remove the
inode to the root's list without holding the inode's lock because here
we are under the protection of the io tree's lock, reducing the size of
the critical section delimited by the inode's lock. The inode's lock is
used in many other places such as when finishing an ordered extent (when
calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(),
or decreasing the number of outstanding extents) or when reserving space
when doing a buffered or direct IO write (calls to functions from
delalloc-space.c).

So move the inode add/remove operations on the root's list of delalloc
inodes to outside the critical section delimited by the inode's lock.
This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag
since we can rely on the inode's delalloc bytes counter to determine if
the inode is or is not in the list.

The following fio based test, that exercises IO to multiple files in the
same subvolume, was used to test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nullb0
   MNT=/mnt/nullb0
   MOUNT_OPTIONS="-o ssd"

   mkfs.btrfs -f $DEV &> /dev/null
   mount $MOUNT_OPTIONS $DEV $MNT

   fio --direct=0 --ioengine=sync --thread --directory=$MNT \
       --invalidate=1 --group_reporting=1 \
       --new_group --rw=randwrite --size=50m --numjobs=200 \
       --bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \
       --name=foo --filename_format=FioWorkloads.\$jobnum

   umount $MNT

The test was run on a non-debug kernel (Debian's default kernel config)
against a 16G null block device.

Result before this patch:

   WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec

Result after this patch:

   WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
f5169f12d7 btrfs: stop passing root argument to __btrfs_del_delalloc_inode()
There's no need to pass a root argument to __btrfs_del_delalloc_inode()
and btrfs_del_delalloc_inode(), we can just pass the inode since the root
is always the root associated to that inode. So remove the root argument
from these functions.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
David Sterba
5693a1286a btrfs: add forward declarations and headers, part 3
Do a cleanup in the rest of the headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00