Commit graph

415 commits

Author SHA1 Message Date
Qu Wenruo
c2ffb1ec1a btrfs: prepare compression folio alloc/free for bs > ps cases
This includes the following preparation for bs > ps cases:

- Always alloc/free the folio directly if bs > ps
  This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
  affecting all compression algorithms.

  For btrfs_free_compr_folio() it needs no parameter for now, as we can
  use the folio size to skip the caching part.

  For now the change just passes a @fs_info into the function; all
  folio size assumptions are still based on page size.

- Properly zero the last folio in compress_file_range()
  Since the compressed folios can be larger than a page, we need to
  properly zero the whole folio.

- Use correct folio size for btrfs_add_compressed_bio_folios()
  Instead of page size, use the correct folio size.

- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
  As we are not only using simple page sized folios anymore.

- Use correct folio size for btrfs_decompress()
  There is an ASSERT() making sure the decompressed range is no larger
  than a page, which will be triggered for bs > ps cases.

- Skip readahead for compressed pages
  Similar to subpage cases.

- Make btrfs_alloc_folio_array() accept a new @order parameter

- Add a helper to calculate the minimal folio size

All those changes should not affect the existing bs <= ps handling.
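
As an illustration of the last item above, a hedged sketch of what a
minimal compression folio size helper could look like; the helper name
and exact policy are assumptions, not the in-tree code:

  /*
   * Sketch only: the smallest folio that can hold one fs block, but
   * never smaller than a page.
   */
  static inline unsigned int btrfs_compr_folio_size_sketch(const struct btrfs_fs_info *fs_info)
  {
  	return max_t(unsigned int, PAGE_SIZE, fs_info->sectorsize);
  }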

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:24 +02:00
David Sterba
17dc82dc1e btrfs: fix typos in comments and strings
Annual typo fixing pass. Strangely, codespell found only about 30% of
what is in this patch; the rest was done manually using a text
spellchecker with a custom dictionary of acceptable terms.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
Qu Wenruo
0d0b80929e btrfs: rename btrfs_compress_op to btrfs_compress_levels
Since all workspace managers are per-fs, there is no need nor any way to
store them inside btrfs_compress_op::wsm anymore.

With that said, we can do the following modifications:

- Remove zstd_workspace_manager::ops
  Zstd always grabs the global btrfs_compress_op[].
- Remove btrfs_compress_op::wsm member
- Rename btrfs_compress_op to btrfs_compress_levels

This should make it clearer that the btrfs_compress_levels structures
only indicate the levels of each compression algorithm.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
Qu Wenruo
9c8f4cf456 btrfs: cleanup the per-module compression workspace managers
Since all workspaces are handled by the per-fs workspace managers, we
can safely remove the old per-module managers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
Qu Wenruo
856d46c313 btrfs: migrate to use per-fs workspace manager
There are several interfaces involved for each algorithm:

- alloc workspace
  All algorithms allocate a workspace without needing the workspace
  manager, so no change is needed.

- get workspace
  This involves checking the workspace manager to find a free one, and
  if not, allocate a new one.

  None and lzo share the same generic btrfs_get_workspace() helper, so
  only that function needs to be updated to use the per-fs manager.

  For zlib it uses a wrapper around btrfs_get_workspace(), so no special
  work is needed.

  For zstd, update zstd_find_workspace() and zstd_get_workspace() to
  utilize the per-fs manager.

- put workspace
  For none/zlib/lzo they share the same btrfs_put_workspace(); update
  that function to use the per-fs manager.

  For zstd it is zstd_put_workspace(), which gets the same update.

- zstd specific timer
  This is the timer that reclaims workspaces; change it to grab the
  per-fs workspace manager instead.

Now all workspaces are managed by the per-fs manager.
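
A hedged sketch of the per-fs lookup pattern this migration enables; the
field and helper names below are assumptions for illustration, not the
exact in-tree code:

  /* Sketch: find an idle workspace in this fs's manager, or allocate one. */
  static struct list_head *get_workspace_sketch(struct btrfs_fs_info *fs_info,
  					       int type, int level)
  {
  	struct workspace_manager *wsm = fs_info->compr_wsm[type];
  	struct list_head *ws;

  	spin_lock(&wsm->ws_lock);
  	if (!list_empty(&wsm->idle_ws)) {
  		ws = wsm->idle_ws.next;
  		list_del(ws);
  		wsm->free_ws--;
  		spin_unlock(&wsm->ws_lock);
  		return ws;
  	}
  	spin_unlock(&wsm->ws_lock);

  	/* No idle workspace: allocate one sized for this fs (covers bs > ps). */
  	return alloc_workspace(fs_info, type, level);
  }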

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:15 +02:00
Qu Wenruo
6f9c3f48ac btrfs: add generic workspace manager initialization
This involves:

- Add (alloc|free)_workspace_manager helpers.
  These are the helpers to allocate and free the workspace_manager
  structure.

  The allocator will allocate a workspace_manager structure, initialize
  it, and pre-allocate one workspace for it.

  The free helper will do the cleanup and set the manager pointer to NULL.

- Call alloc_workspace_manager() inside btrfs_alloc_compress_wsm()
- Call free_workspace_manager() inside btrfs_free_compress_wsm()
  For none, zlib and lzo compression algorithms.

For now the generic per-fs workspace managers won't really have any effect,
and all compression is still going through the global workspace manager.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:15 +02:00
Qu Wenruo
330f02b136 btrfs: add workspace manager initialization for zstd
This involves:

- Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager()
  Those two functions will accept an fs_info pointer, and allocate/free
  the fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer.

- Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm()
  Those are helpers allocating the workspace managers for all
  algorithms.
  For now only zstd is supported, and the timing is a little unusual:
  btrfs_alloc_compress_wsm() should only be called after the sectorsize
  has been initialized.

  Meanwhile btrfs_free_fs_info_compress() is called in
  btrfs_free_fs_info().

- Move the definition of btrfs_compression_type to "fs.h"
  The reason is that "compression.h" already includes "fs.h", thus "fs.h"
  cannot in turn include "compression.h" to get the definition of
  BTRFS_NR_COMPRESS_TYPES needed to define fs_info::compr_wsm[].

For now the per-fs zstd workspace manager won't really have any effect,
and all compression is still going through the global workspace manager.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:15 +02:00
Qu Wenruo
2c5cca03c1 btrfs: add an fs_info parameter for compression workspace manager
[BACKGROUND]
Currently btrfs shares workspaces and their managers across all
filesystems. This is mostly fine, as all those workspaces use page size
based buffers, and btrfs only supports block size (bs) <= page size (ps).

This means even if bs < ps, we at most waste some buffer space in the
workspace, but everything will still work fine.

The problem is that this limits our support for bs > ps cases.

A workspace may now need a larger buffer to handle bs > ps, but since
the pool has no way to distinguish different workspaces, a regular
workspace (which still uses buffers sized based on ps) can be handed to
a btrfs whose bs > ps.

In that case the buffer is not large enough, which will cause various
problems.

[ENHANCEMENT]
To prepare for the per-fs workspace migration, add an fs_info parameter
to all workspace related functions.

For now this new fs_info parameter is not yet utilized.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:15 +02:00
Qu Wenruo
d71b419f27 btrfs: pass btrfs_inode pointer directly into btrfs_compress_folios()
Of the 3 supported compression algorithms, two (zstd and zlib) already
grab the btrfs inode for error messages.

It's more common to pass the btrfs_inode and grab the address space from
it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Boris Burkov
f07b855c56 btrfs: try to search for data csums in commit root
If you run a workload with:

- a cgroup that does tons of parallel data reading, with a working set
  much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
  with no memory limit
(see full code listing at the bottom for a reproducer)

Then what quickly occurs is:

- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
  memory pressure

The result of this is that we write back the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.

As a result of this, we achieve an unpleasant priority inversion. We
have:

- a high degree of contention on the csum root node (and other upper
  nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
  (running delayed_refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
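
A hedged sketch of the kind of generation check described above; the
helper name and exact condition are illustrative assumptions:

  /*
   * Sketch: csums for extents written in an already committed transaction
   * are visible in the commit root, so we can look them up there instead
   * of contending on the csum tree locks of the running transaction.
   */
  static bool can_search_commit_root_sketch(const struct btrfs_fs_info *fs_info,
  					    u64 max_extent_map_gen)
  {
  	return max_extent_map_gen < btrfs_get_fs_generation(fs_info);
  }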

To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):

====== big-read.c ======
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <errno.h>

  #define BUF_SZ (128 * (1 << 10UL))

  int read_once(int fd, size_t sz) {
  	char buf[BUF_SZ];
  	size_t rd = 0;
  	int ret = 0;

  	while (rd < sz) {
  		ret = read(fd, buf, BUF_SZ);
  		if (ret < 0) {
  			if (errno == EINTR)
  				continue;
  			fprintf(stderr, "read failed: %d\n", errno);
  			return -errno;
  		} else if (ret == 0) {
  			break;
  		} else {
  			rd += ret;
  		}
  	}
  	return rd;
  }

  int read_loop(char *fname) {
  	int fd;
  	struct stat st;
  	size_t sz = 0;
  	int ret;

  	while (1) {
  		fd = open(fname, O_RDONLY);
  		if (fd == -1) {
  			perror("open");
  			return 1;
  		}
  		if (!sz) {
  			if (!fstat(fd, &st)) {
  				sz = st.st_size;
  			} else {
  				perror("stat");
  				return 1;
  			}
  		}

                  ret = read_once(fd, sz);
  		close(fd);
  	}
  }

  int main(int argc, char *argv[]) {
  	int fd;
  	struct stat st;
  	off_t sz;
  	char *buf;
  	int ret;

  	if (argc != 2) {
  		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
  		return 1;
  	}

  	return read_loop(argv[1]);
  }

====== repro.sh ======
  #!/usr/bin/env bash

  SCRIPT=$(readlink -f "$0")
  DIR=$(dirname "$SCRIPT")

  dev=$1
  mnt=$2
  shift
  shift

  CG_ROOT=/sys/fs/cgroup
  BAD_CG=$CG_ROOT/bad-nbr
  GOOD_CG=$CG_ROOT/good-nbr
  NR_BIGGOS=1
  NR_LITTLE=10
  NR_VICTIMS=32
  NR_VILLAINS=512

  START_SEC=$(date +%s)

  _elapsed() {
  	echo "elapsed: $(($(date +%s) - $START_SEC))"
  }

  _stats() {
  	local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

  	echo "================"
  	date
  	_elapsed
  	cat $sysfs/commit_stats
  	cat $BAD_CG/memory.pressure
  }

  _setup_cgs() {
  	echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
  	mkdir -p $GOOD_CG
  	mkdir -p $BAD_CG
  	echo max > $BAD_CG/memory.max
  	# memory.high much less than the working set will cause heavy reclaim
  	echo $((1 << 30)) > $BAD_CG/memory.high

  	# victims get a subset of villain CPUs
  	echo 0 > $GOOD_CG/cpuset.cpus
  	echo 0,1,2,3 > $BAD_CG/cpuset.cpus
  }

  _kill_cg() {
  	local cg=$1
  	local attempts=0
  	echo "kill cgroup $cg"
  	[ -f $cg/cgroup.procs ] || return
  	while true; do
  		attempts=$((attempts + 1))
  		echo 1 > $cg/cgroup.kill
  		sleep 1
  		procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
  		[ $procs -eq 0 ] && break
  	done
  	rmdir $cg
  	echo "killed cgroup $cg in $attempts attempts"
  }

  _biggo_vol() {
  	echo $mnt/biggo_vol.$1
  }

  _biggo_file() {
  	echo $(_biggo_vol $1)/biggo
  }

  _subvoled_biggos() {
  	total_sz=$((10 << 30))
  	per_sz=$((total_sz / $NR_VILLAINS))
  	dd_count=$((per_sz >> 20))
  	echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
  	for i in $(seq $NR_VILLAINS)
  	do
  		btrfs subvol create $(_biggo_vol $i) &>/dev/null
  		dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
  	done
  	echo "done creating subvols."
  }

  _setup() {
  	[ -f .done ] && rm .done
  	findmnt -n $dev && exit 1
        if [ -f .re-mkfs ]; then
		mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
	else
		echo "touch .re-mkfs to populate the test fs"
	fi

  	mount -o noatime $dev $mnt || exit 3
  	[ -f .re-mkfs ] && _subvoled_biggos
  	_setup_cgs
  }

  _my_cleanup() {
  	echo "CLEANUP!"
  	_kill_cg $BAD_CG
  	_kill_cg $GOOD_CG
  	sleep 1
  	umount $mnt
  }

  _bad_exit() {
  	_err "Unexpected Exit! $?"
  	_stats
  	exit $?
  }

  trap _my_cleanup EXIT
  trap _bad_exit INT TERM

  _setup

  # Use a lot of page cache reading the big file
  _villain() {
  	local i=$1
  	echo $BASHPID > $BAD_CG/cgroup.procs
  	$DIR/big-read $(_biggo_file $i)
  }

  # Hit del_csum a lot by overwriting lots of small new files
  _victim() {
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	i=0;
  	while (true)
  	do
  		local tmp=$mnt/tmp.$i

  		dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
  		i=$((i+1))
  		[ $i -eq $NR_LITTLE ] && i=0
  	done
  }

  _one_sync() {
  	echo "sync..."
  	before=$(date +%s)
  	sync
  	after=$(date +%s)
  	echo "sync done in $((after - before))s"
  	_stats
  }

  # sync in a loop
  _sync() {
  	echo "start sync loop"
  	syncs=0
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	while true
  	do
  		[ -f .done ] && break
  		_one_sync
  		syncs=$((syncs + 1))
  		[ -f .done ] && break
  		sleep 10
  	done
  	if [ $syncs -eq 0 ]; then
  		echo "do at least one sync!"
  		_one_sync
  	fi
  	echo "sync loop done."
  }

  _sleep() {
  	local time=${1-60}
  	local now=$(date +%s)
  	local end=$((now + time))
  	while [ $now -lt $end ];
  	do
  		echo "SLEEP: $((end - now))s left. Sleep 10."
  		sleep 10
  		now=$(date +%s)
  	done
  }

  echo "start $NR_VILLAINS villains"
  for i in $(seq $NR_VILLAINS)
  do
  	_villain $i &
  	disown # get rid of annoying log on kill (done via cgroup anyway)
  done

  echo "start $NR_VICTIMS victims"
  for i in $(seq $NR_VICTIMS)
  do
  	_victim &
  	disown
  done

  _sync &
  SYNC_PID=$!

  _sleep $1
  _elapsed
  touch .done
  wait $SYNC_PID

  echo "OK"
  exit 0

Without this patch, that reproducer:

- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul  9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
                 btrfs_tree_read_lock_nested
                 btrfs_read_lock_root_node
                 btrfs_search_slot
                 btrfs_lookup_csum
                 btrfs_lookup_bio_sums
                 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
                 btrfs_tree_lock_nested
                 btrfs_lock_root_node
                 btrfs_search_slot
                 btrfs_del_csums
                 __btrfs_run_delayed_refs
                 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.

sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

Some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
                 shrink_folio_list
                 shrink_lruvec
                 shrink_node
                 do_try_to_free_pages
                 try_to_free_mem_cgroup_pages
                 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Qu Wenruo
b98b208300 btrfs: reject invalid compression level
Inspired by recent changes to compression level parsing in
6db1df415d ("btrfs: accept and ignore compression level for lzo")
it turns out that we do not do any extra validation for compression
level input string, thus allowing things like "compress=lzo:invalid" to
be accepted without warnings.

Although we accept levels beyond the supported algorithm ranges,
accepting a completely invalid level specification is not correct.

Fix the overly loose compression level checks by doing proper error
handling of kstrtoint(), so that we reject not only values that are too
large (beyond int range) but also completely invalid levels like
"lzo:invalid".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-18 13:18:49 +02:00
David Sterba
ab5fcbb1ad btrfs: use pgoff_t for page index variables
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:05 +02:00
George Hu
afd1dacbd0 btrfs: replace nested usage of min & max with clamp in btrfs_compress_set_level()
Refactor the btrfs_compress_set_level() function by replacing the
nested usage of min() and max() macro with clamp() to simplify the
code and improve readability.
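
An illustrative sketch of this kind of rewrite, with generic names (not
the exact in-tree code):

  /* Sketch: keep a requested level within [min_level, max_level]. */
  static int set_level_sketch(int level, int min_level, int max_level)
  {
  	/* Replaces the nested min(max_level, max(min_level, level)) form. */
  	return clamp(level, min_level, max_level);
  }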

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: George Hu <integral@archlinux.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:05 +02:00
David Sterba
44cac52341 btrfs: use our message helpers instead of pr_err/pr_warn/pr_info
Our message helpers accept a NULL fs_info in contexts that cannot
provide one and still print the common message header. The pr_* helpers
should be used only for special reasons, like module loading, device
scanning or multi-line output (print-tree).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:04 +02:00
David Sterba
d2080c7a00 btrfs: rename ret2 to ret in btrfs_submit_compressed_read()
We can now rename 'ret2' to 'ret' and use it for generic errors.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
a83134b48a btrfs: rename ret to status in btrfs_submit_compressed_read()
We're using 'status' for the blk_status_t variables; rename 'ret' so we
can use it for generic errors.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
79cbc151f9 btrfs: simplify reading bio status in end_compressed_writeback()
We don't need a separate variable to read the bio status; 'ret' works
for that just fine, so remove 'error'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
Christoph Hellwig
8d243aa9a8 btrfs: use bvec_kmap_local() in btrfs_decompress_buf2page()
This removes the last direct poke into bvec internals in btrfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00
Filipe Manana
d846a6d3b0 btrfs: rename remaining exported extent map functions
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:45 +02:00
Filipe Manana
ae98ae2a50 btrfs: rename functions to allocate and free extent maps
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:45 +02:00
Filipe Manana
2e871330ce btrfs: rename extent map functions to get block start, end and check if in tree
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:45 +02:00
Filipe Manana
962162ffa6 btrfs: rename exported extent map compression functions
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:45 +02:00
Filipe Manana
242570e80b btrfs: add btrfs prefix to main lock, try lock and unlock extent functions
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:43 +02:00
Qu Wenruo
2b14b74b99 btrfs: use folio_contains() for EOF detection
Currently we use the following pattern to detect if the folio contains
the end of a file:

	if (folio->index == end_index)
		folio_zero_range();

But that only works if the folio is page sized.

For the following case, it will not work and leave the range beyond EOF
uninitialized:

  The page size is 4K, and the fs block size is also 4K.

	16K        20K   22K   24K
	|          |     |     |
	                 |
	                 EOF at 22K

And we have a large folio sized 8K at file offset 16K.

In that case, the old "folio->index == end_index" will not work, thus
the range [22K, 24K) will not be zeroed out.

Fix the following call sites which use the above pattern:

- add_ra_bio_pages()

- extent_writepage()
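
A hedged sketch of the large-folio-safe pattern described above (the
helper name is illustrative):

  /* Sketch: zero the part of a folio beyond EOF, for any folio size. */
  static void zero_beyond_eof_sketch(struct folio *folio, loff_t isize)
  {
  	pgoff_t end_index = isize >> PAGE_SHIFT;

  	/* Works for large folios too, unlike "folio->index == end_index". */
  	if (folio_contains(folio, end_index)) {
  		size_t offset = offset_in_folio(folio, isize);

  		folio_zero_range(folio, offset, folio_size(folio) - offset);
  	}
  }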

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:43 +02:00
Qu Wenruo
af566bdaff btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()
[BUG WITH EXPERIMENTAL LARGE FOLIOS]
When testing the experimental large data folio support with compression,
there are several ASSERT()s triggered from btrfs_decompress_buf2page()
when running fsstress with compress=zstd mount option:

- ASSERT(copy_len) from btrfs_decompress_buf2page()
- VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page()

[CAUSE]
Inside btrfs_decompress_buf2page(), we need to grab the file offset from
the current bvec.bv_page, to check if we even need to copy data into the
bio.

And since we were using single page bvecs and no large folios, every
page inside the folio had its index properly set up.

But when large folios are involved, only the first page (aka, the head
page) of a large folio has its index properly initialized.

The other pages inside the large folio will not have their indexes
properly initialized.

Thus the page_offset() call inside btrfs_decompress_buf2page() will
return garbage, and completely screw up the @copy_len calculation.

[FIX]
Instead of using page->index directly, go with page_pgoff(), which can
handle non-head pages correctly.

So introduce a helper, file_offset_from_bvec(), to get the file offset
from a single page bio_vec, so the copy_len calculation can be done
correctly.
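
A hedged sketch of such a helper, based on the description above (not
necessarily the exact in-tree code):

  /* Sketch: file offset of a single-page bvec, safe for large folios. */
  static u64 file_offset_from_bvec_sketch(const struct bio_vec *bvec)
  {
  	const struct page *page = bvec->bv_page;
  	const struct folio *folio = page_folio(page);

  	/* page_pgoff() handles tail pages of large folios, unlike page->index. */
  	return ((u64)page_pgoff(folio, page) << PAGE_SHIFT) + bvec->bv_offset;
  }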

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:40 +02:00
Kees Cook
6f9a8ab796 btrfs: compression: adjust cb->compressed_folios allocation type
In preparation for making the kmalloc() family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)

The assigned type is "struct folio **" but the returned type will be
"struct page **". These are the same allocation size (pointer size), but
the types don't match. Adjust the allocation type to match the assignment.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02 13:19:52 +02:00
Daniel Vacek
fc5c0c5825 btrfs: defrag: extend ioctl to accept compression levels
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Daniel Vacek
da798fa519 btrfs: zstd: enable negative compression levels mount option
Allow using the fast modes (negative compression levels) of zstd as a
mount option.

As per the results, the compression ratio is (expectedly) lower:

for level in {-15..-1} 1 2 3; \
do printf "level %3d\n" $level; \
  mount -o compress=zstd:$level /dev/sdb /mnt/test/; \
  grep sdb /proc/mounts; \
  cp -r /usr/bin       /mnt/test/; sync; compsize /mnt/test/bin; \
  cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \
  cp    enwik9         /mnt/test/; sync; compsize /mnt/test/enwik9; \
  cp    linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \
  rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \
  umount /mnt/test/; \
done |& tee results | \
awk '/^level/{print}/^TOTAL/{print$3"\t"$2"  |"}' | paste - - - - -

		266M	bin  |	45M	doc  |	953M	wiki |	1.4G	source
=============================+===============+===============+===============+
level -15	180M	67%  |	30M	68%  |	694M	72%  |	598M	40%  |
level -14	180M	67%  |	30M	67%  |	683M	71%  |	581M	39%  |
level -13	177M	66%  |	29M	66%  |	671M	70%  |	566M	38%  |
level -12	174M	65%  |	29M	65%  |	658M	69%  |	548M	37%  |
level -11	174M	65%  |	28M	64%  |	645M	67%  |	530M	35%  |
level -10	171M	64%  |	28M	62%  |	631M	66%  |	512M	34%  |
level  -9	165M	62%  |	27M	61%  |	615M	64%  |	493M	33%  |
level  -8	161M	60%  |	27M	59%  |	598M	62%  |	475M	32%  |
level  -7	155M	58%  |	26M	58%  |	582M	61%  |	457M	30%  |
level  -6	151M	56%  |	25M	56%  |	565M	59%  |	437M	29%  |
level  -5	145M	54%  |	24M	55%  |	545M	57%  |	417M	28%  |
level  -4	139M	52%  |	23M	52%  |	520M	54%  |	391M	26%  |
level  -3	135M	50%  |	22M	50%  |	495M	51%  |	369M	24%  |
level  -2	127M	47%  |	22M	48%  |	470M	49%  |	349M	23%  |
level  -1	120M	45%  |	21M	47%  |	452M	47%  |	332M	22%  |
level   1	110M	41%  |	17M	39%  |	362M	38%  |	290M	19%  |
level   2	106M	40%  |	17M	38%  |	349M	36%  |	288M	19%  |
level   3	104M	39%  |	16M	37%  |	340M	35%  |	276M	18%  |

The samples represent some data sets that can be commonly found and show
approximate compressibility. The fast levels trade off speed for ratio
and are best suitable for highly compressible data.

As can be seen above, comparing the results to the current default zstd
level 3, the negative levels are roughly 2x worse at -15 and the
ratio increases almost linearly with each level.

Signed-off-by: Daniel Vacek <neelx@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Anand Jain
d07eaa9995 btrfs: use filemap_get_folio() helper
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0), as there is no need for the
extra 0, 0 arguments.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:19 +01:00
Qu Wenruo
0f71202665 btrfs: rename btrfs_folio_(set|start|end)_writer_lock()
Since there is no user of reader locks, rename the writer locks into a
more generic name, by removing the "_writer" part from the name.

And also rename btrfs_subpage::writer into btrfs_subpage::locked.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:18 +01:00
Qu Wenruo
336e69f302 btrfs: unify to use writer locks for subpage locking
Since commit d7172f52e9 ("btrfs: use per-buffer locking for
extent_buffer reading"), metadata read no longer relies on the subpage
reader locking.

This means we do not need to maintain a different metadata/data split
for locking, so we can convert the existing reader lock users by:

- add_ra_bio_pages()
  Convert to btrfs_folio_set_writer_lock()

- end_folio_read()
  Convert to btrfs_folio_end_writer_lock()

- begin_folio_read()
  Convert to btrfs_folio_set_writer_lock()

- folio_range_has_eb()
  Remove the subpage->readers checks, since it is always 0.

- Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:18 +01:00
David Sterba
a9c50c9756 btrfs: drop unused parameter level from alloc_heuristic_ws()
The compression heuristic pass does not need a level, so we can drop the
parameter.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:17 +01:00
David Sterba
3f4b1bc1c0 btrfs: lzo: drop unused parameter level from lzo_alloc_workspace()
The LZO compression has only one level, so we don't need to pass the
parameter.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:16 +01:00
Qu Wenruo
dd5e276254 btrfs: compression: add an ASSERT() to ensure the read-in length is sane
There are already two bugs (one in zlib, one in zstd) where the
compression path did not handle sector size < page size cases well,
resulting in @total_in being larger than the to-be-compressed range
length.

So there is enough reason to add an ASSERT() to make sure the total
read-in length returned by btrfs_compress_folios() does not exceed the
input length.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Li Zetao
aeb6d88148 btrfs: convert btrfs_decompress() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can work
directly on folios without converting to pages.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
b70f3a4546 btrfs: convert zstd_decompress() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
9f9a4e43a8 btrfs: convert lzo_decompress() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
54c78d497b btrfs: convert zlib_decompress() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Josef Bacik
ac325fc2aa btrfs: do not hold the extent lock for entire read
Historically we've held the extent lock throughout the entire read.
There's been a few reasons for this, but it's mostly just caused us
problems.  For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock.  This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.

On the buffered side we are protected by the page lock, as long as we're
reading things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.

On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads.  This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.

Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation.  Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map.  This portion was contributed by Goldwyn.

Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
David Sterba
792e86ef31 btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()
The function name is a bit misleading as it submits the btrfs_bio
(bbio). Rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
Josef Bacik
c808c1dcb1 btrfs: convert add_ra_bio_pages() to use only folios
Willy is going to get rid of page->index, and add_ra_bio_pages uses
page->index.  Make his life easier by converting add_ra_bio_pages to use
folios so that we are no longer using page->index.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Filipe Manana
8e7860543a btrfs: fix extent map use-after-free when adding pages to compressed bio
At add_ra_bio_pages() we are accessing the extent map to calculate
'add_size' after we dropped our reference on the extent map, resulting
in a use-after-free. Fix this by computing 'add_size' before dropping our
extent map reference.

Reported-by: syzbot+853d80cba98ce1157ae6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000038144061c6d18f2@google.com/
Fixes: 6a40491020 ("btrfs: subpage: make add_ra_bio_pages() compatible")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 16:32:22 +02:00
Qu Wenruo
fea91134c2 btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
The function btrfs_alloc_folio_array() is only used in
btrfs_submit_compressed_read(), and that sole caller does not use the
@extra_gfp parameter.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
David Sterba
e2877c2a03 btrfs: pass a btrfs_inode to btrfs_compress_heuristic()
Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an
internal interface, allowing us to remove some uses of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
a1f4e3d7bd btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode
The structure is internal so we should use struct btrfs_inode for that,
allowing us to remove some uses of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
Qu Wenruo
c77a8c6100 btrfs: remove extent_map::block_start member
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.

And this is already validated by validate_extent_map(). Now we can
remove the member.

However there is a special case in btrfs_create_dio_extent() where, for
NOCOW/PREALLOC ordered extents, we cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.

So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
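
A hedged sketch of the derivation described at the top of this change
(the helper name is illustrative):

  /* Sketch: derive the old block_start from the remaining members. */
  static u64 em_block_start_sketch(const struct extent_map *em)
  {
  	/* Holes and inline extents store a special value in disk_bytenr. */
  	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE)
  		return em->disk_bytenr;
  	/* Compressed extents point at the start of the compressed disk extent. */
  	if (extent_map_is_compressed(em))
  		return em->disk_bytenr;
  	/* Regular extents: disk extent start plus the offset into it. */
  	return em->disk_bytenr + em->offset;
  }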

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
e28b851ed9 btrfs: remove extent_map::block_len member
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).

Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.

Most call sites can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have already
checked whether the extent is compressed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Qu Wenruo
4aa7b5d178 btrfs: remove extent_map::orig_start member
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.

And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Filipe Manana
416d6ab49d btrfs: fix misspelled end IO compression callbacks
Fix typo in the end IO compression callbacks, from "comprssed" to
"compressed".

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:16 +02:00
Josef Bacik
e094f48040 btrfs: change root->root_key.objectid to btrfs_root_id()
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes the code easier to read.

The changes were made with the following Coccinelle semantic patch:

// <smpl>
@@
expression E,E1;
@@
(
 E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00