linux

mirror of https://github.com/torvalds/linux.git synced 2025-11-02 17:49:03 +02:00

Author	SHA1	Message	Date
David Sterba	cc53bd2085	btrfs: add unlikely annotations to branches leading to EIO The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
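A minimal userspace sketch of the annotation described above. The kernel ships its own `unlikely()` macro; the stand-in below relies on the GCC/Clang builtin `__builtin_expect`, and `read_block()` is a hypothetical function invented for illustration, not btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the kernel macro; relies on a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical reader: fails with -EIO when the source buffer is missing. */
static int read_block(const unsigned char *dev, size_t len, unsigned char *out)
{
	/* Error branch marked as rare so the compiler may move it off the hot path. */
	if (unlikely(dev == NULL))
		return -EIO;

	for (size_t i = 0; i < len; i++)
		out[i] = dev[i];
	return 0;
}
```

The hint changes code layout only; the function returns the same values with or without it.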
Qu Wenruo	5fbaae4b85	btrfs: prepare scrub to support bs > ps cases This involves: - Migrate scrub_stripe::pages[] to folios[] - Use btrfs_alloc_folio_array() and folio_put() to alloc above array. - Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use folio interfaces - Migrate raid56_parity_cache_data_pages() to raid56_parity_cache_data_folios() Since scrub is the only caller still using pages. This helper will copy the folio array contents into rbio::stripe_pages, with sector uptodate flags updated. And a new ASSERT() to make sure bs > ps cases will not hit this path. Since most scrub code is based on kaddr/paddr, the migration itself is pretty straightforward. And since we're here, also move the loop to set the stripe_sectors[].uptodate out of the copy loop. As we always mark all the sectors as uptodate for the data stripe, it's easier to do in one go, other than doing it inside the copy loop. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	7425a28940	btrfs: introduce btrfs_bio_for_each_block_all() helper Currently if we want to iterate all blocks inside a bio, we do something like this: bio_for_each_segment_all(bvec, bio, iter_all) { for (off = 0; off < bvec->bv_len; off += sectorsize) { /* Iterate blocks using bv + off */ } } That's fine for now, but it will not handle future bs > ps, as bio_for_each_segment_all() is a single-page iterator that always returns a bvec no larger than a page. But for bs > ps cases, we need a full folio (which covers at least one block) so that we can work on the block. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block_all() This helper will create a local bvec_iter, which has the size of the target bio. Then grab the current physical address of the current location, then advance the iterator by block size. - Use btrfs_bio_for_each_block_all() to replace existing call sites Including: * set_bio_pages_uptodate() in raid56 * verify_bio_data_sectors() in raid56 Both result in much easier to read code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	9afc617265	btrfs: introduce btrfs_bio_for_each_block() helper Currently if we want to iterate a bio in block unit, we do something like this: while (iter->bi_size) { struct bio_vec bv = bio_iter_iovec(); /* Do something using the bv */ bio_advance_iter_single(&bbio->bio, iter, sectorsize); } That's fine for now, but it will not handle future bs > ps, as bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not exceed page size. This means the code using that bv can only handle a block if bs <= ps. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block() Instead of bio_vec, which has single and multiple page versions and whose multiple page version has quite some limits, use my favorite way to represent a block, phys_addr_t. For bs <= ps cases, nothing is changed, except we add a very small overhead to convert phys_addr_t to a folio, then use the proper folio helpers to handle the possible highmem cases. For bs > ps cases, all blocks will be backed by large folios, meaning every folio will cover at least one block. And still use proper folio helpers to handle highmem cases. With phys_addr_t, we will handle both large folio and highmem properly. So there is no better single variable to represent a btrfs block than phys_addr_t. - Extract the data block csum calculation into a helper The new helper, btrfs_calculate_block_csum() will be utilized by btrfs_csum_one_bio(). - Use btrfs_bio_for_each_block() to replace existing call sites Including: * index_one_bio() from raid56.c Very straight-forward. * btrfs_check_read_bio() Also update repair_one_sector() to grab the folio using phys_addr_t, and do extra checks to make sure the folio covers at least one block. We do not need to bother with bv_len at all now. * btrfs_csum_one_bio() Now we can move the highmem handling into a dedicated helper, calculate_block_csum(), and use btrfs_bio_for_each_block() helper.
There is one exception in btrfs_decompress_buf2page(), which is copying decompressed data into the original bio, which is not iterating using block size thus we don't need to bother. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
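The block-wise iteration described in the two helper commits above can be sketched in plain C. This is a userspace illustration under stated assumptions, not the btrfs macro: `struct segment`, `struct block_iter`, `block_iter_advance()` and `block_iter_next()` are hypothetical names, and a segment simply stands in for one single-page piece of a bio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one single-page bio segment. */
struct segment {
	uint64_t paddr;	/* physical address of the first byte */
	uint32_t len;	/* length in bytes */
};

/* Hypothetical cursor over an array of segments. */
struct block_iter {
	const struct segment *segs;
	size_t nr_segs;
	size_t idx;	/* current segment */
	uint32_t done;	/* bytes already consumed in the current segment */
};

/* Advance the cursor by @bytes, crossing segment boundaries as needed. */
static void block_iter_advance(struct block_iter *it, uint32_t bytes)
{
	while (bytes && it->idx < it->nr_segs) {
		uint32_t left = it->segs[it->idx].len - it->done;
		uint32_t step = bytes < left ? bytes : left;

		it->done += step;
		bytes -= step;
		if (it->done == it->segs[it->idx].len) {
			it->idx++;
			it->done = 0;
		}
	}
}

/*
 * Store the physical address of the current block in @paddr, then step
 * forward by @blocksize. Returns 0 once the segments are exhausted.
 */
static int block_iter_next(struct block_iter *it, uint32_t blocksize,
			   uint64_t *paddr)
{
	if (it->idx >= it->nr_segs)
		return 0;
	*paddr = it->segs[it->idx].paddr + it->done;
	block_iter_advance(it, blocksize);
	return 1;
}
```

With two physically contiguous 4 KiB segments, a 2 KiB block size yields four block addresses, while an 8 KiB block size yields one address that spans both segments. The second case is the one a per-page bvec cannot express.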
Qu Wenruo	35aff706dc	btrfs: concentrate highmem handling for data verification Currently for btrfs checksum verification, we do it in the following pattern: kaddr = kmap_local_*(); ret = btrfs_check_csum_csum(kaddr); kunmap_local(kaddr); It's OK for now, but it's still not following the patterns of helpers inside linux/highmem.h, which never requires a virt memory address. In those highmem helpers, they mostly accept a folio, some offset/length inside the folio, and in the implementation they check if the folio needs partial kmap, and do the handling. Inspired by those formal highmem helpers, enhance the highmem handling of data checksum verification by: - Rename btrfs_check_sector_csum() to btrfs_check_block_csum() To follow the more common term "block" used in all other major filesystems. - Pass a physical address into btrfs_check_block_csum() and btrfs_data_csum_ok() The physical address is always available even for a highmem page. Since it's page frame number << PAGE_SHIFT + offset in page. And with that physical address, we can grab the folio covering the page, and do extra checks to ensure it covers at least one block. This also allows us to do the kmap inside btrfs_check_block_csum(). This means all the extra HIGHMEM handling will be concentrated into btrfs_check_block_csum(), and no callers will need to bother highmem by themselves. - Properly zero out the block if csum mismatch Since btrfs_data_csum_ok() only got a paddr, we can not and should not use memzero_bvec(), which only accepts single page bvec. Instead use paddr to grab the folio and call folio_zero_range() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
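The address arithmetic mentioned in this commit (page frame number << PAGE_SHIFT + offset in page) can be illustrated as follows. This is a sketch assuming 4 KiB pages; the `SKETCH_*` constants and the three helper names are hypothetical and chosen to avoid clashing with real kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Combine a page frame number and an in-page offset into one address. */
static uint64_t paddr_from_page(uint64_t pfn, uint32_t offset)
{
	return (pfn << SKETCH_PAGE_SHIFT) + offset;
}

/* Recover the page frame number from a physical address. */
static uint64_t paddr_to_pfn(uint64_t paddr)
{
	return paddr >> SKETCH_PAGE_SHIFT;
}

/* Recover the offset inside the page from a physical address. */
static uint32_t paddr_to_offset(uint64_t paddr)
{
	return (uint32_t)(paddr & (SKETCH_PAGE_SIZE - 1));
}
```

Because both halves can be recovered from the single value, a callee that receives only the address can still locate the page and the offset and do the mapping itself.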
Filipe Manana	c5d12d5b62	btrfs: raid56: use list_last_entry() at cache_rbio() Instead of using list_entry() against the list's prev entry, use list_last_entry(), which removes the need to know the last member is accessed through the prev list pointer and the naming makes it easier to reason about what we are doing. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
David Sterba	c779b7980c	btrfs: raid56: rename parameter err to status in endio helpers Trivial renames to unify the naming of blk_status_t variables/parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	ae8ce87165	btrfs: drop redundant local variable in raid_wait_write_end_io() The bio status is read only once, no variable needed for that. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	05a6ec865d	btrfs: use unsigned types for constants defined as bit shifts The unsigned type is a recommended practice (CWE-190, CWE-194) for bit shifts to avoid problems with potential unwanted sign extensions. Although there are no such cases in btrfs codebase, follow the recommendation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
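The sign-extension hazard that motivates this recommendation can be shown in a small sketch. It illustrates the general C behaviour only; as the commit notes, no such case exists in btrfs. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A signed 32-bit value with only the top bit set, widened to 64 bits. */
static uint64_t widen_signed_flag(void)
{
	int32_t flag = INT32_MIN;

	return (uint64_t)flag;	/* sign-extends: the upper 32 bits become ones */
}

/* The same bit built from an unsigned constant, widened to 64 bits. */
static uint64_t widen_unsigned_flag(void)
{
	uint32_t flag = UINT32_C(1) << 31;

	return (uint64_t)flag;	/* zero-extends: the upper 32 bits stay zero */
}
```

The signed variant starts from `INT32_MIN` rather than writing `1 << 31`, because shifting a one into the sign bit of a signed int is itself undefined behaviour in C.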
David Sterba	2d44a15afd	btrfs: use list_first_entry() everywhere Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
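For readers unfamiliar with the list helpers, here is a simplified userspace re-creation of `list_first_entry()` and its counterpart `list_last_entry()`. The kernel versions live in `include/linux/list.h` and use a type-checked `container_of()`; the macros below are reduced sketches, and `list_init()` and `struct item` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list, in the style of the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified container_of(): no type checking, unlike the kernel macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* First entry is reached through head->next, last through head->prev. */
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)
#define list_last_entry(head, type, member) \
	container_of((head)->prev, type, member)

/* Make @head an empty list pointing at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link @node in just before @head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical payload type embedding the list linkage. */
struct item {
	int id;
	struct list_head link;
};
```

The named helpers hide which pointer (`next` or `prev`) is followed, which is the readability gain these cleanups are after.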
Qu Wenruo	cd678925e9	btrfs: raid56: store a physical address in structure sector_ptr Instead of using a @page + @pg_offset pair inside sector_ptr structure, use a single physical address instead. This allows us to grab both the page and offset from a single u64 value. Although we still need an extra bool value, @has_paddr, to distinguish if the sector is properly mapped (as the 0 physical address is totally valid). This change doesn't change the size of structure sector_ptr, but reduces the parameters of several functions. Note: the original idea and patch is from Christoph Hellwig (https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/) but the final implementation is different. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [ Use physical addresses instead to handle highmem. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
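A sketch of why the extra flag is needed. The layout below is an assumption made for illustration, not a copy of the kernel's `struct sector_ptr`; `struct sector_ptr_sketch` and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: one address plus a flag saying whether it is set. */
struct sector_ptr_sketch {
	uint64_t paddr;		/* physical address of the sector */
	bool has_paddr;		/* needed because paddr == 0 is a valid address */
	bool uptodate;
};

/* Record a physical address and mark the sector as mapped. */
static void sector_set_paddr(struct sector_ptr_sketch *s, uint64_t paddr)
{
	s->paddr = paddr;
	s->has_paddr = true;
}

/* A sector is mapped only if the flag says so, whatever paddr holds. */
static bool sector_is_mapped(const struct sector_ptr_sketch *s)
{
	return s->has_paddr;
}
```

Using address 0 as the "unmapped" marker would misreport a sector that really sits at physical address 0, so the mapped state is tracked separately.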
Christoph Hellwig	6f3f722df7	btrfs: simplify bvec iteration in index_one_bio() Flatten the two loops by open coding bio_for_each_segment() and advancing the iterator one sector at a time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Fix a bug that @offset is not increased. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	959ddf2839	btrfs: move kmapping out of btrfs_check_sector_csum() Move kmapping the page out of btrfs_check_sector_csum(). This allows using bvec_kmap_local() where suitable and reduces the number of kmap*() calls in the raid56 code. This also means btrfs_check_sector_csum() will only accept a properly kmapped address. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Qu Wenruo	c186345a6b	btrfs: make assert_rbio() to only check CONFIG_BTRFS_ASSERT According to the description, CONFIG_BTRFS_DEBUG is only for extra debug info, meanwhile sanity checks should be managed by CONFIG_BTRFS_ASSERT. There is no need to check both to enable assert_rbio(). Just remove the check for CONFIG_BTRFS_DEBUG. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	0fbf6cbd72	btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array() There is only one caller utilizing the @extra_gfp parameter, alloc_eb_folio_array(). And in that case the extra_gfp is only assigned to __GFP_NOFAIL. Rename the @extra_gfp parameter to @nofail to indicate that. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-11 15:33:30 +02:00
Qu Wenruo	bbbee460aa	btrfs: raid56: do extra dumping for CONFIG_BTRFS_ASSERT There are several hard-to-hit ASSERT()s hit inside raid56. Unfortunately the ASSERT() expression is a little complex, and except the ASSERT(), there is nothing to provide any clue. Considering if race is involved, it's pretty hard to reproduce. Meanwhile sometimes the dump of the rbio structure can provide some pretty good clues, it's worth to do the extra multi-line dump for btrfs raid56 related code. The dump looks like this: BTRFS critical (device dm-3): bioc logical=4598530048 full_stripe=4598530048 size=0 map_type=0x81 mirror=0 replace_nr_stripes=0 replace_stripe_src=-1 num_stripes=5 BTRFS critical (device dm-3): nr=0 devid=1 physical=1166147584 BTRFS critical (device dm-3): nr=1 devid=2 physical=1145176064 BTRFS critical (device dm-3): nr=2 devid=4 physical=1145176064 BTRFS critical (device dm-3): nr=3 devid=5 physical=1145176064 BTRFS critical (device dm-3): nr=4 devid=3 physical=1145176064 BTRFS critical (device dm-3): rbio flags=0x0 nr_sectors=80 nr_data=4 real_stripes=5 stripe_nsectors=16 scrubp=0 dbitmap=0x0 BTRFS critical (device dm-3): logical=4598530048 assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1702 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-11 15:33:16 +02:00
Christoph Hellwig	fa1af65bf8	btrfs use bio_list_merge_init Use bio_list_merge_init instead of open coding bio_list_merge and bio_list_init. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240328084147.2954434-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:37 -06:00
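What "merge, then re-initialize the source" amounts to can be sketched with a plain singly linked list. The types and function names below are hypothetical userspace stand-ins, not the block layer's `struct bio_list` API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a bio and a head/tail bio list. */
struct bio_sketch {
	int id;
	struct bio_sketch *next;
};

struct bio_list_sketch {
	struct bio_sketch *head, *tail;
};

/* Reset a list to empty. */
static void sketch_list_init(struct bio_list_sketch *bl)
{
	bl->head = NULL;
	bl->tail = NULL;
}

/* Append one element at the tail. */
static void sketch_list_add(struct bio_list_sketch *bl, struct bio_sketch *bio)
{
	bio->next = NULL;
	if (bl->tail)
		bl->tail->next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

/* Append all of @src to @dst, then leave @src empty and reusable. */
static void sketch_list_merge_init(struct bio_list_sketch *dst,
				   struct bio_list_sketch *src)
{
	if (src->head) {
		if (dst->tail)
			dst->tail->next = src->head;
		else
			dst->head = src->head;
		dst->tail = src->tail;
	}
	sketch_list_init(src);
}
```

Folding both steps into one call removes the chance of forgetting the re-initialization and leaving the source list pointing at elements it no longer owns.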
Qu Wenruo	b2324e08b8	btrfs: raid56: extra debugging for raid6 syndrome generation [BUG] I have got at least two crash reports for RAID6 syndrome generation, no matter if it's AVX2 or SSE2, they all seem to have a similar calltrace with corrupted RAX: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI Workqueue: btrfs-rmw rmw_rbio_work [btrfs] RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248 RDX: 0000000000000000 RSI: ffffa0f74cfa3238 RDI: 0000000000000000 Call Trace: <TASK> rmw_rbio+0x5c8/0xa80 [btrfs] process_one_work+0x1c7/0x3d0 worker_thread+0x4d/0x380 kthread+0xf3/0x120 ret_from_fork+0x2c/0x50 </TASK> [CAUSE] The cause is not known. Recently I also hit this in the AVX512 path, and that's even in a v5.15 backport, which doesn't have any of my RAID56 rework. Furthermore according to the registers: RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248 The RAX register is showing the number of stripes (including PQ), which is not correct (0). But the remaining two registers are all sane. - RBX is the sectorsize For x86_64 it should always be 4K and matches the output. - RCX is the pointers array Which is from rbio->finish_pointers, and it looks like a sane kernel address. [WORKAROUND] For now, I can only add extra debug ASSERT()s before we call the raid6 gen_syndrome() helper and hope to catch the problem. The debug requires both CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT enabled. My current guess is some use-after-free, but every report having only a corrupted RAX alongside seemingly valid pointers doesn't make much sense. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-03-04 16:24:52 +01:00
David Sterba	2b712e3bb2	btrfs: remove unused included headers With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-03-04 16:24:46 +01:00
Qu Wenruo	09e6cef19c	btrfs: refactor alloc_extent_buffer() to allocate-then-attach method Currently alloc_extent_buffer() utilizes find_or_create_page() to allocate one page at a time for an extent buffer. This method has the following disadvantages: - find_or_create_page() is the legacy way of allocating new pages With the new folio infrastructure, find_or_create_page() is just redirected to filemap_get_folio(). - Lacks the way to support higher order (order >= 1) folios As we can not yet let filemap give us a higher order folio. This patch would change the workflow in the following way: Old: for (i = 0; i < num_pages; i++) { p = find_or_create_page(); /* Attach page private */ /* Reused eb if needed */ } New: ret = btrfs_alloc_page_array(); for (i = 0; i < num_pages; i++) { ret = filemap_add_folio(); /* Reuse page cache if needed */ /* Attach page private and reuse eb if needed */ } By this we split the page allocation and private attaching into two parts, allowing future updates to each part more easily, and migrate to folio interfaces (especially for possible higher order folios). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-12-15 23:01:04 +01:00
David Sterba	a0df0a2680	btrfs: raid56: remove unused btrfs_plug_cb::work The raid56 changes in 6.2 reworked the IO path to RMW, commit `93723095b5` ("btrfs: raid56: switch write path to rmw_rbio()") in particular removed the last use of the work member so it can be removed as well. This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-12-15 20:27:01 +01:00
Qu Wenruo	3c771c1944	btrfs: scrub: avoid unnecessary csum tree search preparing stripes One of the bottleneck of the new scrub code is the extra csum tree search. The old code would only do the csum tree search for each scrub bio, which can be as large as 512KiB, thus they can afford to allocate a new path each time. But the new scrub code is doing csum tree search for each stripe, which is only 64KiB, this means we'd better re-use the same csum path during each search. This patch would introduce a per-sctx path for csum tree search, as we don't need to re-allocate the path every time we need to do a csum tree search. With this change we can further improve the queue depth and improve the scrub read performance: Before (with regression and cached extent tree path): Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00 After (with both cached extent/csum tree path): nvme0n1p3 17759.00 1133280.00 10.00 0.06 0.08 63.81 1.50 100.00 Fixes: `e02ee89baa` ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:48 +02:00
Qu Wenruo	dbb6ecb328	btrfs: tracepoints: simplify raid56 events After commit `6bfd0133be` ("btrfs: raid56: switch scrub path to use a single function"), the raid56 implementation no longer uses different endio functions for RMW/recover/scrub. All read operations end in submit_read_wait_bio_list(), while all write operations end in submit_write_bios(). This means quite some trace events are out-of-date and no longer utilized. This patch would unify the trace events into just two: - trace_raid56_read() Replaces trace_raid56_read_partial(), trace_raid56_scrub_read() and trace_raid56_scrub_read_recover(). - trace_raid56_write() Replaces trace_raid56_write_stripe() and trace_raid56_scrub_write_stripe(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Qu Wenruo	3a3c7a7f65	btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSING Commit `aca43fe839` ("btrfs: remove unused raid56 functions which were dedicated for scrub") removed the special handling of RAID56 scrub for missing device. As scrub goes full mirror_num based recovery, that means if it hits a missing device in RAID56, it would just try the next mirror, which would go through the BTRFS_RBIO_READ_REBUILD operation. This means there is no longer any use of BTRFS_RBIO_REBUILD_MISSING operation and we can safely remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Qu Wenruo	486c737f7f	btrfs: raid56: always verify the P/Q contents for scrub [REGRESSION] Commit `75b4703329` ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") changed the behavior of scrub_rbio(). Initially if we have no error reading the raid bio, we will assign @need_check to true, then finish_parity_scrub() would later verify the content of P/Q stripes before writeback. But after that commit we never verify the content of P/Q stripes and just writeback them. This can lead to unrepaired P/Q stripes during scrub, or already corrupted P/Q copied to the dev-replace target. [FIX] The situation is more complex than the regression, in fact the initial behavior is not 100% correct either. If we have the following rare case, it can still lead to the same problem using the old behavior: 0 16K 32K 48K 64K Data 1: \|IIIIIII\| \| Data 2: \| \| Parity: \| \|CCCCCCC\| \| Where "I" means IO error, "C" means corruption. In the above case, we're scrubbing the parity stripe, then read out all the contents of Data 1, Data 2, Parity stripes. But found IO error in Data 1, which leads to rebuild using Data 2 and Parity and got the correct data. In that case, we would not verify if the Parity is correct for range [16K, 32K). So here we have to always verify the content of Parity no matter if we did recovery or not. This patch would remove the @need_check parameter of finish_parity_scrub() completely, and would always do the P/Q verification before writeback. Fixes: `75b4703329` ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") CC: stable@vger.kernel.org # 6.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-07-18 03:13:15 +02:00
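The verification step can be illustrated for the P stripe alone, which is the byte-wise XOR of the data stripes. This is a simplified sketch, not the btrfs implementation: the Q stripe of RAID6 needs Galois-field arithmetic and is left out, and `parity_matches()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true when @parity equals the byte-wise XOR of all @nr_data data
 * stripes over @len bytes. Covers P only; Q is out of scope here.
 */
static bool parity_matches(const uint8_t *const *data, size_t nr_data,
			   const uint8_t *parity, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t expected = 0;

		for (size_t d = 0; d < nr_data; d++)
			expected ^= data[d][i];
		if (expected != parity[i])
			return false;
	}
	return true;
}
```

In the scenario from the commit message, Data 1 is rebuilt from Data 2 and the on-disk parity, so comparing the recomputed parity against the on-disk one over the whole range is the only way to notice the corrupted [16K, 32K) part.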
Qu Wenruo	94ead93e63	btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read For P/Q stripe scrub, we have quite some duplicated read IO: - Data stripes read for verification This is triggered by the scrub_submit_initial_read() inside scrub_raid56_parity_stripe(). - Data stripes read (again) for P/Q stripe verification This is triggered by scrub_assemble_read_bios() from scrub_rbio(). Although we can have hit rbio cache and avoid unnecessary read, the chance is very low, as scrub would easily flush the whole rbio cache. This means, even we're just scrubbing a single P/Q stripe, we would read the data stripes twice for the best case scenario. If we need to recover some data stripes, it would cause more reads on the same data stripes, again and again. However before we call raid56_parity_submit_scrub_rbio() we already have all data stripes repaired and their contents ready to use. But RAID56 cache is unaware about the scrub cache, thus RAID56 layer itself still needs to re-read the data stripes. To avoid such cache miss, this patch would: - Introduce a new helper, raid56_parity_cache_data_pages() This function would grab the pages from an array, and copy the content to the rbio, marking all the involved sectors uptodate. The page copy is unavoidable because of the cache pages of rbio are all self managed, thus can not utilize outside pages without screwing up the lifespan. - Use the repaired data stripes as cache inside scrub_raid56_parity_stripe() By this, we ensure all the data sectors of the scrub rbio are already uptodate, and no need to read them again from disk. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:24 +02:00
Anand Jain	adbe7e388e	btrfs: use SECTOR_SHIFT to convert LBA to physical offset Using SECTOR_SHIFT to convert LBA to physical address makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:23 +02:00
Anand Jain	29e70be261	btrfs: use SECTOR_SHIFT to convert physical offset to LBA Use SECTOR_SHIFT while converting a physical address to an LBA, makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:23 +02:00
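Both conversions from the two SECTOR_SHIFT commits above come down to a shift by 9 bits, since a sector is 512 bytes. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, matching the kernel's definition */

/* Convert a sector number (LBA) to a byte offset. */
static uint64_t lba_to_offset(uint64_t lba)
{
	return lba << SECTOR_SHIFT;
}

/* Convert a byte offset to the sector number (LBA) containing it. */
static uint64_t offset_to_lba(uint64_t offset)
{
	return offset >> SECTOR_SHIFT;
}
```

The right shift rounds down, so any offset inside a sector maps to that sector's LBA.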
Qu Wenruo	aca43fe839	btrfs: remove unused raid56 functions which were dedicated for scrub Since the scrub rework, the following RAID56 functions are no longer called: - raid56_add_scrub_pages() - raid56_alloc_missing_rbio() - raid56_submit_missing_rbio() Those functions are all utilized by scrub to handle missing device cases for RAID56. However the new scrub code handle them in a completely different way: - If it's data stripe, go recovery path through btrfs_submit_bio() - If it's P/Q stripe, it would be handled through raid56_parity_submit_scrub_rbio() And that function would handle dev-replace and repair properly. Thus we can safely remove those functions. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 19:52:18 +02:00
Qu Wenruo	b979547513	btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe The new helper will search the extent tree to find the first extent of a logical range, then fill the sectors array by two loops: - Loop 1 to fill common bits and metadata generation - Loop 2 to fill csum data (only for data bgs) This loop will use the new btrfs_lookup_csums_bitmap() to fill the full csum buffer, and set scrub_sector_verification::csum. With all the needed info filled by this function, later we only need to submit and verify the stripe. Here we temporarily export the helper to avoid warning on unused static function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:23 +02:00
Johannes Thumshirn	cf32e41fa5	btrfs: use __bio_add_page to add a single page in rbio_add_io_sector The btrfs raid56 sector submission code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked. Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed. This brings us a step closer to marking bio_add_page() as __must_check. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:20 +02:00
Qu Wenruo	18d758a2d8	btrfs: replace btrfs_io_context::raid_map with a fixed u64 value In btrfs_io_context structure, we have a pointer raid_map, which indicates the logical bytenr for each stripe. But considering we always call sort_parity_stripes(), the result raid_map[] is always sorted, thus raid_map[0] is always the logical bytenr of the full stripe. So why we waste the space and time (for sorting) for raid_map? This patch will replace btrfs_io_context::raid_map with a single u64 number, full_stripe_start, by: - Replace btrfs_io_context::raid_map with full_stripe_start - Replace call sites using raid_map[0] to use full_stripe_start - Replace call sites using raid_map[i] to compare with nr_data_stripes. The benefits are: - Less memory wasted on raid_map It's sizeof(u64) * num_stripes vs sizeof(u64). It'll always save at least one u64, and the benefit grows larger with num_stripes. - No more weird alloc_btrfs_io_context() behavior As there is only one fixed size + one variable length array. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:14 +02:00
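A sketch of how a single `full_stripe_start` value can replace the per-stripe array. It assumes a 64 KiB stripe length and that data stripes come first, followed by parity, as the commit describes for the sorted layout. `SKETCH_STRIPE_LEN` and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_STRIPE_LEN (UINT64_C(64) * 1024)	/* assumption: 64 KiB stripes */

/* Stripes below nr_data_stripes hold data; the rest hold P/Q parity. */
static bool stripe_is_data(unsigned int stripe_nr, unsigned int nr_data_stripes)
{
	return stripe_nr < nr_data_stripes;
}

/* Logical start of data stripe @stripe_nr inside the full stripe. */
static uint64_t data_stripe_logical(uint64_t full_stripe_start,
				    unsigned int stripe_nr)
{
	return full_stripe_start + (uint64_t)stripe_nr * SKETCH_STRIPE_LEN;
}
```

Everything the old array stored for data stripes is derivable from the start address and the stripe index, which is why one u64 is enough.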
Qu Wenruo	1faf388506	btrfs: use an efficient way to represent source of duplicated stripes For btrfs dev-replace, we have to duplicate writes to the source device into the target device. For non-RAID56, all writes into the same mapped ranges are sharing the same content, thus they don't really need to bother anything. (E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the same write to all involved devices). But for RAID56, all stripes contain different content, thus we must have a clear mapping of which stripe is duplicated from which original stripe. Currently we use a complex way using tgtdev_map[] array, e.g: num_tgtdevs = 1 tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace. tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace, and it's duplicated to stripes[3]. tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace. But this is wasting some space, and ignores one important thing for dev-replace, there is at most one running replace. Thus we can change it to a fixed array to represent the mapping: replace_nr_stripes = 1 replace_stripe_src = 1 <- Means stripes[1] is involved in replace. thus the extra stripe is a copy of stripes[1] By this we can save some space for bioc on RAID56 chunks with many devices. And we get rid of one variable sized array from bioc. Thus the patch involves the following changes: - Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes and @replace_stripe_src. @num_tgtdevs is just renamed to @replace_nr_stripes. While the mapping is completely changed. - Add extra ASSERT()s for RAID56 code - Only add two more extra stripes for dev-replace cases. As we have an upper limit on how many dev-replace stripes we can have. - Unify the behavior of handle_ops_on_dev_replace() Previously handle_ops_on_dev_replace() go two different paths for WRITE and GET_READ_MIRRORS. 
Now unify them by always going the WRITE path first (with at most 2 replace stripes), then if we're doing GET_READ_MIRRORS and we have 2 extra stripes, just drop one stripe. - Remove the @real_stripes argument from alloc_btrfs_io_context() As we don't need the old variable length array any more. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:14 +02:00
Christoph Hellwig	74cc3600e8	btrfs: raid56: no need for irqsafe locking These days all the operations that take locks in the raid56.c code are run from user context (mostly workqueues). Drop all the irqsafe locking that is not required any more. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Christoph Hellwig	08241d3c74	btrfs: raid56: handle endio in scrub_rbio The only caller of scrub_rbio calls rbio_orig_end_io right after it, move it into scrub_rbio to match the other work item helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:55 +01:00
Christoph Hellwig	40f87ddb5d	btrfs: raid56: handle endio in recover_rbio Both callers of recover_rbio call rbio_orig_end_io right after it, so move the call into the shared function. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	1d0ef1ca11	btrfs: raid56: handle endio in rmw_rbio Both callers of rmw_rbio call rbio_orig_end_io right after it, so move the call into the shared function. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	52f0c19864	btrfs: raid56: submit the read bios from scrub_assemble_read_bios Instead of filling in a bio_list and submitting the bios in the only caller, do that in scrub_assemble_read_bios. This removes the need to pass the bio_list, and also makes it clear that the extra bio_list cleanup in the caller is entirely pointless. Rename the function to scrub_read_bios to make it clear that the bios are not only assembled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	02efa3a6ba	btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios There is very little extra code in rmw_read_bios, and a large part of it is the superfluous extra cleanup of the bio list. Merge the two functions, and only clean up the bio list after it has been added to but before it has been emptied again by submit_read_wait_bio_list. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	d838d05ea5	btrfs: raid56: fold recover_assemble_read_bios into recover_rbio There is very little extra code in recover_rbio, and a large part of it is the superfluous extra cleanup of the bio list. Merge the two functions, and only clean up the bio list after it has been added to but before it has been emptied again by submit_read_wait_bio_list. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	801fcfc5d7	btrfs: raid56: add a bio_list_put helper Add a helper to put all bios in a list. This does not need to be added to block layer as there are no other users of such code. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	1c76fb7b31	btrfs: raid56: wait for I/O completion in submit_read_bios In addition to setting up the end_io handler and submitting the bios in submit_read_bios, also wait for them to be completed instead of waiting for the completion manually in all three callers. Rename submit_read_bios to submit_read_wait_bio_list to make it clear it waits for the bios as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	4d7627010b	btrfs: raid56: simplify code flow in rmw_rbio Remove the write goto label by moving the data page allocation and data read into the branch. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Christoph Hellwig	abb49e8742	btrfs: raid56: simplify error handling and code flow in raid56_parity_write Handle the error return on alloc_rbio failure directly instead of using a goto and remove the queue_rbio goto label by moving the plugged check into the if branch. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Qu Wenruo	c9a43aaf09	btrfs: raid56: reduce overhead to calculate the bio length In rbio_update_error_bitmap(), we need to calculate the length of the rbio. Since it's called in the endio function, we cannot directly grab the length from bi_iter. Currently we call bio_for_each_segment_all(), which will always return a range inside a page. But that's not necessary as we don't really care about anything inside the page. So use bio_for_each_bvec_all(), which can return a bvec across multiple contiguous pages, thus reducing the number of loop iterations. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:50 +01:00
Colin Ian King	67da05b3f2	btrfs: fix spelling mistakes found using codespell There are quite a few spelling mistakes as found using codespell. Fix them. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:50 +01:00
Qu Wenruo	a9ad4d87aa	btrfs: raid56: make error_bitmap update atomic In the rework of raid56 code, there is very limited concurrency in the endio context. Most of the work is done inside the sectors arrays, where different bios will never touch the same sector. But there is concurrency here for error_bitmap. Both read and write endio functions need to touch it, and we can have multiple write bios touching the same error bitmap if they all hit some errors. Here we fix the unprotected bitmap operation by calling set_bit() in a loop. Since we have a very small ceiling on the number of sectors (at most 16 sectors), such set_bit() in a loop should be very acceptable. Fixes: `2942a50dea` ("btrfs: raid56: introduce btrfs_raid_bio::error_bitmap") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-01-27 14:57:10 +01:00
Tanmay Bhushan	f7c11affde	btrfs: raid56: fix stripes if vertical errors are found We take two stripe numbers if vertical errors are found. In case it is just a pstripe it does not matter but in case of raid 6 it matters as both stripes need to be fixed. Fixes: `7a31507230` ("btrfs: raid56: do data csum verification during RMW cycle") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tanmay Bhushan <007047221b@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-01-25 20:11:07 +01:00
Josef Bacik	e7fc357ec0	btrfs: scrub: fix uninitialized return value in recover_scrub_rbio Commit `75b4703329` ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") introduced an uninitialized return variable. This can be caught by gcc 12.1 by -Wmaybe-uninitialized: CC [M] fs/btrfs/raid56.o fs/btrfs/raid56.c: In function ‘scrub_rbio’: fs/btrfs/raid56.c:2801:15: warning: ‘ret’ may be used uninitialized [-Wmaybe-uninitialized] 2801 \| ret = recover_scrub_rbio(rbio); \| ^~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/raid56.c:2649:13: note: ‘ret’ was declared here 2649 \| int ret; The warning is disabled by default so we haven't caught that. Due to the bug the raid56 scrub fstests have been failing since the patch was merged, so initialize that. Fixes: `75b4703329` ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-20 19:43:45 +01:00
Qu Wenruo	7a31507230	btrfs: raid56: do data csum verification during RMW cycle [BUG] For the following small script, btrfs will be unable to recover the content of file1: mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1 md5sum $mnt/file1 umount $mnt # Corrupt the above 64K data stripe. xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3 mount $dev1 $mnt # Write a new 64K, which should be in the other data stripe # And this is a sub-stripe write, which will cause RMW xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2 md5sum $mnt/file1 umount $mnt Above md5sum would fail. [CAUSE] There is a long existing problem for raid56 (not limited to btrfs raid56) that, if we already have some corrupted on-disk data, and then trigger a sub-stripe write (which needs RMW cycle), it can cause further damage to the P/Q stripe. Disk 1: data 1 \|0x000000000000\| <- Corrupted Disk 2: data 2 \|0x000000000000\| Disk 3: parity \|0xffffffffffff\| In above case, data 1 is already corrupted, the original data should be 64KiB of 0xff. At this stage, if we read data 1, and it has data checksum, we can still recover it via the regular RAID56 recovery path. But if now we decide to write some data into data 2, then we need to go RMW. Let's say we want to write 64KiB of '0x00' into data 2, then we read the on-disk data of data 1, calculate the new parity, resulting in the following layout: Disk 1: data 1 \|0x000000000000\| <- Corrupted Disk 2: data 2 \|0x000000000000\| <- New '0x00' writes Disk 3: parity \|0x000000000000\| <- New Parity. But the new parity is calculated using the corrupted data 1, so we can no longer recover the correct data of data 1. Thus the corruption is permanent. [FIX] To solve above problem, this patch will do a full stripe data checksum verification at RMW time. 
This involves the following changes: - Always read the full stripe (including data/P/Q) when doing RMW Before we only read the missing data sectors, but since we may do a data csum verification and recovery, we need to read everything out. Please note that, if we have a cached rbio, we don't need to read anything, and can treat it the same as full stripe write. As only a stripe whose csums all match can be cached. - Verify the data csum during read. The goal is only the rbio stripe sectors, and only if the rbio already has csum_buf/csum_bitmap filled. And sectors which cannot pass csum verification will have their bit set in error_bitmap. - Always call recover_sectors() after we read out all the sectors Since error_bitmap will be updated during read, recover_sectors() can easily find out all the bad sectors and try to recover (if still under tolerance). And since recover_sectors() is already migrated to use error_bitmap, it can skip vertical stripes which don't have any error. - Verify the repaired sectors against their csum in recover_vertical() - Rename rmw_read_and_wait() to rmw_read_wait_recover() Since we will always recover the sectors, the old name is no longer accurate. Furthermore since recovery is already done in rmw_read_wait_recover(), we no longer need to call recover_sectors() inside rmw_rbio(). Obviously this will have a performance impact, as we are doing more work during RMW cycle: - Fetch the data checksums - Do checksum verification for all data stripes - Do checksum verification again after repair But for full stripe write or cached rbio we won't have the overhead at all, thus for fully optimized RAID56 workload (always full stripe write), there should be no extra overhead. To me, the extra overhead looks reasonable, as data consistency is way more important than performance. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00


237 commits