Commit graph

264 commits

Author SHA1 Message Date
Linus Torvalds
6238729bfc fuse update for 6.18
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaNuYpQAKCRDh3BK/laaZ
 PBF6APoDWzheST2Gw3LPkU03xF6Yn+ZMP0p3L676yYBb07LrWQD8CBsfOkMei6W/
 L+/GaVAbpIsuCQ6GTPE/6HTLWevU2Q0=
 =kR5v
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Extend copy_file_range interface to be fully 64bit capable (Miklos)

 - Add selftest for fusectl (Chen Linxuan)

 - Move fuse docs into a separate directory (Bagas Sanjaya)

 - Allow fuse to enter freezable state in some cases (Sergey
   Senozhatsky)

 - Clean up writeback accounting after removing tmp page copies (Joanne)

 - Optimize virtiofs request handling (Li RongQing)

 - Add synchronous FUSE_INIT support (Miklos)

 - Allow server to request prune of unused inodes (Miklos)

 - Fix deadlock with AIO/sync release (Darrick)

 - Add some prep patches for block/iomap support (Darrick)

 - Misc fixes and cleanups

* tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (26 commits)
  fuse: move CREATE_TRACE_POINTS to a separate file
  fuse: move the backing file idr and code into a new source file
  fuse: enable FUSE_SYNCFS for all fuseblk servers
  fuse: capture the unique id of fuse commands being sent
  fuse: fix livelock in synchronous file put from fuseblk workers
  mm: fix lockdep issues in writeback handling
  fuse: add prune notification
  fuse: remove redundant calls to fuse_copy_finish() in fuse_notify()
  fuse: fix possibly missing fuse_copy_finish() call in fuse_notify()
  fuse: remove FUSE_NOTIFY_CODE_MAX from <uapi/linux/fuse.h>
  fuse: remove fuse_readpages_end() null mapping check
  fuse: fix references to fuse.rst -> fuse/fuse.rst
  fuse: allow synchronous FUSE_INIT
  fuse: zero initialize inode private data
  fuse: remove unused 'inode' parameter in fuse_passthrough_open
  virtio_fs: fix the hash table using in virtio_fs_enqueue_req()
  mm: remove BDI_CAP_WRITEBACK_ACCT
  fuse: use default writeback accounting
  virtio_fs: Remove redundant spinlock in virtio_fs_request_complete()
  fuse: remove unneeded offset assignment when filling write pages
  ...
2025-10-03 12:48:18 -07:00
Linus Torvalds
8804d970fa Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from
   Kairui Song improves performance and reduces the failure rate of swap
   cluster allocation.
 
 - The 4 patch series "support large align and nid in Rust allocators"
   from Vitaly Wool permits Rust allocators to set NUMA node and large
   alignment when perforning slub and vmalloc reallocs.
 
 - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
   Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
   for virtual address spaces for ops-level DAMOS filters.
 
 - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
   from Suren Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps.
 
 - The 2 patch series "mm/mincore: minor clean up for swap cache
   checking" from Kairui Song performs some cleanup in the swap code.
 
 - The 11 patch series "mm: vm_normal_page*() improvements" from David
   Hildenbrand provides code cleanup in the pagemap code.
 
 - The 5 patch series "add persistent huge zero folio support" from
   Pankaj Raghav provides a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero.
 
 - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
   few touchups to the recently added Kexec Handover feature.
 
 - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
   arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap.  To
   end the constant struggle with space shortage on 32-bit conflicting with
   64-bit's needs.
 
 - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
   cleans up some swap code.
 
 - The 7 patch series "selftests/mm: Fix false positives and skip
   unsupported tests" from Donet Tom fixes a few things in our selftests
   code.
 
 - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
   THPs when advised" from David Hildenbrand "allows individual processes
   to opt-out of THP=always into THP=madvise, without affecting other
   workloads on the system".
 
   It's a long story - the [1/N] changelog spells out the considerations.
 
 - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
   gets us started on the memdesc project.  Please see
   https://kernelnewbies.org/MatthewWilcox/Memdescs and
   https://blogs.oracle.com/linux/post/introducing-memdesc.
 
 - The 3 patch series "Tiny optimization for large read operations" from
   Chi Zhiling improves the efficiency of the pagecache read path.
 
 - The 5 patch series "Better split_huge_page_test result check" from Zi
   Yan improves our folio splitting selftest code.
 
 - The 2 patch series "test that rmap behaves as expected" from Wei Yang
   adds some rmap selftests.
 
 - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
   removes that function and converts its two remaining callers.
 
 - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
   fixes some UFFD selftests issues.
 
 - The 3 patch series "introduce kernel file mapped folios" from Boris
   Burkov introduces the concept of "kernel file pages".  Using these
   permits btrfs to account its metadata pages to the root cgroup, rather
   than to the cgroups of random inappropriate tasks.
 
 - The 2 patch series "mm/pageblock: improve readability of some
   pageblock handling" from Wei Yang provides some readability improvements
   to the page allocator code.
 
 - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
   Park teaches DAMON to understand arm32 highmem.
 
 - The 4 patch series "tools: testing: Use existing atomic.h for
   vma/maple tests" from Brendan Jackman performs some code cleanups and
   deduplication under tools/testing/.
 
 - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
   Liam Howlett fixes a couple of 32-bit issues in
   tools/testing/radix-tree.c.
 
 - The 2 patch series "kasan: unify kasan_enabled() and remove
   arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
   arch-specific initialization code into a common arch-neutral
   implementation.
 
 - The 3 patch series "mm: remove zpool" from Johannes Weiner removes
   zspool - an indirection layer which now only redirects to a single thing
   (zsmalloc).
 
 - The 2 patch series "mm: task_stack: Stack handling cleanups" from
   Pasha Tatashin makes a couple of cleanups in the fork code.
 
 - The 37 patch series "mm: remove nth_page()" from David Hildenbrand
   makes rather a lot of adjustments at various nth_page() callsites,
   eventually permitting the removal of that undesirable helper function.
 
 - The 2 patch series "introduce kasan.write_only option in hw-tags" from
   Yeoreum Yun creates a KASAN read-only mode for ARM, using that
   architecture's memory tagging feature.  It is felt that a read-only mode
   KASAN is suitable for use in production systems rather than debug-only.
 
 - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
   from Kefeng Wang does some tidying in the hugetlb folio allocation code.
 
 - The 12 patch series "mm: establish const-correctness for pointer
   parameters" from Max Kellermann makes quite a number of the MM API
   functions more accurate about the constness of their arguments.  This
   was getting in the way of subsystems (in this case CEPH) when they
   attempt to improving their own const/non-const accuracy.
 
 - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
   fixes a number of code sites which were confused over when to use
   free_pages() vs __free_pages().
 
 - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
   Ryhl makes the mapletree code accessible to Rust.  Required by nouveau
   and by its forthcoming successor: the new Rust Nova driver.
 
 - The 2 patch series "selftests/mm: split_huge_page_test:
   split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
   some cleanups to the thp selftesting code.
 
 - The 14 patch series "mm, swap: introduce swap table as swap cache
   (phase I)" from Chris Li and Kairui Song is the first step along the
   path to implementing "swap tables" - a new approach to swap allocation
   and state tracking which is expected to yield speed and space
   improvements.  This patchset itself yields a 5-20% performance benefit
   in some situations.
 
 - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
   the new memdesc layer to clean up the ptdesc code a little.
 
 - The 3 patch series "Fix va_high_addr_switch.sh test failure" from
   Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
 
 - The 2 patch series "Minor fixes for memory allocation profiling" from
   Suren Baghdasaryan addresses a couple of minor issues in relatively new
   memory allocation profiling feature.
 
 - The 3 patch series "Small cleanups" from Matthew Wilcox has a few
   cleanups in preparation for more memdesc work.
 
 - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
   DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
   furtherance of supporting arm highmem.
 
 - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
   warnings" from Muhammad Anjum adds that compiler check to selftests code
   and fixes the fallout, by removing dead code.
 
 - The 10 patch series "Improvements to Victim Process Thawing and OOM
   Reaper Traversal Order" from zhongjinji makes a number of improvements
   in the OOM killer: mainly thawing a more appropriate group of victim
   threads so they can release resources.
 
 - The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
   from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
 
 - The 7 patch series "mm/damon: define and use DAMON initialization
   check function" from SeongJae Park implement reliability and
   maintainability improvements to a recently-added bug fix.
 
 - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
   non-idle ages" from SeongJae Park provides additional transparency to
   userspace clients of the DAMON_STAT information.
 
 - The 2 patch series "Expand scope of khugepaged anonymous collapse"
   from Dev Jain removes some constraints on khubepaged's collapsing of
   anon VMAs.  It also increases the success rate of MADV_COLLAPSE against
   an anon vma.
 
 - The 2 patch series "mm: do not assume file == vma->vm_file in
   compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
   removal of file_operations.mmap().  This patchset concentrates upon
   clearing up the treatment of stacked filesystems.
 
 - The 6 patch series "mm: Improve mlock tracking for large folios" from
   Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
   of large folios.  /proc/meminfo's "Mlocked" field became more accurate.
 
 - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
   during fork" from Donet Tom fixes several user-visible KSM stats
   inaccuracies across forks and adds selftest code to verify these
   counters.
 
 - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
   Yang addresses some potential but presently benign issues in KSM's
   mm_slot handling.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
 jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
 =4mhY
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   perforning slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap. To end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zspool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN read-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a read-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempt to improving
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implement reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khubepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Jan Kara
e1b849cfa6
writeback: Avoid contention on wb->list_lock when switching inodes
There can be multiple inode switch works that are trying to switch
inodes to / from the same wb. This can happen in particular if some
cgroup exits which owns many (thousands) inodes and we need to switch
them all. In this case several inode_switch_wbs_work_fn() instances will
be just spinning on the same wb->list_lock while only one of them makes
forward progress. This wastes CPU cycles and quickly leads to softlockup
reports and unusable system.

Instead of running several inode_switch_wbs_work_fn() instances in
parallel switching to the same wb and contending on wb->list_lock, run
just one work item per wb and manage a queue of isw items switching to
this wb.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2025-09-19 13:11:00 +02:00
Ye Liu
79e1c24285 mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversion
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros:
- MB_TO_PAGES(mb)    converts MB to page count
- PAGES_TO_MB(pages) converts pages to MB

No functional change.

[akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB]
[akpm@linux-foundation.org: don't include mm.h due to include file ordering mess]
Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:42 -07:00
Joanne Koong
2841808f35 mm: remove BDI_CAP_WRITEBACK_ACCT
There are no users of BDI_CAP_WRITEBACK_ACCT now that fuse doesn't do
its own writeback accounting. This commit removes
BDI_CAP_WRITEBACK_ACCT.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27 14:29:43 +02:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Kemeng Shi
4b5bbc39d7 writeback: support retrieving per group debug writeback stats of bdi
Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
of bdi.

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* per wb writeback info of bdi is collected */
cat wb_stats
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1

WbCgIno:                 4091
WbWriteback:             1792 kB
WbReclaimable:         820512 kB
WbDirtyThresh:        6004692 kB
WbDirtied:            1820448 kB
WbWritten:             999488 kB
WbWriteBandwidth:      169020 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5

WbCgIno:                 4131
WbWriteback:             1120 kB
WbReclaimable:         820064 kB
WbDirtyThresh:        6004728 kB
WbDirtied:            1822688 kB
WbWritten:            1002400 kB
WbWriteBandwidth:      153520 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5

[shikemeng@huaweicloud.com: fix build problems]
  Link: https://lkml.kernel.org/r/20240423034643.141219-4-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20240423034643.141219-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:51 -07:00
Kemeng Shi
e32e27009f writeback: collect stats of all wb of bdi in bdi_debug_stats_show
Patch series "Improve visibility of writeback", v5.

This series tries to improve visilibity of writeback.  Patch 1 make
/sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi instead
of only writeback info in root cgroup.  Patch 2 add a new debug file
/sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback info.  Patch 3
add wb_monitor.py to monitor basic writeback info of running system, more
info could be added on demand.  Patch 4 is a random cleanup.  More details
can be found in respective patches.

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
cat stats
BdiWriteback:             4704 kB
BdiReclaimable:        1294496 kB
BdiDirtyThresh:      204208088 kB
DirtyThresh:         195259944 kB
BackgroundThresh:     32503588 kB
BdiDirtied:           48519296 kB
BdiWritten:           47225696 kB
BdiWriteBandwidth:     1173892 kBps
b_dirty:                     1
b_io:                        0
b_more_io:                   1
b_dirty_time:                0
bdi_list:                    1
state:                       1

/* per wb writeback info of bdi is collected */
cat /sys/kernel/debug/bdi/252:16/wb_stats
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1

WbCgIno:                 4208
WbWriteback:            59808 kB
WbReclaimable:         676480 kB
WbDirtyThresh:        6004624 kB
WbDirtied:           23348192 kB
WbWritten:           22614592 kB
WbWriteBandwidth:      593204 kBps
b_dirty:                    1
b_io:                       1
b_more_io:                  0
b_dirty_time:               0
state:                      7

WbCgIno:                 4249
WbWriteback:           144256 kB
WbReclaimable:         432096 kB
WbDirtyThresh:        6004344 kB
WbDirtied:           25727744 kB
WbWritten:           25154752 kB
WbWriteBandwidth:      577904 kBps
b_dirty:                    0
b_io:                       1
b_more_io:                  0
b_dirty_time:               0
state:                      7

The wb_monitor.py script output is as following:
./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360

                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...


This patch (of 5):

/sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
of whole bdi, but only writeback information of bdi in root cgroup is
collected.  So writeback information in non-root cgroup are missing now. 
To be more specific, considering following case:

/* create writeback cgroup */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo $$ > cgroup.procs
/* do writeback in cgroup */
fio -name test -filename=/dev/vdb ...
/* get writeback info of bdi */
cat /sys/kernel/debug/bdi/xxx/stats
The cat result unexpectedly implies that there is no writeback on target
bdi.

Fix this by collecting stats of all wb in bdi instead of only wb in
root cgroup.

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
cat stats
BdiWriteback:             2912 kB
BdiReclaimable:        1598464 kB
BdiDirtyThresh:      167479028 kB
DirtyThresh:         195038532 kB
BackgroundThresh:     32466728 kB
BdiDirtied:           19141696 kB
BdiWritten:           17543456 kB
BdiWriteBandwidth:     1136172 kBps
b_dirty:                     2
b_io:                        0
b_more_io:                   1
b_dirty_time:                0
bdi_list:                    1
state:                       1

Link: https://lkml.kernel.org/r/20240423034643.141219-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20240423034643.141219-2-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:51 -07:00
Kefeng Wang
85109a8a9a mm: backing-dev: use group allocation/free of per-cpu counters API
Use group allocation/free of per-cpu counters api to accelerate
wb_init/exit() and simplify code.

Link: https://lkml.kernel.org/r/20240325035635.49342-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:12 -07:00
Linus Torvalds
7ea65c89d8 vfs-6.9.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem3wQAKCRCRxhvAZXjc
 otRMAQDeo8qsuuIAcS2KUicKqZR5yMVvrY9r4sQzf7YRcJo5HQD+NQXkKwQuv1VO
 OUeScsic/+I+136AgdjWnlEYO5dp0go=
 =4WKU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Misc features, cleanups, and fixes for vfs and individual filesystems.

  Features:

   - Support idmapped mounts for hugetlbfs.

   - Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
     where the passed offset is ignored if the file is O_APPEND. The new
     flag allows a caller to enforce that the offset is honored to
     conform to posix even if the file was opened in append mode.

   - Move i_mmap_rwsem in struct address_space to avoid false sharing
     between i_mmap and i_mmap_rwsem.

   - Convert efs, qnx4, and coda to use the new mount api.

   - Add a generic is_dot_dotdot() helper that's used by various
     filesystems and the VFS code instead of open-coding it multiple
     times.

   - Recently we've added stable offsets which allows stable ordering
     when iterating directories exported through NFS on e.g., tmpfs
     filesystems. Originally an xarray was used for the offset map but
     that caused slab fragmentation issues over time. This switches the
     offset map to the maple tree which has a dense mode that handles
     this scenario a lot better. Includes tests.

   - Finally merge the case-insensitive improvement series Gabriel has
     been working on for a long time. This cleanly propagates case
     insensitive operations through ->s_d_op which in turn allows us to
     remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
     It also improves performance by trying a case-sensitive comparison
     first and then fallback to case-insensitive lookup if that fails.
     This also fixes a bug where overlayfs would be able to be mounted
     over a case insensitive directory which would lead to all sort of
     odd behaviors.

  Cleanups:

   - Make file_dentry() a simple accessor now that ->d_real() is
     simplified because of the backing file work we did the last two
     cycles.

   - Use the dedicated file_mnt_idmap helper in ntfs3.

   - Use smp_load_acquire/store_release() in the i_size_read/write
     helpers and thus remove the hack to handle i_size reads in the
     filemap code.

   - The SLAB_MEM_SPREAD is a nop now. Remove it from various places in
     fs/

   - It's no longer necessary to perform a second built-in initramfs
     unpack call because we retain the contents of the previous
     extraction. Remove it.

   - Now that we have removed various allocators kfree_rcu() always
     works with kmem caches and kmalloc(). So simplify various places
     that only use an rcu callback in order to handle the kmem cache
     case.

   - Convert the pipe code to use a lockdep comparison function instead
     of open-coding the nesting making lockdep validation easier.

   - Move code into fs-writeback.c that was located in a header but can
     be made static as it's only used in that one file.

   - Rewrite the alignment checking iterators for iovec and bvec to be
     easier to read, and also significantly more compact in terms of
     generated code. This saves 270 bytes of text on x86-64 (with
     clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
     saves a bit of time for the same workload.

   - Switch various places to use KMEM_CACHE instead of
     kmem_cache_create().

   - Use inode_set_ctime_to_ts() in inode_set_ctime_current()

   - Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.

   - Various smaller cleanups for eventfds.

  Fixes:

   - Fix various comments and typos, and unneeded initializations.

   - Fix stack allocation hack for clang in the select code.

   - Improve dump_mapping() debug code on a best-effort basis.

   - Fix build errors in various selftests.

   - Avoid wrap-around instrumentation in various places.

   - Don't allow user namespaces without an idmapping to be used for
     idmapped mounts.

   - Fix sysv sb_read() call.

   - Fix fallback implementation of the get_name() export operation"

* tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
  hugetlbfs: support idmapped mounts
  qnx4: convert qnx4 to use the new mount api
  fs: use inode_set_ctime_to_ts to set inode ctime to current time
  libfs: Drop generic_set_encrypted_ci_d_ops
  ubifs: Configure dentry operations at dentry-creation time
  f2fs: Configure dentry operations at dentry-creation time
  ext4: Configure dentry operations at dentry-creation time
  libfs: Add helper to choose dentry operations at mount-time
  libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
  fscrypt: Drop d_revalidate once the key is added
  fscrypt: Drop d_revalidate for valid dentries during lookup
  fscrypt: Factor out a helper to configure the lookup dentry
  ovl: Always reject mounting over case-insensitive directories
  libfs: Attempt exact-match comparison first during casefolded lookup
  efs: remove SLAB_MEM_SPREAD flag usage
  jfs: remove SLAB_MEM_SPREAD flag usage
  minix: remove SLAB_MEM_SPREAD flag usage
  openpromfs: remove SLAB_MEM_SPREAD flag usage
  proc: remove SLAB_MEM_SPREAD flag usage
  qnx6: remove SLAB_MEM_SPREAD flag usage
  ...
2024-03-11 09:38:17 -07:00
Jan Kara
f814bdda77 blk-wbt: Fix detection of dirty-throttled tasks
The detection of dirty-throttled tasks in blk-wbt has been subtly broken
since its beginning in 2016. Namely if we are doing cgroup writeback and
the throttled task is not in the root cgroup, balance_dirty_pages() will
set dirty_sleep for the non-root bdi_writeback structure. However
blk-wbt checks dirty_sleep only in the root cgroup bdi_writeback
structure. Thus detection of recently throttled tasks is not working in
this case (we noticed this when we switched to cgroup v2 and suddently
writeback was slow).

Since blk-wbt has no easy way to get to proper bdi_writeback and
furthermore its intention has always been to work on the whole device
rather than on individual cgroups, just move the dirty_sleep timestamp
from bdi_writeback to backing_dev_info. That fixes the checking for
recently throttled task and saves memory for everybody as a bonus.

CC: stable@vger.kernel.org
Fixes: b57d74aff9 ("writeback: track if we're sleeping on progress in balance_dirty_pages()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240123175826.21452-1-jack@suse.cz
[axboe: fixup indentation errors]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-06 09:44:03 -07:00
Kemeng Shi
12f7900c57 writeback: move wb_wakeup_delayed defination to fs-writeback.c
The wb_wakeup_delayed is only used in fs-writeback.c. Move it to
fs-writeback.c after defination of wb_wakeup and make it static.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240118203339.764093-1-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-22 15:33:38 +01:00
Jinliang Zheng
9af7c7426c writeback: remove redundant checks for root memcg
The check for root memcg will be done in wb_get_lookup(), so remove the
redundant one to simplify the code.

Link: https://lkml.kernel.org/r/20230808084431.1632934-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:48 -07:00
ZhangPeng
61f2973801 mm: remove redundant K() macro definition
Patch series "cleanup with helper macro K()".

Use helper macro K() to improve code readability.  No functional
modification involved.  Remove redundant K() macro definition.


This patch (of 7):

Since commit eb8589b4f8 ("mm: move mem_init_print_info() to mm_init.c"),
the K() macro definition has been moved to mm/internal.h.  Therefore, the
definitions in mm/memcontrol.c, mm/backing-dev.c and mm/oom_kill.c are
redundant.  Drop redundant definitions.

[akpm@linux-foundation.org: oom_kill.c: remove "#undef K", per Kefeng]
Link: https://lkml.kernel.org/r/20230804012559.2617515-1-zhangpeng362@huawei.com
Link: https://lkml.kernel.org/r/20230804012559.2617515-2-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:44 -07:00
Ivan Orlov
b5665cf936 mm: backing-dev: make bdi_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the bdi_class structure to be declared at build time placing
it into read-only memory, instead of having to be dynamically allocated at
load time.

Link: https://lkml.kernel.org/r/20230620183314.682822-2-gregkh@linuxfoundation.org
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 16:59:27 -07:00
Linus Torvalds
7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds
556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firwmare loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
 LEGadNS38k5fs+73UaxV
 =7K4B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Tom Rix
f6365881bf mm: backing-dev: set variables dev_attr_min,max_bytes storage-class-specifier to static
smatch reports
mm/backing-dev.c:266:1: warning: symbol
  'dev_attr_min_bytes' was not declared. Should it be static?
mm/backing-dev.c:294:1: warning: symbol
  'dev_attr_max_bytes' was not declared. Should it be static?

These variables are only used in one file so should be static.

Link: https://lkml.kernel.org/r/20230408141609.2703647-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:56 -07:00
Baokun Li
1ba1199ec5 writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
KASAN report null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
 <TASK>
 dump_stack_lvl+0x7f/0xc0
 print_report+0x2ba/0x340
 kasan_report+0xc4/0x120
 kasan_check_range+0x1b7/0x2e0
 __kasan_check_write+0x24/0x40
 bdi_split_work_to_wbs+0x5c5/0x7b0
 sync_inodes_sb+0x195/0x630
 sync_inodes_one_sb+0x3a/0x50
 iterate_supers+0x106/0x1b0
 ksys_sync+0x98/0x160
[...]
==================================================================

The race that causes the above issue is as follows:

           cpu1                     cpu2
-------------------------|-------------------------
inode_switch_wbs
 INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
 queue_rcu_work(isw_wq, &isw->work)
 // queue_work async
  inode_switch_wbs_work_fn
   wb_put_many(old_wb, nr_switched)
    percpu_ref_put_many
     ref->data->release(ref)
     cgwb_release
      queue_work(cgwb_release_wq, &wb->release_work)
      // queue_work async
       &wb->release_work
       cgwb_release_workfn
                            ksys_sync
                             iterate_supers
                              sync_inodes_one_sb
                               sync_inodes_sb
                                bdi_split_work_to_wbs
                                 kmalloc(sizeof(*work), GFP_ATOMIC)
                                 // alloc memory failed
        percpu_ref_exit
         ref->data = NULL
         kfree(data)
                                 wb_get(wb)
                                  percpu_ref_get(&wb->refcnt)
                                   percpu_ref_get_many(ref, 1)
                                    atomic_long_add(nr, &ref->data->count)
                                     atomic64_add(i, v)
                                     // trigger null-ptr-deref

bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs.  If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards. 
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wd, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.

This issue was introduced in v4.3-rc7 (see fix tag1).  Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.

To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). 
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shutdown.

Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a4a ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:26 -07:00
Greg Kroah-Hartman
1aaba11da9 driver core: class: remove module * from class_create()
The module pointer in class_create() never actually did anything, and it
shouldn't have been requred to be set as a parameter even if it did
something.  So just remove it and fix up all callers of the function in
the kernel tree at the same time.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17 15:16:33 +01:00
Stefan Roesch
ad3e6dabf6 mm: add /sys/class/bdi/<bdi>/min_ratio_fine knob
This adds the min_ratio_fine knob. The knob specifies the values not
based on 1 of 100, but instead 1 per million.

Link: https://lkml.kernel.org/r/20221119005215.3052436-20-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:06 -08:00
Stefan Roesch
bca52dcbad mm: add /sys/class/bdi/<bdi>/max_ratio_fine knob
This adds the max_ratio_fine knob. The knob specifies the values not
based on 1 of 100, but instead 1 per million.

Link: https://lkml.kernel.org/r/20221119005215.3052436-17-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:06 -08:00
Stefan Roesch
9c84819bd6 mm: add /sys/class/bdi/<bdi>/min_bytes knob
bdi has two existing knobs to limit the amount of dirty memory:
min_ratio and max_ratio. However the granularity of the knobs is limited
and often it is more convenient to specify limits in terms of bytes.
This change adds the min_bytes knob.

It does not store the min_bytes value, instead it converts the max_bytes
value to a ratio. The value is therefore more an approximation than an
absolute value.

It also maintains the sum over all the bdi min_ratio values stored in
the variable bdi_min_ratio.

Link: https://lkml.kernel.org/r/20221119005215.3052436-14-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:05 -08:00
Stefan Roesch
c56e049a5e mm: add knob /sys/class/bdi/<bdi>/max_bytes
This adds the new knob max_bytes to specify a dirty memory limit for the
corresponding bdi. The specified bytes value is converted to a ratio.

Link: https://lkml.kernel.org/r/20221119005215.3052436-9-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:04 -08:00
Stefan Roesch
ae82291e9c mm: use part per 1000000 for bdi ratios
To get finer granularity for ratio calculations use part per million
instead of percentiles. This is especially important if we want to
automatically convert byte values to ratios. Otherwise the values that
are actually used can be quite different. This is also important for
machines with more main memory (1% of 256GB is already 2.5GB).

Link: https://lkml.kernel.org/r/20221119005215.3052436-5-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:03 -08:00
Stefan Roesch
27bbe9d48d mm: add knob /sys/class/bdi/<bdi>/strict_limit
Add a new knob to /sys/class/bdi/<bdi>/strict_limit. This new knob
allows to set/unset the flag BDI_CAP_STRICTLIMIT in the bdi
capabilities.

Link: https://lkml.kernel.org/r/20221119005215.3052436-3-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:03 -08:00
ye xingchen
3083da7bcf mm: backing-dev: Remove the unneeded result variable
Return the value cgwb_bdi_init() directly instead of storing it in another
redundant variable.

Link: https://lkml.kernel.org/r/20220826071906.252419-1-ye.xingchen@zte.com.cn
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:26:02 -07:00
Khazhismel Kumykov
f87904c075 writeback: avoid use-after-free after removing device
When a disk is removed, bdi_unregister gets called to stop further
writeback and wait for associated delayed work to complete.  However,
wb_inode_writeback_end() may schedule bandwidth estimation dwork after
this has completed, which can result in the timer attempting to access the
just freed bdi_writeback.

Fix this by checking if the bdi_writeback is alive, similar to when
scheduling writeback work.

Since this requires wb->work_lock, and wb_inode_writeback_end() may get
called from interrupt, switch wb->work_lock to an irqsafe lock.

Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
Fixes: 45a2966fd6 ("writeback: fix bandwidth estimate for spiky workload")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:43 -07:00
Jan Kara
4bca7e80b6 init: Initialize noop_backing_dev_info early
noop_backing_dev_info is used by superblocks of various
pseudofilesystems such as kdevtmpfs. After commit 10e1407310
("writeback: Fix inode->i_io_list not be protected by inode->i_lock
error") this broke because __mark_inode_dirty() started to access more
fields from noop_backing_dev_info and this led to crashes inside
locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty().
Fix the problem by initializing noop_backing_dev_info before the
filesystems get mounted.

Fixes: 10e1407310 ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error")
Reported-and-tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reported-and-tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-16 10:55:57 +02:00
Christoph Hellwig
c97ab27157 blk-cgroup: remove unneeded includes from <linux/blk-cgroup.h>
Remove all the includes that aren't actually needed from
<linux/blk-cgroup.h> and push them to the actual source files where
needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
dec223c92a blk-cgroup: move struct blkcg to block/blk-cgroup.h
There is no real need to expose the blkcg structure to the whole kernel.
Move it to the private header an expose a helper to let the writeback
code access the cgwb_list member.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
397c9f46ee blk-cgroup: move blkcg_{pin,unpin}_online out of line
Move these two functions out of line as there is no good reason
to inline them.  Also switch to passing a cgroup_subsys_state
instead of doing the conversion in the caller to prepare for making
the blkcg structure private to blk-cgroup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
NeilBrown
a88f2096d5 remove congestion tracking framework
This framework is no longer used - so discard it.

Link: https://lkml.kernel.org/r/164549983747.9187.6171768583526866601.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:01 -07:00
Manjong Lee
3c376dfafb mm: bdi: initialize bdi_min_ratio when bdi is unregistered
Initialize min_ratio if it is set during bdi unregistration.  This can
prevent problems that may occur a when bdi is removed without resetting
min_ratio.

For example.
1) insert external sdcard
2) set external sdcard's min_ratio 70
3) remove external sdcard without setting min_ratio 0
4) insert external sdcard
5) set external sdcard's min_ratio 70 << error occur(can't set)

Because when an sdcard is removed, the present bdi_min_ratio value will
remain.  Currently, the only way to reset bdi_min_ratio is to reboot.

[akpm@linux-foundation.org: tweak comment and coding style]

Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com
Signed-off-by: Manjong Lee <mj0123.lee@samsung.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Changheun Lee <nanich.lee@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <seunghwan.hyun@samsung.com>
Cc: <sookwan7.kim@samsung.com>
Cc: <yt0928.kim@samsung.com>
Cc: <junho89.kim@samsung.com>
Cc: <jisoo2146.oh@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-10 17:10:56 -08:00
Linus Torvalds
512b7931ad Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "257 patches.

  Subsystems affected by this patch series: scripts, ocfs2, vfs, and
  mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
  gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
  pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
  memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
  vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
  cleanups, kfence, and damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
  mm/damon: remove return value from before_terminate callback
  mm/damon: fix a few spelling mistakes in comments and a pr_debug message
  mm/damon: simplify stop mechanism
  Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
  Docs/admin-guide/mm/damon/start: simplify the content
  Docs/admin-guide/mm/damon/start: fix a wrong link
  Docs/admin-guide/mm/damon/start: fix wrong example commands
  mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
  mm/damon: remove unnecessary variable initialization
  Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
  mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
  selftests/damon: support watermarks
  mm/damon/dbgfs: support watermarks
  mm/damon/schemes: activate schemes based on a watermarks mechanism
  tools/selftests/damon: update for regions prioritization of schemes
  mm/damon/dbgfs: support prioritization weights
  mm/damon/vaddr,paddr: support pageout prioritization
  mm/damon/schemes: prioritize regions within the quotas
  mm/damon/selftests: support schemes quotas
  mm/damon/dbgfs: support quotas of schemes
  ...
2021-11-06 14:08:17 -07:00
Mel Gorman
8cd7c588de mm/vmscan: throttle reclaim until some writeback completes if congested
Patch series "Remove dependency on congestion_wait in mm/", v5.

This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested.  It's not a clever implementation but
congestion_wait has been broken for a long time [1].

Even if congestion throttling worked, it was never a great idea.  While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).

This series replaces the "congestion" throttling with 3 different types.

 - If there are too many dirty/writeback pages, sleep until a timeout or
   enough pages get cleaned

 - If too many pages are isolated, sleep until enough isolated pages are
   either reclaimed or put back on the LRU

 - If no progress is being made, direct reclaim tasks sleep until
   another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work.  A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem.  Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase.  It has
four types of worker.

 - One "anon latency" worker creates small mappings with mmap() and
   times how long it takes to fault the mapping reading it 4K at a time

 - X file writers which is fio randomly writing X files where the total
   size of the files add up to the allowed dirty_ratio. fio is allowed
   to run for a warmup period to allow some file-backed pages to
   accumulate. The duration of the warmup is based on the best-case
   linear write speed of the storage.

 - Y file readers which is fio randomly reading small files

 - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

 - Total estimated WSS = (100+dirty_ration) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages.  The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results.  The other two machines had much better latencies for example.

First the results of the "anon latency" latency

  stutterp
                                5.15.0-rc1             5.15.0-rc1
                                   vanilla mm-reclaimcongest-v5r4
  Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
  Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
  Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
  Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
  Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
  Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
  Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
  Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
  Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
  Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
  Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
  Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
  Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
  Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
  Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
  Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
  Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
  Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
  Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
  Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
  Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
  Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
  Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
  Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU.  There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim.  The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers.  Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.

It is expected that this will be very machine dependant.  Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not.  The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger.  The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable.  For example,
these are the latencies on a single-socket machine that had more memory

  Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
  Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
  Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
  Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
  Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

The overall system CPU usage and elapsed time is as follows

                    5.15.0-rc3  5.15.0-rc3
                       vanilla mm-reclaimcongest-v5r4
  Duration User        6989.03      983.42
  Duration System      7308.12      799.68
  Duration Elapsed     2277.67     2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.

The high-level /proc/vmstats show

                                       5.15.0-rc1     5.15.0-rc1
                                          vanilla mm-reclaimcongest-v5r2
  Ops Direct pages scanned          1056608451.00   503594991.00
  Ops Kswapd pages scanned           109795048.00   147289810.00
  Ops Kswapd pages reclaimed          63269243.00    31036005.00
  Ops Direct pages reclaimed          10803973.00     6328887.00
  Ops Kswapd efficiency %                   57.62          21.07
  Ops Kswapd velocity                    48204.98       57572.86
  Ops Direct efficiency %                    1.02           1.26
  Ops Direct velocity                   463898.83      196845.97

Kswapd scanned less pages but the detailed pattern is different.  The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity.  Direct reclaim scanning is reduced by
52% due to stalling.

The pattern for stealing pages is also slightly different.  Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes.  The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.

  Ops Percentage direct scans               90.59          77.37

For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling

  Ops Page writes by reclaim           2613590.00     1687131.00

Page writes from reclaim context are reduced.

  Ops Page writes anon                 2932752.00     1917048.00

And there is less swapping.

  Ops Page reclaim immediate         996248528.00   107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

  Ops Slabs scanned                     164284.00      153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

  Vanilla
  -------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The fast majority of wait_iff_congested calls do not stall at all.  What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).

      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
     94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
    112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry.  For direct
reclaim, the number of times stalled for each reason were

   6624 reason=VMSCAN_THROTTLE_ISOLATED
  93246 reason=VMSCAN_THROTTLE_NOPROGRESS
  96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward.  A relatively small number were due to too many pages isolated
from the LRU by parallel threads

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all.  A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map

      1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
      6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
     11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
     18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
     21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
     26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
     27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
     28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
     29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
     31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
     32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
     33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
     37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
     38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
     40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
     43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
     55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
     56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
     58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
     61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
     79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
     88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
     94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
    118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
    119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
    126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
    146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
    159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
    178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
    183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
    237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
    266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
    313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
    347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
    470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
    559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
    964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
   7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
  22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
  51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all.  The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

      1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
      3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
     16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
     77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
    137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
    518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
    852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all.  This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls.  There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage.  A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

 - kswapd is encountering pages under writeback and marked for immediate
   reclaim implying that pages are cycling through the LRU faster than
   pages can be cleaned.

 - Direct reclaim will stall if all dirty pages are backed by congested
   inodes.

wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started.  If
enough pages belonging to the node are written back then the throttled
tasks will wake early.  If not, the throttled tasks sleeps until the
timeout expires.

[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]

Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:40 -07:00
Christoph Hellwig
efee17134c mm: simplify bdi refcounting
Move grabbing and releasing the bdi refcount out of the common
wb_init/wb_exit helpers into code that is only used for the non-default
memcg driven bdi_writeback structures.

[hch@lst.de: add comment]
  Link: https://lkml.kernel.org/r/20211027074207.GA12793@lst.de
[akpm@linux-foundation.org: fix typo]

Link: https://lkml.kernel.org/r/20211021124441.668816-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:34 -07:00
Christoph Hellwig
702f2d1e3b mm: don't automatically unregister bdis
All BDI users now unregister explicitly.

Link: https://lkml.kernel.org/r/20211021124441.668816-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:34 -07:00
Christoph Hellwig
c6fd3ac0fc mm: export bdi_unregister
Patch series "simplify bdi unregistation".

This series simplifies the BDI code to get rid of the magic
auto-unregister feature that hid a recent block layer refcounting bug.

This patch (of 5):

To wind down the magic auto-unregister semantics we'll need to push this
into modular code.

Link: https://lkml.kernel.org/r/20211021124441.668816-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211021124441.668816-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:34 -07:00
Christoph Hellwig
ccdf774189 mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>
Move inode_to_bdi out of line to avoid having to include blkdev.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Linus Torvalds
14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Jan Kara
45a2966fd6 writeback: fix bandwidth estimate for spiky workload
Michael Stapelberg has reported that for workload with short big spikes of
writes (GCC linker seem to trigger this frequently) the write throughput
is heavily underestimated and tends to steadily sink until it reaches
zero.  This has rather bad impact on writeback throttling (causing
stalls).  The problem is that writeback throughput estimate gets updated
at most once per 200 ms.  One update happens early after we submit pages
for writeback (at that point writeout of only small fraction of pages is
completed and thus observed throughput is tiny).  Next update happens only
during the next write spike (updates happen only from inode writeback and
dirty throttling code) and if that is more than 1s after previous spike,
we decide system was idle and just ignore whatever was written until this
moment.

Fix the problem by making sure writeback throughput estimate is also
updated shortly after writeback completes to get reasonable estimate of
throughput for spiky workloads.

[jack@suse.cz: avoid division by 0 in wb_update_dirty_ratelimit()]

Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
Link: https://lkml.kernel.org/r/20210713104716.22868-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
Tested-by: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:10 -07:00
Jan Kara
633a2abb9e writeback: track number of inodes under writeback
Patch series "writeback: Fix bandwidth estimates", v4.

Fix estimate of writeback throughput when device is not fully busy doing
writeback.  Michael Stapelberg has reported that such workload (e.g.
generated by linking) tends to push estimated throughput down to 0 and as
a result writeback on the device is practically stalled.

The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.

This patch (of 4):

Track number of inodes under writeback for each bdi_writeback structure.
We will use this to decide whether wb does any IO and so we can estimate
its writeback throughput.  In principle we could use number of pages under
writeback (WB_WRITEBACK counter) for this however normal percpu counter
reads are too inaccurate for our purposes and summing the counter is too
expensive.

Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:10 -07:00
Christoph Hellwig
5ed964f8e5 mm: hide laptop_mode_wb_timer entirely behind the BDI API
Don't leak the detaŃ–ls of the timer into the block layer, instead
initialize the timer in bdi_alloc and delete it in bdi_unregister.
Note that this means the timer is initialized (but not armed) for
non-block queues as well now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 11:52:28 -06:00
Roman Gushchin
b43a9e76b4 writeback, cgroup: remove wb from offline list before releasing refcnt
Boyang reported that the commit c22d70a162 ("writeback, cgroup:
release dying cgwbs by switching attached inodes") causes the kernel to
crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le.

  run fstests generic/256 at 2021-07-12 05:41:40
  EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: . Quota mode: none.
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
  Mem abort info:
     ESR = 0x96000005
     EC = 0x25: DABT (current EL), IL = 32 bits
     SET = 0, FnV = 0
     EA = 0, S1PTW = 0
     FSC = 0x05: level 1 translation fault
  Data abort info:
     ISV = 0, ISS = 0x00000005
     CM = 0, WnR = 0
  user pgtable: 64k pages, 48-bit VAs, pgdp=00000000b0502000
  [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
  Internal error: Oops: 96000005 [#1] SMP
  Modules linked in: dm_flakey dm_snapshot dm_bufio dm_zero dm_mod loop tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc ext4 vfat fat mbcache jbd2 drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_net net_failover virtio_console failover virtio_mmio aes_neon_bs [last unloaded: scsi_debug]
  CPU: 0 PID: 408468 Comm: kworker/u8:5 Tainted: G X --------- ---  5.14.0-0.rc1.15.bx.el9.aarch64 #1
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Workqueue: events_unbound cleanup_offline_cgwbs_workfn
  pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO BTYPE=--)
  pc : cleanup_offline_cgwbs_workfn+0x320/0x394
  lr : cleanup_offline_cgwbs_workfn+0xe0/0x394
  sp : ffff80001554fd10
  x29: ffff80001554fd10 x28: 0000000000000000 x27: 0000000000000001
  x26: 0000000000000000 x25: 00000000000000e0 x24: ffffd2a2fbe671a8
  x23: ffff80001554fd88 x22: ffffd2a2fbe67198 x21: ffffd2a2fc25a730
  x20: ffff210412bc3000 x19: ffff210412bc3280 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
  x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000040
  x11: ffff210481572238 x10: ffff21048157223a x9 : ffffd2a2fa276c60
  x8 : ffff210484106b60 x7 : 0000000000000000 x6 : 000000000007d18a
  x5 : ffff210416a86400 x4 : ffff210412bc0280 x3 : 0000000000000000
  x2 : ffff80001554fd88 x1 : ffff210412bc0280 x0 : 0000000000000003
  Call trace:
     cleanup_offline_cgwbs_workfn+0x320/0x394
     process_one_work+0x1f4/0x4b0
     worker_thread+0x184/0x540
     kthread+0x114/0x120
     ret_from_fork+0x10/0x18
  Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)
  ---[ end trace e250fe289272792a ]---
  Kernel panic - not syncing: Oops: Fatal exception
  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 0-2
  Kernel Offset: 0x52a2e9fa0000 from 0xffff800010000000
  PHYS_OFFSET: 0xfff0defca0000000
  CPU features: 0x00200251,23200840
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

The problem happens when cgwb_release_workfn() races with
cleanup_offline_cgwbs_workfn(): wb_tryget() in
cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() is
cgwb_release_workfn(), which is basically a use-after-free error.

Fix the problem by making removing the writeback structure from the
offline list before releasing the percpu reference counter.  It will
guarantee that cleanup_offline_cgwbs_workfn() will not see and not
access writeback structures which are about to be released.

Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com
Fixes: c22d70a162 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Boyang Xue <bxue@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Roman Gushchin
c22d70a162 writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to
the nearest living ancestor wb.  It helps to get rid of per-cgroup
writeback structures themselves and of pinned memory and block cgroups,
which are significantly larger structures (mostly due to large per-cpu
statistics data).  This prevents memory waste and helps to avoid different
scalability problems caused by large piles of dying cgroups.

Reuse the existing mechanism of inode switching used for foreign inode
detection.  To speed things up batch up to 115 inode switching in a single
operation (the maximum number is selected so that the resulting struct
inode_switch_wbs_context can fit into 1024 bytes).  Because every
switching consists of two steps divided by an RCU grace period, it would
be too slow without batching.  Please note that the whole batch counts as
a single operation (when increasing/decreasing isw_nr_in_flight).  This
allows to keep umounting working (flush the switching queue), however
prevents cleanups from consuming the whole switching quota and effectively
blocking the frn switching.

A cgwb cleanup operation can fail due to different reasons (e.g.  not
enough memory, the cgwb has an in-flight/pending io, an attached inode in
a wrong state, etc).  In this case the next scheduled cleanup will make a
new attempt.  An attempt is made each time a new cgwb is offlined (in
other words a memcg and/or a blkcg is deleted by a user).  In the future
an additional attempt scheduled by a timer can be implemented.

[guro@fb.com: replace open-coded "115" with arithmetic]
  Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com
[guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()]
  Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com
[willy@infradead.org: fix documentation]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org

Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin
f3b6a6df38 writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific
cgwb structure.  It limits the ability to efficiently reclaim the
writeback structure itself and associated memory and block cgroup
structures without scanning all inodes belonging to a sb, which can be
prohibitively expensive.

While dirty/in-active-writeback an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.  Once
cleaned up, it's removed from all io lists.  So the inode->i_io_list can
be reused to maintain the list of inodes, attached to a bdi_writeback
structure.

This patch introduces a new wb->b_attached list, which contains all inodes
which were dirty at least once and are attached to the given cgwb.  Inodes
attached to the root bdi_writeback structures are never placed on such
list.  The following patch will use this list to try to release cgwbs
structures more efficiently.

Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Daniel Vetter
c1ca59a1f2 mm/backing-dev.c: use might_alloc()
Now that my little helper has landed, use it more.  On top of the existing
check this also uses lockdep through the fs_reclaim annotations.

[akpm@linux-foundation.org: include linux/sched/mm.h]

Link: https://lkml.kernel.org/r/20210113135009.3606813-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:01 -08:00
Baolin Wang
6986c3e2b1 mm: backing-dev: Remove duplicated macro definition
Move the K() macro a little forward to remove the same macro definition.

Link: https://lkml.kernel.org/r/d1ccdf2d3116dce9814f2bcc1f0415ecb4c76ea5.1612862230.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00