Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. 
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. 
- The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_SWAP_H
#define _LINUX_SWAP_H

#include <linux/spinlock.h>
#include <linux/linkage.h>
#include <linux/mmzone.h>
#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/node.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/atomic.h>
#include <linux/page-flags.h>
#include <uapi/linux/mempolicy.h>
#include <asm/page.h>

struct notifier_block;

struct bio;

struct pagevec;

#define SWAP_FLAG_PREFER	0x8000	/* set if swap priority specified */
#define SWAP_FLAG_PRIO_MASK	0x7fff
#define SWAP_FLAG_DISCARD	0x10000 /* enable discard for swap */
#define SWAP_FLAG_DISCARD_ONCE	0x20000 /* discard swap area at swapon-time */
#define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */

#define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
				 SWAP_FLAG_DISCARD_PAGES)
#define SWAP_BATCH 64

static inline int current_is_kswapd(void)
{
	return current->flags & PF_KSWAPD;
}

/*
 * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
 * be swapped to. The swap type and the offset into that swap type are
 * encoded into pte's and into pgoff_t's in the swapcache. Using five bits
 * for the type means that the maximum number of swapcache pages is 27 bits
 * on 32-bit-pgoff_t architectures. And that assumes that the architecture
 * packs the type/offset into the pte as 5/27 as well.
 */
#define MAX_SWAPFILES_SHIFT	5
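
/*
 * Editor's illustration (not part of the upstream header): the generic
 * swp_entry_t helpers in linux/swapops.h pack and unpack entries around
 * this shift, roughly:
 *
 *	swp_entry_t e = swp_entry(type, offset);
 *	unsigned type   = swp_type(e);	 // top MAX_SWAPFILES_SHIFT bits
 *	pgoff_t  offset = swp_offset(e); // remaining low bits
 *
 * This is the arch-independent form of the 5/27-style split described
 * above; the pte-level encoding remains architecture-specific.
 */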

/*
 * Use some of the swap files numbers for other purposes. This
 * is a convenient way to hook into the VM to trigger special
 * actions on faults.
 */

/*
 * PTE markers are used to persist information onto PTEs that otherwise
 * should be a none pte. As its name "PTE" hints, it should only be
 * applied to the leaves of pgtables.
 */
#define SWP_PTE_MARKER_NUM 1
#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)

/*
 * Unaddressable device memory support. See include/linux/hmm.h and
 * Documentation/mm/hmm.rst. Short description is we need struct pages for
 * device memory that is unaddressable (inaccessible) by CPU, so that we can
 * migrate part of a process memory to device memory.
 *
 * When a page is migrated from CPU to device, we set the CPU page table entry
 * to a special SWP_DEVICE_{READ|WRITE} entry.
 *
 * When a page is mapped by the device for exclusive access we set the CPU page
 * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
 */
#ifdef CONFIG_DEVICE_PRIVATE
#define SWP_DEVICE_NUM 3
#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
#define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
#else
#define SWP_DEVICE_NUM 0
#endif

/*
 * Page migration support.
 *
 * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and
 * indicates that the referenced (part of) an anonymous page is exclusive to
 * a single process. For SWP_MIGRATION_WRITE, that information is implicit:
 * (part of) an anonymous page that are mapped writable are exclusive to a
 * single process.
 */
#ifdef CONFIG_MIGRATION
#define SWP_MIGRATION_NUM 3
#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
#else
#define SWP_MIGRATION_NUM 0
#endif

/*
 * Handling of hardware poisoned pages with memory corruption.
 */
#ifdef CONFIG_MEMORY_FAILURE
#define SWP_HWPOISON_NUM 1
#define SWP_HWPOISON		MAX_SWAPFILES
#else
#define SWP_HWPOISON_NUM 0
#endif

#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - \
	SWP_PTE_MARKER_NUM)
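
/*
 * Editor's note (illustrative, not upstream): with CONFIG_DEVICE_PRIVATE,
 * CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE all enabled, the arithmetic
 * above evaluates to
 *
 *	MAX_SWAPFILES = (1 << 5) - 3 - 3 - 1 - 1 = 24
 *
 * so 24 of the 32 encodable swap types denote real swap devices and the
 * remaining 8 are reserved for the special entries defined above.
 */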

/*
 * Magic header for a swap area. The first part of the union is
 * what the swap magic looks like for the old (limited to 128MB)
 * swap area format, the second part of the union adds - in the
 * old reserved area - some extra information. Note that the first
 * kilobyte is reserved for boot loader or disk label stuff...
 *
 * Having the magic at the end of the PAGE_SIZE makes detecting swap
 * areas somewhat tricky on machines that support multiple page sizes.
 * For 2.5 we'll probably want to move the magic to just beyond the
 * bootbits...
 */
union swap_header {
	struct {
		char reserved[PAGE_SIZE - 10];
		char magic[10];			/* SWAP-SPACE or SWAPSPACE2 */
	} magic;
	struct {
		char		bootbits[1024];	/* Space for disklabel etc. */
		__u32		version;
		__u32		last_page;
		__u32		nr_badpages;
		unsigned char	sws_uuid[16];
		unsigned char	sws_volume[16];
		__u32		padding[117];
		__u32		badpages[1];
	} info;
};

/*
 * current->reclaim_state points to one of these when a task is running
 * memory reclaim
 */
struct reclaim_state {
	/* pages reclaimed outside of LRU-based reclaim */
	unsigned long reclaimed;
#ifdef CONFIG_LRU_GEN
	/* per-thread mm walk data */
	struct lru_gen_mm_walk *mm_walk;
#endif
};

/*
 * mm_account_reclaimed_pages(): account reclaimed pages outside of LRU-based
 * reclaim
 * @pages: number of pages reclaimed
 *
 * If the current process is undergoing a reclaim operation, increment the
 * number of reclaimed pages by @pages.
 */
static inline void mm_account_reclaimed_pages(unsigned long pages)
{
	if (current->reclaim_state)
		current->reclaim_state->reclaimed += pages;
}
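
/*
 * Editor's sketch of a hypothetical caller (the real ones live in the
 * slab free paths): a driver cache that frees whole pages during reclaim
 * could credit them to vmscan like this. my_cache_shrink_pages() is a
 * made-up helper used purely for illustration:
 *
 *	unsigned long freed = my_cache_shrink_pages(nr_to_reclaim);
 *
 *	mm_account_reclaimed_pages(freed);
 *
 * The call is a no-op unless current->reclaim_state is set, i.e. unless
 * the task is currently inside memory reclaim.
 */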

#ifdef __KERNEL__

struct address_space;
struct sysinfo;
struct writeback_control;
struct zone;

/*
 * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
 * disk blocks. A rbtree of swap extents maps the entire swapfile (Where the
 * term `swapfile' refers to either a blockdevice or an IS_REG file). Apart
 * from setup, they're handled identically.
 *
 * We always assume that blocks are of size PAGE_SIZE.
 */
struct swap_extent {
	struct rb_node rb_node;
	pgoff_t start_page;
	pgoff_t nr_pages;
	sector_t start_block;
};

/*
 * Max bad pages in the new format..
 */
#define MAX_SWAP_BADPAGES \
	((offsetof(union swap_header, magic.magic) - \
	  offsetof(union swap_header, info.badpages)) / sizeof(int))
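
/*
 * Editor's worked example (not upstream): with a 4 KiB PAGE_SIZE,
 * offsetof(union swap_header, magic.magic) is 4096 - 10 = 4086 and
 * offsetof(union swap_header, info.badpages) is
 * 1024 + 3*4 + 16 + 16 + 117*4 = 1536, giving
 *
 *	MAX_SWAP_BADPAGES = (4086 - 1536) / sizeof(int) = 637
 *
 * bad-page slots between the info fields and the trailing magic.
 */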

enum {
	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
	SWP_DISCARDABLE = (1 << 2),	/* blkdev support discard */
	SWP_DISCARDING	= (1 << 3),	/* now discarding a free cluster */
	SWP_SOLIDSTATE	= (1 << 4),	/* blkdev seeks are cheap */
	SWP_CONTINUED	= (1 << 5),	/* swap_map has count continuation */
	SWP_BLKDEV	= (1 << 6),	/* its a block device */
	SWP_ACTIVATED	= (1 << 7),	/* set after swap_activate success */
	SWP_FS_OPS	= (1 << 8),	/* swapfile operations go through fs */
	SWP_AREA_DISCARD = (1 << 9),	/* single-time swap area discards */
	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
					/* add others here before... */
};

#define SWAP_CLUSTER_MAX 32UL
#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX

/* Bit flag in swap_map */
#define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED	0x80	/* Flag swap_map continuation for full count */

/* Special value in first swap_map */
#define SWAP_MAP_MAX	0x3e	/* Max count */
#define SWAP_MAP_BAD	0x3f	/* Note page is bad */
#define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs */

/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX	0x7f	/* Max count */
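
/*
 * Editor's illustration (not upstream): a swap_map byte combines a usage
 * count with the flags above, e.g.:
 *
 *	0x01 = one swap entry user, not in the swap cache
 *	0x41 = SWAP_HAS_CACHE | 1: one user, plus a swapcache page
 *	0xbe = COUNT_CONTINUED | SWAP_HAS_CACHE | SWAP_MAP_MAX: the count
 *	       overflowed into a continuation page, see
 *	       add_swap_count_continuation() below
 */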

/*
 * We use this to track usage of a cluster. A cluster is a block of swap disk
 * space SWAPFILE_CLUSTER pages long, naturally aligned on disk. All free
 * clusters are organized into a list. We fetch an entry from the list to
 * get a free cluster.
 *
 * The flags field determines if a cluster is free. This is
 * protected by cluster lock.
 */
struct swap_cluster_info {
	spinlock_t lock;	/*
				 * Protect swap_cluster_info fields
				 * other than list, and swap_info_struct->swap_map
				 * elements corresponding to the swap cluster.
				 */
	u16 count;
	u8 flags;
	u8 order;
	struct list_head list;
};

/* All on-list clusters must have a non-zero flag. */
enum swap_cluster_flags {
	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
	CLUSTER_FLAG_FREE,
	CLUSTER_FLAG_NONFULL,
	CLUSTER_FLAG_FRAG,
	/* Clusters with flags above are allocatable */
	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
	CLUSTER_FLAG_FULL,
	CLUSTER_FLAG_DISCARD,
	CLUSTER_FLAG_MAX,
};

/*
 * The first page in the swap file is the swap header, which is always marked
 * bad to prevent it from being allocated as an entry. This also prevents the
 * cluster to which it belongs being marked free. Therefore 0 is safe to use as
 * a sentinel to indicate an entry is not valid.
 */
#define SWAP_ENTRY_INVALID	0

#ifdef CONFIG_THP_SWAP
#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
#else
#define SWAP_NR_ORDERS		1
#endif

/*
 * We keep using the same cluster on rotational devices so that IO will be
 * sequential. The purpose is to optimize swap throughput on these devices.
 */
struct swap_sequential_cluster {
	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};

/*
 * The in-memory structure used to track swap areas.
 */
struct swap_info_struct {
	struct percpu_ref users;	/* indicate and keep swap device valid. */
	unsigned long	flags;		/* SWP_USED etc: see above */
	signed short	prio;		/* swap priority of this type */
	struct plist_node list;		/* entry in swap_active_head */
	signed char	type;		/* strange name for an index */
	unsigned int	max;		/* extent of the swap_map */
	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
	unsigned long *zeromap;		/* kvmalloc'ed bitmap to track zero pages */
	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
	struct list_head free_clusters; /* free clusters list */
	struct list_head full_clusters; /* full clusters list */
	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
					/* list of clusters that contain at least one free slot */
	struct list_head frag_clusters[SWAP_NR_ORDERS];
					/* list of clusters that are fragmented or contended */
	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
	unsigned int pages;		/* total of usable pages of swap */
	atomic_long_t inuse_pages;	/* number of those currently in use */
	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
	struct block_device *bdev;	/* swap device or bdev of swap file */
	struct file *swap_file;		/* seldom referenced */
	struct completion comp;		/* seldom referenced */
	spinlock_t lock;		/*
					 * protect map scan related fields like
					 * swap_map, lowest_bit, highest_bit,
					 * inuse_pages, cluster_next,
					 * cluster_nr, lowest_alloc,
					 * highest_alloc, free/discard cluster
					 * list. other fields are only changed
					 * at swapon/swapoff, so are protected
					 * by swap_lock. changing flags need
					 * hold this lock and swap_lock. If
					 * both locks need hold, hold swap_lock
					 * first.
					 */
	spinlock_t cont_lock;		/*
					 * protect swap count continuation page
					 * list.
					 */
	struct work_struct discard_work; /* discard worker */
	struct work_struct reclaim_work; /* reclaim worker */
	struct list_head discard_clusters; /* discard clusters list */
	struct plist_node avail_lists[]; /*
					  * entries in swap_avail_heads, one
					  * entry per node.
					  * Must be last as the number of the
					  * array is nr_node_ids, which is not
					  * a fixed value so have to allocate
					  * dynamically.
					  * And it has to be an array so that
					  * plist_for_each_* can work.
					  */
};

static inline swp_entry_t page_swap_entry(struct page *page)
{
	struct folio *folio = page_folio(page);
	swp_entry_t entry = folio->swap;

	entry.val += folio_page_idx(folio, page);
	return entry;
}
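
/*
 * Editor's sketch (illustrative only): given any subpage of a swapcache
 * folio, this yields that subpage's own swap slot, e.g.:
 *
 *	swp_entry_t entry = page_swap_entry(page);
 *	pgoff_t slot = swp_offset(entry);	// offset of this subpage
 *
 * swp_offset() comes from linux/swapops.h, not from this header.
 */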

/* linux/mm/workingset.c */
bool workingset_test_recent(void *shadow, bool file, bool *workingset,
				bool flush);
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
void workingset_refault(struct folio *folio, void *shadow);
void workingset_activation(struct folio *folio);

/* linux/mm/page_alloc.c */
extern unsigned long totalreserve_pages;

/* Definition of global_zone_page_state not available yet */
#define nr_free_pages() global_zone_page_state(NR_FREE_PAGES)

/* linux/mm/swap.c */
void lru_note_cost(struct lruvec *lruvec, bool file,
		   unsigned int nr_io, unsigned int nr_rotated);
void lru_note_cost_refault(struct folio *);
void folio_add_lru(struct folio *);
void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
void mark_page_accessed(struct page *);
void folio_mark_accessed(struct folio *);

extern atomic_t lru_disable_count;

static inline bool lru_cache_disabled(void)
{
	return atomic_read(&lru_disable_count);
}

static inline void lru_cache_enable(void)
{
	atomic_dec(&lru_disable_count);
}

extern void lru_cache_disable(void);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_cpu_zone(struct zone *zone);
extern void lru_add_drain_all(void);
void folio_deactivate(struct folio *folio);
void folio_mark_lazyfree(struct folio *folio);
extern void swap_setup(void);

/* linux/mm/vmscan.c */
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
					gfp_t gfp_mask, nodemask_t *mask);

#define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
#define MEMCG_RECLAIM_PROACTIVE (1 << 2)
#define MIN_SWAPPINESS 0
#define MAX_SWAPPINESS 200
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
						  unsigned long nr_pages,
						  gfp_t gfp_mask,
						  unsigned int reclaim_options,
						  int *swappiness);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
						gfp_t gfp_mask, bool noswap,
						pg_data_t *pgdat,
						unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
long remove_mapping(struct address_space *mapping, struct folio *folio);

#ifdef CONFIG_NUMA
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
#endif

void check_move_unevictable_folios(struct folio_batch *fbatch);

extern void __meminit kswapd_run(int nid);
extern void __meminit kswapd_stop(int nid);

#ifdef CONFIG_SWAP

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
		unsigned long nr_pages, sector_t start_block);
int generic_swapfile_activate(struct swap_info_struct *, struct file *,
		sector_t *);

static inline unsigned long total_swapcache_pages(void)
{
	return global_node_page_state(NR_SWAPCACHE);
}

void free_swap_cache(struct folio *folio);
void free_page_and_swap_cache(struct page *);
void free_pages_and_swap_cache(struct encoded_page **, int);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
extern atomic_t nr_rotate_swap;

/* Swap 50% full? Release swapcache more aggressively.. */
static inline bool vm_swap_full(void)
{
	return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
}
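
/*
 * Editor's note (illustrative): nr_swap_pages counts *free* slots, so a
 * swap device of 8 GiB with 3 GiB still free gives 3 * 2 < 8, i.e.
 * vm_swap_full() is true and swapcache is released more aggressively.
 */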

static inline long get_nr_swap_pages(void)
{
	return atomic_long_read(&nr_swap_pages);
}

extern void si_swapinfo(struct sysinfo *);
int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t, int);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
struct swap_info_struct *swp_swap_info(swp_entry_t entry);
struct backing_dev_info;
extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
extern void exit_swap_address_space(unsigned int type);
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);

static inline void put_swap_device(struct swap_info_struct *si)
{
	percpu_ref_put(&si->users);
}
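
/*
 * Editor's sketch of the reference pattern (illustrative only): a reader
 * that needs the swap device behind an entry pins it against concurrent
 * swapoff and drops the pin when done:
 *
 *	struct swap_info_struct *si = get_swap_device(entry);
 *
 *	if (si) {
 *		// ... use entry's device safely ...
 *		put_swap_device(si);
 *	}
 *
 * get_swap_device() returns NULL for a stale entry or a device that is
 * going away, so the NULL check is required.
 */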

#else /* CONFIG_SWAP */
static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
{
	return NULL;
}

static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	return NULL;
}

static inline void put_swap_device(struct swap_info_struct *si)
{
}

#define get_nr_swap_pages()			0L
#define total_swap_pages			0L
#define total_swapcache_pages()			0UL
#define vm_swap_full()				0

#define si_swapinfo(val) \
	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
/* only sparc can not include linux/pagemap.h in this file
 * so leave put_page and release_pages undeclared... */
#define free_page_and_swap_cache(page) \
	put_page(page)
#define free_pages_and_swap_cache(pages, nr) \
	release_pages((pages), (nr));

static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
}

static inline void free_swap_cache(struct folio *folio)
{
}

static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
{
	return 0;
}

static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
{
}

static inline int swap_duplicate(swp_entry_t swp)
{
	return 0;
}

static inline int swapcache_prepare(swp_entry_t swp, int nr)
{
	return 0;
}

static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}

static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}

static inline int __swap_count(swp_entry_t entry)
{
	return 0;
}

static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
{
	return false;
}

static inline int swp_swapcount(swp_entry_t entry)
{
	return 0;
}

static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
{
	return -EINVAL;
}

static inline bool folio_free_swap(struct folio *folio)
{
	return false;
}

static inline int add_swap_extent(struct swap_info_struct *sis,
				  unsigned long start_page,
				  unsigned long nr_pages, sector_t start_block)
{
	return -EINVAL;
}
#endif /* CONFIG_SWAP */

static inline void free_swap_and_cache(swp_entry_t entry)
{
	free_swap_and_cache_nr(entry, 1);
}

static inline void swap_free(swp_entry_t entry)
{
	swap_free_nr(entry, 1);
}

#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
	/* Cgroup2 doesn't have per-cgroup swappiness */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return READ_ONCE(vm_swappiness);

	/* root ? */
	if (mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
		return READ_ONCE(vm_swappiness);

	return READ_ONCE(memcg->swappiness);
}
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
	return READ_ONCE(vm_swappiness);
}
#endif

#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp);
static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
{
	if (mem_cgroup_disabled())
		return;
	__folio_throttle_swaprate(folio, gfp);
}
#else
static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
{
}
#endif

#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry);
static inline int mem_cgroup_try_charge_swap(struct folio *folio,
		swp_entry_t entry)
{
	if (mem_cgroup_disabled())
		return 0;
	return __mem_cgroup_try_charge_swap(folio, entry);
}

extern void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
{
	if (mem_cgroup_disabled())
		return;
	__mem_cgroup_uncharge_swap(entry, nr_pages);
}

extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
extern bool mem_cgroup_swap_full(struct folio *folio);
#else
static inline int mem_cgroup_try_charge_swap(struct folio *folio,
					     swp_entry_t entry)
{
	return 0;
}

static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
					    unsigned int nr_pages)
{
}

static inline long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
	return get_nr_swap_pages();
}

static inline bool mem_cgroup_swap_full(struct folio *folio)
{
	return vm_swap_full();
}
#endif

#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */