Mirror of https://github.com/torvalds/linux.git (synced 2025-10-30 08:08:27 +02:00; 449 commits)
da274853fe
cpu: Remove obsolete comment from takedown_cpu()
takedown_cpu() has a comment about "all preempt/rcu users must observe !cpu_active()" which is kind of meaningless in this function. This comment was originally introduced by commit
19c24f7ee3
cpu: Define attack vectors
Define 4 new attack vectors that are used for controlling CPU speculation
mitigations. These may be individually disabled as part of the
mitigations= command line. Attack vector controls are combined with global
options such as 'auto' or 'auto,nosmt', e.g. 'mitigations=auto,no_user_kernel'.
The global options come first in the mitigations= string.
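As an illustration, a full command line entry could look like this (the particular combination chosen here is hypothetical; the option names are the ones given in this commit message):

    mitigations=auto,no_user_kernel,no_cross_thread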
Cross-thread mitigations can either remain fully enabled, including
potentially disabling SMT ('auto,nosmt'), remain enabled except for
disabling SMT ('auto'), or be disabled entirely through the new
'no_cross_thread' attack vector option.
The default settings for these attack vectors are consistent with existing
kernel defaults, other than the automatic disabling of VM-based attack
vectors if KVM support is not present.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-3-david.kaplan@amd.com
f400565faa
perf: Remove too early and redundant CPU hotplug handling
The CPU hotplug handlers are called twice: at the prepare and at the online stage. Their role is to:
1) Enable/disable a CPU context. This is irrelevant and even buggy at the prepare stage because the CPU is still offline. On early secondary CPU up, creating an event attached to that CPU might silently fail because the CPU context is observed as online but the context installation's IPI failure is ignored.
2) Update the scope cpumasks and re-migrate the events accordingly in the CPU down case. This is irrelevant at the prepare stage.
3) Remove the events attached to the context of the offlining CPU. It even uses an (unnecessary) IPI for it. This is also irrelevant at the prepare stage.
Also none of the *_PREPARE and *_STARTING architecture perf-related CPU hotplug callbacks rely on CPUHP_PERF_PREPARE. CPUHP_AP_PERF_ONLINE is enough and the right place to perform the work.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-4-frederic@kernel.org
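For context, a minimal sketch of registering hotplug callbacks at the online stage via the cpuhp API; the state name and callbacks are illustrative, not perf's actual ones:

    #include <linux/cpuhotplug.h>

    static int demo_cpu_online(unsigned int cpu)
    {
            /* Runs once the hotplugged CPU is fully online. */
            return 0;
    }

    static int demo_cpu_offline(unsigned int cpu)
    {
            /* Runs while the CPU is going down, before it is dead. */
            return 0;
    }

    static int __init demo_init(void)
    {
            /*
             * CPUHP_AP_ONLINE_DYN allocates a dynamic state in the online
             * section, so neither callback ever observes a not-yet-online
             * CPU, which is the pitfall described above for the prepare
             * stage.
             */
            int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "demo:online",
                                        demo_cpu_online, demo_cpu_offline);
            return ret < 0 ? ret : 0;
    }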
a5b3d8660b
hyperv-next for 6.15
Merge tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Add support for running as the root partition in Hyper-V (Microsoft Hypervisor) by exposing /dev/mshv (Nuno and various people)
- Add support for CPU offlining in Hyper-V (Hamza Mahfooz)
- Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael Kelley, Thorsten Blum)
* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: fix an indentation issue in mshyperv.h
x86/hyperv: Add comments about hv_vpset and var size hypercall input args
Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
hyperv: Add definitions for root partition driver to hv headers
x86: hyperv: Add mshv_handler() irq handler and setup function
Drivers: hv: Introduce per-cpu event ring tail
Drivers: hv: Export some functions for use by root partition module
acpi: numa: Export node_to_pxm()
hyperv: Introduce hv_recommend_using_aeoi()
arm64/hyperv: Add some missing functions to arm64
x86/mshyperv: Add support for extended Hyper-V features
hyperv: Log hypercall status codes as strings
x86/hyperv: Fix check of return value from snp_set_vmsa()
x86/hyperv: Add VTL mode callback for restarting the system
x86/hyperv: Add VTL mode emergency restart callback
hyperv: Remove unused union and structs
hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
hyperv: Change hv_root_partition into a function
hyperv: Convert hypercall statuses to linux error codes
drivers/hv: add CPU offlining support
...
d6834d9c99
watchdog/hardlockup/perf: Fix perf_event memory leak
During stress-testing, we found a kmemleak report for perf_event:
unreferenced object 0xff110001410a33e0 (size 1328):
comm "kworker/4:11", pid 288, jiffies 4294916004
hex dump (first 32 bytes):
b8 be c2 3b 02 00 11 ff 22 01 00 00 00 00 ad de ...;....".......
f0 33 0a 41 01 00 11 ff f0 33 0a 41 01 00 11 ff .3.A.....3.A....
backtrace (crc 24eb7b3a):
[<00000000e211b653>] kmem_cache_alloc_node_noprof+0x269/0x2e0
[<000000009d0985fa>] perf_event_alloc+0x5f/0xcf0
[<00000000084ad4a2>] perf_event_create_kernel_counter+0x38/0x1b0
[<00000000fde96401>] hardlockup_detector_event_create+0x50/0xe0
[<0000000051183158>] watchdog_hardlockup_enable+0x17/0x70
[<00000000ac89727f>] softlockup_start_fn+0x15/0x40
...
Our stress test includes CPU online and offline cycles, and updating the
watchdog configuration.
After reading the code, I found that there may be a race between cleaning up
the perf_event after updating the watchdog and disabling the event when the CPU goes offline:
CPU0                           CPU1                          CPU2
(update watchdog)                                            (hotplug offline CPU1)
...                                                          _cpu_down(CPU1)
cpus_read_lock()                                             // waiting for cpu lock
  softlockup_start_all
    smp_call_on_cpu(CPU1)
                               softlockup_start_fn
                               ...
                               watchdog_hardlockup_enable(CPU1)
                                 perf create E1
                                 watchdog_ev[CPU1] = E1
cpus_read_unlock()
                                                             cpus_write_lock()
                                                             cpuhp_kick_ap_work(CPU1)
                               cpuhp_thread_fun
                               ...
                               watchdog_hardlockup_disable(CPU1)
                                 watchdog_ev[CPU1] = NULL
                                 dead_event[CPU1] = E1
__lockup_detector_cleanup
  for each dead_events_mask
    release each dead_event
      /*
       * CPU1 has not been added to
       * dead_events_mask, then E1
       * will not be released
       */
                                 CPU1 -> dead_events_mask
cpumask_clear(&dead_events_mask)
// dead_events_mask is cleared, E1 is leaked
In this case, the leaked perf_event E1 matches the perf_event leak
reported by kmemleak. Due to the low probability of problem recurrence
(only reported once), I added some hack delays in the code:
static void __lockup_detector_reconfigure(void)
{
        ...
        watchdog_hardlockup_start();
        cpus_read_unlock();
+       mdelay(100);
        /*
         * Must be called outside the cpus locked section to prevent
         * recursive locking in the perf code.
        ...
}

void watchdog_hardlockup_disable(unsigned int cpu)
{
        ...
        perf_event_disable(event);
        this_cpu_write(watchdog_ev, NULL);
        this_cpu_write(dead_event, event);
+       mdelay(100);
        cpumask_set_cpu(smp_processor_id(), &dead_events_mask);
        atomic_dec(&watchdog_cpus);
        ...
}

void hardlockup_detector_perf_cleanup(void)
{
        ...
                perf_event_release_kernel(event);
                per_cpu(dead_event, cpu) = NULL;
        }
+       mdelay(100);
        cpumask_clear(&dead_events_mask);
}
Then, by simultaneously performing CPU on/off cycles and switching the
watchdog, the leak becomes almost certain to reproduce.
The problem here is that releasing perf_event is not within the CPU
hotplug read-write lock. Commit:
7c0db8a4f5
cpu: export lockdep_assert_cpus_held()
If CONFIG_HYPERV=m, lockdep_assert_cpus_held() is undefined for HyperV. So, export the function so that GPL drivers can use it more broadly.
Cc: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250117203309.192072-1-hamzamahfooz@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250117203309.192072-1-hamzamahfooz@linux.microsoft.com>
9c5968db9e
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many individual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages, so that callers (i.e. slab) can avoid a refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleanups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic
- "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload, as was a 35% reduction in kernel build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing, to permit userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
5fb4088624
bitmap patches for v6.14.
Merge tag 'bitmap-for-6.14' of https://github.com:/norov/linux
Pull bitmap updates from Yury Norov:
"This includes the const_true() series from Vincent Mailhol, another __always_inline rework from Nathan Chancellor for RISCV, and a couple of random fixes from Dr. David Alan Gilbert and I Hsin Cheng"
* tag 'bitmap-for-6.14' of https://github.com:/norov/linux:
cpumask: Rephrase comments for cpumask_any*() APIs
cpu: Remove unused init_cpu_online
riscv: Always inline bitops
linux/bits.h: simplify GENMASK_INPUT_CHECK()
compiler.h: add const_true()
2f8dea1692
hrtimers: Handle CPU state correctly on hotplug
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:
Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, CFS for instance incorrectly assumes that the hrtick is already
active, and the chance for the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a state lower
than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue: cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Aside from that, the bulk of the per-CPU state is not reset either, which
means there are dangling pointers in the worst case.
Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.
[ tglx: Make the new callback unconditionally available, remove the online
modification in the prepare() callback and clear the remaining
state in the starting callback instead of the prepare callback ]
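A sketch of the shape of such a startup() callback, with the per-CPU base variable and callback name assumed from context rather than quoted from the patch:

    static int hrtimers_cpu_starting(unsigned int cpu)
    {
            struct hrtimer_cpu_base *cpu_base = per_cpu_ptr(&hrtimer_bases, cpu);

            /* Reset the stale per-CPU state left by an aborted unplug ... */
            cpu_base->online = 1;
            return 0;
    }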
Fixes:
21641bd9a7
lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN
CPU unplug first calls __cpu_disable(), and that's where powerpc calls cleanup_cpu_mmu_context(), which clears this CPU from the mm_cpumask() of all mms in the system. However this CPU may still be using a lazy tlb mm, and the CPU's bit will have been cleared from that mm's mm_cpumask. The CPU does not switch away from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit().
If that user mm exits in this window, it will not be subject to the lazy tlb mm shootdown and may be freed while in use as a lazy mm by the CPU that is being unplugged.
cleanup_cpu_mmu_context() could be moved later, but it looks better to move the lazy tlb mm switching earlier. The problem with doing the lazy mm switching in idle_task_exit() is explained in commit
7f15d4abf9
cpu: Remove unused init_cpu_online
The last use of init_cpu_online() was removed by the
commit
55cb93fd24
Driver core changes for 6.13-rc1
There are two simple merge conflicts with Linus' tree, here just to make
life interesting. One is where a file is removed (easy enough to resolve);
the second is a build-time error found in linux-next, whose fix can be seen
here:
https://lore.kernel.org/r/20241107212645.41252436@canb.auug.org.au
Other than that, the changes here have been in linux-next with no other
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small set of driver core changes for 6.13-rc1.
Nothing major for this merge cycle, except for the two simple merge
conflicts are here just to make life interesting.
Included in here are:
- sysfs core changes and preparations for more sysfs api cleanups
that can come through all driver trees after -rc1 is out
- fw_devlink fixes based on many reports and debugging sessions
- list_for_each_reverse() removal, no one was using it!
- last-minute seq_printf() format string bug found and fixed in many
drivers all at once
- minor bugfixes and changes; full details in the shortlog"
* tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (35 commits)
Fix a potential abuse of seq_printf() format string in drivers
cpu: Remove spurious NULL in attribute_group definition
s390/con3215: Remove spurious NULL in attribute_group definition
perf: arm-ni: Remove spurious NULL in attribute_group definition
driver core: Constify bin_attribute definitions
sysfs: attribute_group: allow registration of const bin_attribute
firmware_loader: Fix possible resource leak in fw_log_firmware_info()
drivers: core: fw_devlink: Fix excess parameter description in docstring
driver core: class: Correct WARN() message in APIs class_(for_each|find)_device()
cacheinfo: Use of_property_present() for non-boolean properties
cdx: Fix cdx_mmap_resource() after constifying attr in ->mmap()
drivers: core: fw_devlink: Make the error message a bit more useful
phy: tegra: xusb: Set fwnode for xusb port devices
drm: display: Set fwnode for aux bus devices
driver core: fw_devlink: Stop trying to optimize cycle detection logic
driver core: Constify attribute arguments of binary attributes
sysfs: bin_attribute: add const read/write callback variants
sysfs: implement all BIN_ATTR_* macros in terms of __BIN_ATTR()
sysfs: treewide: constify attribute callback of bin_attribute::llseek()
sysfs: treewide: constify attribute callback of bin_attribute::mmap()
...
bf9aa14fc5
Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires throttling the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long-standing non-trivial problem due to
the lock order of the posix-timer lock and the sighand lock, along
with lifetime issues, as the timer and the sigqueue have different
lifetime rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
lifetime rules. Aside from that, this also avoids the lookup of
the timer in the signal delivery and rearm path, as it's just an
always-valid container_of() now.
- Queuing ignored timer signals onto a separate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeuing the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required consolidating the signal delivery rules so they
are consistent across all situations. With that, all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
separate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK_TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access, similar to the way the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure, this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows providing time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
separately setting the callback function, for historical reasons.
That's an extra unnecessary step and makes Rust support less
straightforward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over (see the sketch after this list).
The bulk of the hrtimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
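To make the hrtimer_setup*() consolidation above concrete, the conversion has this before/after shape (the struct and callback names are illustrative):

    /* Historical two-step initialization: */
    hrtimer_init(&foo->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    foo->timer.function = foo_timer_fn;

    /* Consolidated setup: */
    hrtimer_setup(&foo->timer, foo_timer_fn, CLOCK_MONOTONIC,
                  HRTIMER_MODE_REL);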
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
e7240bd91f
cpu: Remove spurious NULL in attribute_group definition
This NULL value is most likely a copy-paste error from an array definition. The NULL doesn't have any effect.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20241118-sysfs-const-attribute_group-fixes-v1-3-48e0b0ad8cba@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
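For context, a minimal illustration of where the sentinel does belong (identifiers are hypothetical, not the actual ones touched by this patch): a NULL terminates an array of group pointers, not a single attribute_group definition:

    static const struct attribute_group demo_attr_group = {
            .attrs = demo_attrs,    /* no NULL member needed here */
    };

    /* The NULL sentinel belongs in the group pointer array: */
    static const struct attribute_group *demo_attr_groups[] = {
            &demo_attr_group,
            NULL,
    };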
3b1596a21f
clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING
The way the clockevent devices are finally stopped while a CPU is offlining is currently chaotic. By order, the layout is:
1) tick_sched_timer_dying() stops the tick and the underlying clockevent, but only for the oneshot case. The periodic tick and its related clockevent still run.
2) tick_broadcast_offline() detaches and stops the per-cpu oneshot broadcast and appends it to the released list.
3) Some individual clockevent drivers stop the clockevents (a second time if the tick is oneshot).
4) Once the CPU is dead, a control CPU remotely detaches and stops (a third time if in oneshot mode) the CPU clockevent and adds it to the released list.
5) The released list containing the broadcast device released on step 2) and the remotely detached clockevent from step 4) are unregistered.
These random events can be factorized if the current clockevent is detached and stopped by the dying CPU at the generic layer, that is, from the dying CPU:
a) Stop the tick
b) Stop/detach the underlying per-cpu oneshot broadcast clockevent
c) Stop/detach the underlying clockevent
d) Release / unregister the clockevents from b) and c)
e) Release / unregister the remaining clockevents from the dying CPU. This part could be performed by the dying CPU.
This way the drivers and the tick layer don't need to care about clockevent operations during cpu hotplug down. This also unifies the tick behaviour on offline CPUs between oneshot and periodic modes, avoiding offline ticks altogether for sanity.
Adopt the simplification.
[ tglx: Remove the WARN_ON() in clockevents_register_device() as that is called from an upcoming CPU before the CPU is marked online ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241029125451.54574-3-frederic@kernel.org
0784181b44
lockdep: Add lockdep_cleanup_dead_cpu()
Add a function to check that an offline CPU has left the tracing
infrastructure in a sane state.
Commit
9ea925c806
Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Core:
- Overhaul of posix-timers in preparation for removing the
workaround for periodic timers which have signal delivery ignored.
- Remove the historical extra jiffy in msleep() (see the sketch
after this list)
msleep() adds an extra jiffy to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep time
since the large rewrite to a non-cascading wheel, but the extra
jiffy in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and neither reflects reality
nor conforms to the man page. Show the correct 0 slack for real
time tasks and enforce it at the core level instead of having
inconsistent individual checks in various timer setup functions.
- The usual set of updates and enhancements all over the place.
Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers"
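To make the msleep() item above concrete, the historical implementation had roughly this shape (a sketch, not a verbatim quote of the old code):

    void msleep(unsigned int msecs)
    {
            /* The "+ 1" is the historical extra jiffy being removed: */
            unsigned long timeout = msecs_to_jiffies(msecs) + 1;

            while (timeout)
                    timeout = schedule_timeout_uninterruptible(timeout);
    }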
* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
ntp: Make sure RTC is synchronized when time goes backwards
treewide: Fix wrong singular form of jiffies in comments
cpu: Use already existing usleep_range()
timers: Rename next_expiry_recalc() to be unique
platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
clocksource/drivers/jcore: Use request_percpu_irq()
clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
clocksource: acpi_pm: Add external callback for suspend/resume
clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
dt-bindings: timer: rockchip: Add rk3576 compatible
timers: Annotate possible non critical data race of next_expiry
timers: Remove historical extra jiffie for timeout in msleep()
hrtimer: Use and report correct timerslack values for realtime tasks
hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
timers: Add sparse annotation for timer_sync_wait_running().
signal: Replace BUG_ON()s
...
2f7eedca6c
Merge branch 'linus' into timers/core
To update with the latest fixes.
662a1bfb90
cpu: Use already existing usleep_range()
usleep_range() is a wrapper around usleep_range_state() which hands in TASK_UNINTERRUPTIBLE as the state argument. Use the already existing wrapper usleep_range(). No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-2-e98760256370@linutronix.de
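The change amounts to the following (a sketch; min_us/max_us stand in for the caller's range):

    /* Before: open-coded call spelling out the default state: */
    usleep_range_state(min_us, max_us, TASK_UNINTERRUPTIBLE);

    /* After: the existing wrapper hands in TASK_UNINTERRUPTIBLE itself: */
    usleep_range(min_us, max_us);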
8db70faeab
cpu: Fix W=1 build kernel-doc warning
Building the kernel with W=1 generates the following warning:
kernel/cpu.c:2693: warning: This comment starts with '/**', but isn't a kernel-doc comment.
The function topology_is_core_online() is a simple helper function and doesn't need a kernel-doc comment. Use a normal comment instead.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240825221152.71951-2-thorsten.blum@toblux.com
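In other words (illustrative, not the actual kernel/cpu.c hunk):

    /**
     * This opener marks a kernel-doc comment; W=1 warns when the body
     * doesn't follow kernel-doc syntax.
     */

    /* A normal comment is fine for a simple helper. */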
eb876ea724
Merge branch 'linus' into smp/core
Pull in upstream changes so further patches don't conflict.
6c17ea1f3e
cpu/SMT: Enable SMT only if a core is online
If a core is offline, then enabling SMT should not online the CPUs of
that core. By enabling SMT, what is intended is either changing the SMT
value from "off" to "on" or setting the SMT level (threads per core) from a
lower to a higher value.
On PowerPC the ppc64_cpu utility can be used, among other things, to
perform the following functions:
ppc64_cpu --cores-on # Get the number of online cores
ppc64_cpu --cores-on=X # Put exactly X cores online
ppc64_cpu --offline-cores=X[,Y,...] # Put specified cores offline
ppc64_cpu --smt={on|off|value} # Enable, disable or change SMT level
If the user has decided to offline certain cores, enabling SMT should
not online CPUs in those cores. This patch fixes the issue and changes
the behaviour as described, by introducing an arch specific function
topology_is_core_online(). It is currently implemented only for PowerPC.
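A sketch of what the generic fallback for such an arch hook typically looks like (an assumed shape; PowerPC supplies the real per-arch implementation):

    #ifndef topology_is_core_online
    static inline bool topology_is_core_online(unsigned int cpu)
    {
            /* Default: treat the core as online, keeping old behaviour. */
            return true;
    }
    #endif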
Fixes:
2dce993165
cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup()
CONFIG_HOTPLUG_PARALLEL expects the architecture to implement arch_cpuhp_init_parallel_bringup() to decide whether parallel hotplug is possible and to do the necessary architecture specific initialization.
There are architectures which can enable it unconditionally and do not require architecture specific initialization. Provide a weak fallback for arch_cpuhp_init_parallel_bringup() so that such architectures are not forced to implement empty stub functions.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-2-af59b3bb35c8@flygoat.com
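The weak-fallback pattern looks like this (the bool return type and default value are assumed from the description above):

    /*
     * Generic default; an architecture overrides it by providing a
     * non-weak definition of the same function.
     */
    bool __weak arch_cpuhp_init_parallel_bringup(void)
    {
            return true;    /* no arch specific initialization required */
    }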
c0e81a455e
cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT
Provide stub functions for SMT related parallel bring up functions so that HOTPLUG_PARALLEL can work without HOTPLUG_SMT.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-1-af59b3bb35c8@flygoat.com
98896d8795
Merge tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 confidential computing updates from Borislav Petkov:
"Unrelated x86/cc changes queued here to avoid ugly cross-merges and
conflicts:
- Carve out CPU hotplug function declarations into a separate header
with the goal to be able to use the lockdep assertions in a more
flexible manner
- As a result, refactor cacheinfo code after carving out a function
to return the cache ID associated with a given cache level
- Cleanups
Add support to be able to kexec TDX guests:
- Expand ACPI MADT CPU offlining support
- Add machinery to prepare CoCo guests memory before kexec-ing into a
new kernel
- Cleanup, readjust and massage related code"
* tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed
x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
x86/mm: Introduce kernel_ident_mapping_free()
x86/smp: Add smp_ops.stop_this_cpu() callback
x86/acpi: Do not attempt to bring up secondary CPUs in the kexec case
x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structure
x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump
x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
x86/tdx: Convert shared memory back to private on kexec
x86/mm: Add callbacks to prepare encrypted memory for kexec
x86/tdx: Account shared memory
x86/mm: Return correct level from lookup_address() if pte is none
x86/mm: Make x86_platform.guest.enc_status_change_*() return an error
x86/kexec: Keep CR4.MCE set during kexec for TDX guest
x86/relocate_kernel: Use named labels for less confusion
cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
cpu/hotplug: Add support for declaring CPU offlining not supported
x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
x86/acpi: Extract ACPI MADT wakeup code into a separate file
x86/kexec: Remove spurious unconditional JMP from identity_mapped()
...
c89d780cc1
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The biggest part is the virtual CPU hotplug that touches ACPI,
irqchip. We also have some GICv3 optimisation for pseudo-NMIs that has
been queued via the arm64 tree. Otherwise the usual perf updates,
kselftest, various small cleanups.
Core:
- Virtual CPU hotplug support for arm64 ACPI systems
- cpufeature infrastructure cleanups and making the FEAT_ECBHB ID
bits visible to guests
- CPU errata: expand the speculative SSBS workaround to more CPUs
- GICv3, use compile-time PMR values: optimise the way regular IRQs
are masked/unmasked when GICv3 pseudo-NMIs are used, removing the
need for a static key in fast paths by using a priority value
chosen dynamically at boot time
ACPI:
- 'acpi=nospcr' option to disable SPCR as default console for arm64
- Move some ACPI code (cpuidle, FFH) to drivers/acpi/arm64/
Perf updates:
- Rework of the IMX PMU driver to enable support for I.MX95
- Enable support for tertiary match groups in the CMN PMU driver
- Initial refactoring of the CPU PMU code to prepare for the fixed
instruction counter introduced by Arm v9.4
- Add missing PMU driver MODULE_DESCRIPTION() strings
- Hook up DT compatibles for recent CPU PMUs
Kselftest updates:
- Kernel mode NEON fp-stress
- Cleanups, spelling mistakes
Miscellaneous:
- arm64 Documentation update with a minor clarification on TBI
- Fix missing IPI statistics
- Implement raw_smp_processor_id() using thread_info rather than a
per-CPU variable (better code generation; see the sketch after this entry)
- Make MTE checking of in-kernel asynchronous tag faults conditional
on KASAN being enabled
- Minor cleanups, typos"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (69 commits)
selftests: arm64: tags: remove the result script
selftests: arm64: tags_test: conform test to TAP output
perf: add missing MODULE_DESCRIPTION() macros
arm64: smp: Fix missing IPI statistics
irqchip/gic-v3: Fix 'broken_rdists' unused warning when !SMP and !ACPI
ACPI: Add acpi=nospcr to disable ACPI SPCR as default console on ARM64
Documentation: arm64: Update memory.rst for TBI
arm64/cpufeature: Replace custom macros with fields from ID_AA64PFR0_EL1
KVM: arm64: Replace custom macros with fields from ID_AA64PFR0_EL1
perf: arm_pmuv3: Include asm/arm_pmuv3.h from linux/perf/arm_pmuv3.h
perf: arm_v6/7_pmu: Drop non-DT probe support
perf/arm: Move 32-bit PMU drivers to drivers/perf/
perf: arm_pmuv3: Drop unnecessary IS_ENABLED(CONFIG_ARM64) check
perf: arm_pmuv3: Avoid assigning fixed cycle counter with threshold
arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU
perf: imx_perf: add support for i.MX95 platform
perf: imx_perf: fix counter start and config sequence
perf: imx_perf: refactor driver for imx93
perf: imx_perf: let the driver manage the counter usage rather the user
perf: imx_perf: add macro definitions for parsing config attr
...
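On the raw_smp_processor_id() item above: a minimal sketch of the idea, assuming a 'cpu' field in thread_info (the members shown are illustrative; the actual arm64 layout and helpers may differ):

    /*
     * Sketch: keep the CPU number in thread_info so reading it is a plain
     * load off the current task pointer, instead of a per-CPU variable
     * access that needs the per-CPU offset register.
     */
    struct thread_info {
            unsigned long   flags;  /* illustrative member */
            u32             cpu;    /* updated when the task migrates */
    };

    #define raw_smp_processor_id() (current_thread_info()->cpu)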
|
||
|
|
0eff0491e7 |
A small set of SMP/CPU hotplug updates:
- Reverse the order of iteration when freezing secondary CPUs for
hibernation.
This avoids drivers such as the Intel uncore performance counter having
to transfer the assignment of handling the per-package uncore events
for every CPU in a package, which is a considerable speedup on larger
systems.
- Add a missing destroy_work_on_stack() invocation in smp_call_on_cpu()
to prevent debug objects from emitting a false positive warning when the
stack is freed.
- Small cleanups in comments and a str_plural() conversion
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaUNnYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocjbEACqfc8EkTEJczAMbAaq7Nq7jYGRt5e8
UyWpbnT1re63TAiGUEAr7pPJd6ucXAwPOHBt0V+j1HIr2UsaGT2/cfnlSipR68xh
UmqPxXpuba+JeZDS0Bf6gRXKYBogVWFgjP8cp+f9IKr6xMgwJc7ujB9mX0YKmIgz
bDoaFS+NFSD1ZS7tuidLqfU9UmJYRDRhyZg124HwDXG20zHR2CHgP4QQ2bHOvxZU
LKbdjoHBmGtkLomS+R1UxsT+onsnE0c1EN37LX0mUE5L1YbUTcbZlXLLG7T35jOO
rvai+EKVbA2KUAtrM/LZ8WZ0Lt5DTjMrouyzv6of7N2WljijdlxMb04sXl7kdng+
rohRfDB6yNhQhEnDx6fd+IP/JCpPCctCmkN/QbvMBfnRnTe1yNg/gmnWaY3RAHNM
GBifAxSEosMn7AMnSs/or6DAVNHVxI3Ms+r4Xb/o1JvGx7PXiBULh1Nww5WdxmBl
IraXNR/R0qXUokXmNtJrq3aG/SepKWAc0BJJc3b6zA9tf5rsizTaxetTVZLS6jOX
/DHJOOgAlLRDZdE53YpdL3HVVTdM/BDSM0xpMO7uxdJ4laJ9s+7dFGk9KXxu24qM
6dIG6hn7D9XT0q05sP+r7qO0ygIe0Qorg5bA+xgvpYdPNrVfQpjzkj5jwkspvk+l
5/gpTUmVm8Vwhw==
=M4eq
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
"A small set of SMP/CPU hotplug updates:
- Reverse the order of iteration when freezing secondary CPUs for
hibernation.
This avoids drivers such as the Intel uncore performance counter
having to transfer the assignment of handling the per-package uncore
events for every CPU in a package, which is a considerable speedup
on larger systems.
- Add a missing destroy_work_on_stack() invocation in
smp_call_on_cpu() to prevent debug objects from emitting a false
positive warning when the stack is freed (see the sketch after this
entry).
- Small cleanups in comments and a str_plural() conversion"
* tag 'smp-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()
cpu/hotplug: Reverse order of iteration in freeze_secondary_cpus()
smp: Use str_plural() to fix Coccinelle warnings
cpu/hotplug: Fix typo in comment
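On the smp_call_on_cpu() item: a sketch of the destroy_work_on_stack() fix, assuming the structure names of kernel/smp.c and eliding its error handling:

    int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
    {
            struct smp_call_on_cpu_struct sscs = {
                    .done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
                    .func = func,
                    .data = par,
            };

            INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback);
            queue_work_on(cpu, system_wq, &sscs.work);
            wait_for_completion(&sscs.done);

            /*
             * The missing call: pairs with INIT_WORK_ONSTACK() so the debug
             * object is deactivated before the stack frame disappears.
             */
            destroy_work_on_stack(&sscs.work);

            return sscs.ret;
    }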
|
||
|
|
4e1a7df454 |
cpumask: Add enabled cpumask for present CPUs that can be brought online
The 'offline' file in sysfs shows all offline CPUs, including those that aren't present. User-space is expected to remove not-present CPUs from this list to learn which CPUs could be brought online. CPUs can be present but not-enabled. These CPUs can't be brought online until the firmware policy changes, which comes with an ACPI notification that will register the CPUs. With only the offline and present files, user-space is unable to determine which CPUs it can try to bring online. Add a new CPU mask that shows this based on all the registered CPUs. Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Tested-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com> Tested-by: Jianyong Wu <jianyong.wu@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-20-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> |
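A sketch of the new mask, assuming the __cpu_enabled_mask naming; the accessors in the merged patch may differ:

    /*
     * Set for every registered CPU, i.e. present *and* enabled by firmware.
     * Present-but-not-enabled CPUs stay clear until an ACPI notification
     * registers them.
     */
    struct cpumask __cpu_enabled_mask __read_mostly;

With this, user space can compute "could be brought online right now" as the CPUs listed in the new sysfs file minus those already online, instead of guessing from 'present' and 'offline'.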
||
|
|
6ef8eb5125 |
cpu: Fix broken cmdline "nosmp" and "maxcpus=0"
After the rework of "Parallel CPU bringup", the cmdline "nosmp" and "maxcpus=0" parameters no longer work. These parameters set setup_max_cpus to zero and that's handed to bringup_nonboot_cpus(). The code there does a decrement before checking for zero, which wraps the unsigned value around and brings up all CPUs. Add a zero check at the beginning of the function to prevent this. [ tglx: Massaged change log ] Fixes:
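A sketch of the fix, assuming the decrement-based bringup loop stays as-is:

    void __init bringup_nonboot_cpus(unsigned int max_cpus)
    {
            /*
             * "nosmp" / "maxcpus=0" hand in 0: bail out before the
             * decrement below wraps the unsigned value around and
             * brings up every CPU.
             */
            if (!max_cpus)
                    return;

            /* ... existing (parallel) bringup continues here ... */
    }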
||
|
|
66e48e491d |
cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
ACPI MADT doesn't allow a CPU to be offlined after it has been woken up. Currently, CPU hotplug is prevented based on the confidential computing attribute which is set for Intel TDX. But TDX is not the only possible user of the wakeup method. Any platform that uses the ACPI MADT wakeup method cannot offline CPUs. Disable CPU offlining on ACPI MADT wakeup enumeration. This has no visible effect for users: currently, the TDX guest is the only platform that uses the ACPI MADT wakeup method. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-5-kirill.shutemov@linux.intel.com
||
|
|
1037e4c53e |
cpu/hotplug: Add support for declaring CPU offlining not supported
The ACPI MADT mailbox wakeup method doesn't allow a CPU to be offlined after it has been woken up. Currently, offlining is prevented based on the confidential computing attribute which is set for Intel TDX. But TDX is not the only possible user of the wakeup method. The MADT wakeup can be implemented outside of a confidential computing environment. Offline support is a property of the wakeup method, not the CoCo implementation. Introduce cpu_hotplug_disable_offlining() that can be called to indicate that CPU offlining should be disabled. This function is going to replace CC_ATTR_HOTPLUG_DISABLED for the ACPI MADT wakeup method. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-4-kirill.shutemov@linux.intel.com
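A minimal sketch of the new interface, assuming the check lives in the cpu_down() path and rejects with -EOPNOTSUPP; the exact flag name and placement in the merged patch may differ:

    static bool cpu_hotplug_offline_disabled __ro_after_init;

    void __init cpu_hotplug_disable_offlining(void)
    {
            cpu_hotplug_offline_disabled = true;
    }

    /* ...checked wherever an offline is attempted: */
    static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
    {
            if (cpu_hotplug_offline_disabled)
                    return -EOPNOTSUPP;

            return _cpu_down(cpu, 0, target);   /* normal offlining path */
    }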
||
|
|
fde78e4673 |
cpu/hotplug: Reverse order of iteration in freeze_secondary_cpus()
Whenever CPU hotplug state callbacks are registered, the startup callback is invoked on CPUs that have already reached the provided state in order of ascending CPU IDs. In freeze_secondary_cpus() the CPUs are likewise torn down in ascending order. This is known to make a difference in the current implementation of these callbacks in arch/x86/events/intel/uncore.c: - uncore_event_cpu_online() designates the first CPU it is invoked for on each package as the uncore event collector for that package - uncore_event_cpu_offline() if the CPU being offlined is the event collector for its package, transfers that responsibility over to the next (by ascending CPU id) one in the same package With the current order of CPU teardowns in freeze_secondary_cpus(), the latter ends up doing the ownership transfer work on every single CPU. That work involves a synchronize_rcu() call, ultimately unnecessarily degrading the performance of CPU offlining. To address this, make freeze_secondary_cpus() iterate through the CPUs in reverse order, so that the teardown happens in order of descending CPU IDs. [ tglx: Massage change log ] Signed-off-by: Stanislav Spassov <stanspas@amazon.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240524160449.48594-1-stanspas@amazon.de
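A sketch of the reversed loop, not the verbatim patch; 'primary' and 'frozen_cpus' are the existing names in freeze_secondary_cpus():

    int cpu, error = 0;

    /*
     * Descending CPU ids: the uncore event collector (lowest id in each
     * package) goes down last, so the ownership transfer and its
     * synchronize_rcu() happen once per package rather than once per CPU.
     */
    for (cpu = nr_cpu_ids - 1; cpu >= 0; cpu--) {
            if (cpu == primary || !cpu_online(cpu))
                    continue;
            error = _cpu_down(cpu, 1, CPUHP_OFFLINE);
            if (error)
                    break;
            cpumask_set_cpu(cpu, frozen_cpus);
    }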
||
|
|
932d847639 |
cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked()
Commit |
||
|
|
de6fef50ea |
cgroup: Changes for v6.10
- The locking around cpuset hotplug processing has always been a bit of a
mess, which was worked around by making hotplug processing asynchronous. The
asynchronicity isn't great and led to other issues. We tried to make the
behavior synchronous a while ago but that led to lockdep splats. Waiman
took another stab at cleaning up and making it synchronous. The patch has
been in -next for well over a month and there haven't been any complaints,
so fingers crossed.
- Tracepoints added to help understand rstat lock contention.
- A bunch of minor changes - doc updates, code cleanups and selftests.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZkUrFA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfTyAQCwd0aNQOqaKRhJGtWYShqV/aYzurCy1Z2tB9/3
dkdy9gD7BHNk6kZQEbT97RrHPIduFansLtc76VziACibWBuomgg=
=2DNQ
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- The locking around cpuset hotplug processing has always been a bit
of a mess, which was worked around by making hotplug processing
asynchronous. The asynchronicity isn't great and led to other issues.
We tried to make the behavior synchronous a while ago but that led to
lockdep splats. Waiman took another stab at cleaning up and making it
synchronous. The patch has been in -next for well over a month and
there haven't been any complaints, so fingers crossed.
- Tracepoints added to help understand rstat lock contention.
- A bunch of minor changes - doc updates, code cleanups and selftests.
* tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
selftests/cgroup: Drop define _GNU_SOURCE
docs: cgroup-v1: Update page cache removal functions
selftests/cgroup: fix uninitialized variables in test_zswap.c
selftests/cgroup: cpu_hogger init: use {} instead of {NULL}
selftests/cgroup: fix clang warnings: uninitialized fd variable
selftests/cgroup: fix clang build failures for abs() calls
cgroup/cpuset: Remove outdated comment in sched_partition_write()
cgroup/cpuset: Fix incorrect top_cpuset flags
cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twice
cgroup/cpuset: Statically initialize more members of top_cpuset
cgroup: Avoid unnecessary looping in cgroup_no_v1()
cgroup, legacy_freezer: update comment for freezer_css_offline()
docs, cgroup: add entries for pids to cgroup-v2.rst
cgroup: don't call cgroup1_pidlist_destroy_all() for v2
cgroup_freezer: update comment for freezer_css_online()
cgroup/rstat: desc member cgrp in cgroup_rstat_flush_release
cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
cgroup/pids: Remove superfluous zeroing
docs: cgroup-v1: Fix description for css_online
...
|
||
|
|
ce0abef6a1 |
cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n
Explicitly disallow enabling mitigations at runtime for kernels that were built with CONFIG_CPU_MITIGATIONS=n, as some architectures may omit code entirely if mitigations are disabled at compile time. E.g. on x86, a large pile of Kconfigs are buried behind CPU_MITIGATIONS, and trying to provide sane behavior for retroactively enabling mitigations is extremely difficult, bordering on impossible. E.g. page table isolation and call depth tracking require build-time support, BHI mitigations will still be off without additional kernel parameters, etc. [ bp: Touchups. ] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240420000556.2645001-3-seanjc@google.com |
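A sketch of the resulting behavior for CPU_MITIGATIONS=n builds; the warning text is illustrative:

    #ifndef CONFIG_CPU_MITIGATIONS
    static int __init mitigations_parse_cmdline(char *arg)
    {
            pr_crit("Kernel compiled without mitigations, ignoring 'mitigations'; system may still be vulnerable\n");
            return 0;
    }
    early_param("mitigations", mitigations_parse_cmdline);
    #endif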
||
|
|
fe42754b94 |
cpu: Re-enable CPU mitigations by default for !X86 architectures
Rename x86's SPECULATION_MITIGATIONS to CPU_MITIGATIONS, define it in
generic code, and force it on for all architectures except x86. A recent
commit to turn
mitigations off by default if SPECULATION_MITIGATIONS=n kinda sorta
missed that "cpu_mitigations" is completely generic, whereas
SPECULATION_MITIGATIONS is x86-specific.
Rename x86's SPECULATION_MITIGATIONS instead of keeping both and have it
select CPU_MITIGATIONS, as having two configs for the same thing is
unnecessary and confusing. This will also allow x86 to use the knob to
manage mitigations that aren't strictly related to speculative
execution.
Use another Kconfig to communicate to common code that CPU_MITIGATIONS
is already defined instead of having x86's menu depend on the common
CPU_MITIGATIONS. This allows keeping a single point of contact for all
of x86's mitigations, and it's not clear that other architectures *want*
to allow disabling mitigations at compile-time.
Fixes:
|
||
|
|
f337a6a21e |
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
Initialize cpu_mitigations to CPU_MITIGATIONS_OFF if the kernel is built
with CONFIG_SPECULATION_MITIGATIONS=n, as the help text quite clearly
states that disabling SPECULATION_MITIGATIONS is supposed to turn off all
mitigations by default.
│ If you say N, all mitigations will be disabled. You really
│ should know what you are doing to say so.
As is, the kernel still defaults to CPU_MITIGATIONS_AUTO, which results in
some mitigations being enabled in spite of SPECULATION_MITIGATIONS=n.
Fixes:
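A sketch of the fix: derive the boot default from the Kconfig instead of hard-coding it, roughly:

    /*
     * Default to "auto" only when mitigation support is compiled in, so
     * SPECULATION_MITIGATIONS=n really means off by default, as the help
     * text promises. "mitigations=" on the command line still overrides.
     */
    static enum cpu_mitigations cpu_mitigations __ro_after_init =
            IS_ENABLED(CONFIG_SPECULATION_MITIGATIONS) ? CPU_MITIGATIONS_AUTO :
                                                         CPU_MITIGATIONS_OFF;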
|
||
|
|
2125c0034c |
cgroup/cpuset: Make cpuset hotplug processing synchronous
Since commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside
get_online_cpus()"), cpuset hotplug has been done asynchronously via a work
function. This was done to avoid recursive locking of cgroup_mutex.
Since then, the cgroup locking scheme has changed quite a bit. A
cpuset_mutex was introduced to protect cpuset specific operations.
The cpuset_mutex was then replaced by a cpuset_rwsem. With commit
|
||
|
|
ca7e917769 |
Rework of APIC enumeration and topology evaluation:
The current implementation has a couple of shortcomings:
- It fails to handle hybrid systems correctly.
- The APIC registration code which handles CPU number assignments is in
the middle of the APIC code and detached from the topology evaluation.
- The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest
specific ones, tweak global variables as they see fit or in case of
XENPV just hack around the generic mechanisms completely.
- The CPUID topology evaluation code is sprinkled all over the vendor
code and reevaluates global variables on every hotplug operation.
- There is no way to analyze topology on the boot CPU before bringing up
the APs. This causes problems for infrastructure like PERF which needs
to size certain aspects upfront or could be simplified if that would be
possible.
- The APIC admission and CPU number association logic is incomprehensible
and overly complex and needs to be kept around after boot instead of
completing this right after the APIC enumeration.
This update addresses these shortcomings with the following changes:
- Rework the CPUID evaluation code so it is common for all vendors and
provides information about the APIC ID segments in a uniform way
independent of the number of segments (Thread, Core, Module, ..., Die,
Package) so that this information can be computed instead of rewriting
global variables of dubious value over and over.
- A few cleanups and simplifications of the APIC, IO/APIC and related
interfaces to prepare for the topology evaluation changes.
- Separation of the parser stages so the early evaluation which tries to
find the APIC address can be separately overridden from the late
evaluation which enumerates and registers the local APIC as further
preparation for sanitizing the topology evaluation.
- A new registration and admission logic which
- encapsulates the inner workings so that parsers and guest logic
can no longer fiddle with it
- uses the APIC ID segments to build topology bitmaps at registration
time
- provides a sane admission logic
- allows to detect the crash kernel case, where CPU0 does not run on
the real BSP, automatically. This is required to prevent sending
INIT/SIPI sequences to the real BSP which would reset the whole
machine. This was so far handled by a tedious command line
parameter, which does not even work in nested crash scenarios.
- Associates CPU number after the enumeration completed and prevents
the late registration of APICs, which was somehow tolerated before.
- Converting all parsers and guest enumeration mechanisms over to the
new interfaces.
This allows to get rid of all global variable tweaking from the parsers
and enumeration mechanisms and sanitizes the XEN[PV] handling so it can
use CPUID evaluation for the first time.
- Mopping up existing sins by taking the information from the APIC ID
segment bitmaps.
This evaluates hybrid systems correctly on the boot CPU and allows for
cleanups and fixes in the related drivers, e.g. PERF.
The series has been extensively tested and the minimal late fallout due to
a broken ACPI/MADT table has been addressed by tightening the admission
logic further.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuDawTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobE7EACngItF+UOTCoCV6och2lL6HVoIdZD1
Y5oaAgD+WzQSz/lBkH6b9kZSyvjlMo6O9GlnGX+ii+VUnijDp4VrspnxbJDaKEq3
gOfsSg2Tk+ps50HqMcZawjjBYJb/TmvKwEV2XuzIBPOONSWLNjvN7nBSzLl1eF9/
8uCE39/8aB5K3GXryRyXdo2uLu6eHTVC0aYFu/kLX1/BbVqF5NMD3sz9E9w8+D/U
MIIMEMXy4Fn+P2o0vVH+gjUlwI76mJbB1WqCX/sqbVacXrjl3KfNJRiisTFIOOYV
8o+rIV0ef5X9xmZqtOXAdyZQzj++Gwmz9+4TU1M4YHtS7UkYn6AluOjvVekCc+gc
qXE3WhqKfCK2/carRMLQxAMxNeRylkZG+Wuv1Qtyjpe9JX2dTqtems0f4DMp9DKf
b7InO3z39kJanpqcUG2Sx+GWanetfnX+0Ho2Moqu6Xi+2ATr1PfMG/Wyr5/WWOfV
qApaHSTwa+J43mSzP6BsXngEv085EHSGM5tPe7u46MCYFqB21+bMl+qH82KjMkOe
c6uZovFQMmX2WBlqJSYGVCH+Jhgvqq8HFeRs19Hd4enOt3e6LE3E74RBVD1AyfLV
1b/m8tYB/o871ZlEZwDCGVrV/LNnA7PxmFpq5ZHLpUt39g2/V0RH1puBVz1e97pU
YsTT7hBCUYzgjQ==
=/5oR
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
"Rework of APIC enumeration and topology evaluation.
The current implementation has a couple of shortcomings:
- It fails to handle hybrid systems correctly.
- The APIC registration code which handles CPU number assignments is
in the middle of the APIC code and detached from the topology
evaluation.
- The various mechanisms which enumerate APICs, ACPI, MPPARSE and
guest specific ones, tweak global variables as they see fit or in
case of XENPV just hack around the generic mechanisms completely.
- The CPUID topology evaluation code is sprinkled all over the vendor
code and reevaluates global variables on every hotplug operation.
- There is no way to analyze topology on the boot CPU before bringing
up the APs. This causes problems for infrastructure like PERF which
needs to size certain aspects upfront or could be simplified if
that would be possible.
- The APIC admission and CPU number association logic is
incomprehensible and overly complex and needs to be kept around
after boot instead of completing this right after the APIC
enumeration.
This update addresses these shortcomings with the following changes:
- Rework the CPUID evaluation code so it is common for all vendors
and provides information about the APIC ID segments in a uniform
way independent of the number of segments (Thread, Core, Module,
..., Die, Package) so that this information can be computed instead
of rewriting global variables of dubious value over and over.
- A few cleanups and simplifications of the APIC, IO/APIC and related
interfaces to prepare for the topology evaluation changes.
- Separation of the parser stages so the early evaluation which tries
to find the APIC address can be separately overridden from the late
evaluation which enumerates and registers the local APIC as further
preparation for sanitizing the topology evaluation.
- A new registration and admission logic which
- encapsulates the inner workings so that parsers and guest logic
can no longer fiddle with it
- uses the APIC ID segments to build topology bitmaps at
registration time
- provides a sane admission logic
- allows to detect the crash kernel case, where CPU0 does not run
on the real BSP, automatically. This is required to prevent
sending INIT/SIPI sequences to the real BSP which would reset
the whole machine. This was so far handled by a tedious command
line parameter, which does not even work in nested crash
scenarios.
- Associates CPU number after the enumeration completed and
prevents the late registration of APICs, which was somehow
tolerated before.
- Converting all parsers and guest enumeration mechanisms over to the
new interfaces.
This allows to get rid of all global variable tweaking from the
parsers and enumeration mechanisms and sanitizes the XEN[PV]
handling so it can use CPUID evaluation for the first time.
- Mopping up existing sins by taking the information from the APIC ID
segment bitmaps.
This evaluates hybrid systems correctly on the boot CPU and allows
for cleanups and fixes in the related drivers, e.g. PERF.
The series has been extensively tested and the minimal late fallout
due to a broken ACPI/MADT table has been addressed by tightening the
admission logic further"
* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
x86/topology: Ignore non-present APIC IDs in a present package
x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
smp: Provide 'setup_max_cpus' definition on UP too
smp: Avoid 'setup_max_cpus' namespace collision/shadowing
x86/bugs: Use fixed addressing for VERW operand
x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
x86/cpu/topology: Provide __num_[cores|threads]_per_package
x86/cpu/topology: Rename topology_max_die_per_package()
x86/cpu/topology: Rename smp_num_siblings
x86/cpu/topology: Retrieve cores per package from topology bitmaps
x86/cpu/topology: Use topology logical mapping mechanism
x86/cpu/topology: Provide logical pkg/die mapping
x86/cpu/topology: Simplify cpu_mark_primary_thread()
x86/cpu/topology: Mop up primary thread mask handling
x86/cpu/topology: Use topology bitmaps for sizing
x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
x86/xen/smp_pv: Count number of vCPUs early
x86/cpu/topology: Assign hotpluggable CPUIDs during init
x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
x86/topology: Add a mechanism to track topology via APIC IDs
...
|
||
|
|
d08c407f71 |
A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer wheel
of a CPU which is likely to be busy at the time of expiry. This is done
to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is close
to zero.
2) Due to #1 it is possible that timers are accumulated on a
single target CPU
3) The required computation in the enqueue path is just overhead for
dubious value especially under the consideration that the vast
majority of timer wheel timers are either canceled or rearmed
before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on which
they get armed.
This is achieved by having separate wheels for CPU pinned timers and
global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they expire.
- If the first expiring timer is a global timer, then the expiry time
is propagated into the timer pull hierarchy and the CPU makes sure
to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to the
point where no further aggregation of groups is required, i.e. the
number of levels is log8(NR_CPUS). The magic number of eight has been
established by experimentation, but can be adjusted if needed.
In each group one busy CPU acts as the migrator. It's only one CPU to
avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether there
are other CPUs in the group which have gone idle and have global timers
to expire. If there are global timers to expire, the migrator locks the
remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can require
to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point the
CPU is the systemwide migrator at the top of the hierarchy and it
therefore cannot delegate to the hierarchy. It needs to arm its own
timer device to expire either at the first expiring timer in the
hierarchy or at the first CPU local timer, whichever expires first.
This completely removes the overhead from the enqueue path, which is
e.g. for networking a true hotpath and trades it for a slightly more
complex idle path.
This has been in development for a couple of years and the final series
has been extensively tested by various teams from silicon vendors and
ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them to
power down a die completely on a multi-die socket for the first time in
a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific overloaded
netperf test which is currently investigated, but the rest is either
positive or neutral performance wise and positive on the power
management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware timers
and interpolated them back to clock MONOTONIC. The changes address a
few corner cases in the interpolation code which got the math and logic
wrong.
- Simplification of the clocksource watchdog retry logic to automatically
adjust to handle larger systems correctly instead of having more
incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuAN0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVKXEADIR45rjR1Xtz32js7B53Y65O4WNoOQ
6/ycWcswuGzg/h4QUpPSJ6gOGVmKSWwZi4n0P/VadCiXGSPPm0aUKsoRUt9DZsPY
mtj2wjCSXKXiyhTl9OtrZME86ZAIGO1dQXa/sOHsiP5PCjgQkD0b5CYi1+B6eHDt
1/Uo2Tb9g8VAPppq20V5Uo93GrPf642oyi3FCFrR1M112Uuak5DmqHJYiDpreNcG
D5SgI+ykSiaUaVyHifvqijoJk0rYXkqEC6evl02477lJ/X0vVo2/M8XPS95BxHST
s5Iruo4rP+qeAy8QvhZpoPX59fO0m/AgA7cf77XXAtOpVdLH+bs4ILsEbouAIOtv
lsmRkcYt+TpvrZFHPAxks+6g3afuROiDtxD5sXXpVWxvofi8FwWqubdlqdsbw9MP
ZCTNyzNyKL47QeDwBfSynYUL1RSyqsphtIwk4oeQklH9rwMAnW21hi30z15hQ0pQ
FOVkmcwi79JNvl/G+jRkDzw7r8/zcHshWdSjyUM04CDjjnCDjQOFWSIjEPwbQjjz
S4HXpJKJW963dBgs9Z84/Ctw1GwoBk1qedDWDJE1257Qvmo/Wpe/7GddWcazOGnN
RRFMzGPbOqBDbjtErOKGU+iCisgNEvz2XK+TI16uRjWde7DxZpiTVYgNDrZ+/Pyh
rQ23UBms6ZRR+A==
=iQlu
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer
wheel of a CPU which is likely to be busy at the time of expiry.
This is done to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is
close to zero.
2) Due to #1 it is possible that timers are accumulated on
a single target CPU
3) The required computation in the enqueue path is just overhead
for dubious value especially under the consideration that the
vast majority of timer wheel timers are either canceled or
rearmed before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on
which they get armed.
This is achieved by having separate wheels for CPU pinned timers
and global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they
expire.
- If the first expiring timer is a global timer, then the expiry
time is propagated into the timer pull hierarchy and the CPU
makes sure to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to
the point where no further aggregation of groups is required, i.e.
the number of levels is log8(NR_CPUS); see the level-computation
sketch after this entry. The magic number of eight has been
established by experimentation, but can be adjusted if needed.
In each group one busy CPU acts as the migrator. It's only one CPU
to avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether
there are other CPUs in the group which have gone idle and have
global timers to expire. If there are global timers to expire, the
migrator locks the remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can
require to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point
the CPU is the systemwide migrator at the top of the hierarchy and
it therefore cannot delegate to the hierarchy. It needs to arm its
own timer device to expire either at the first expiring timer in
the hierarchy or at the first CPU local timer, whichever expires
first.
This completely removes the overhead from the enqueue path, which
is e.g. for networking a true hotpath and trades it for a slightly
more complex idle path.
This has been in development for a couple of years and the final
series has been extensively tested by various teams from silicon
vendors and ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them
to power down a die completely on a multi-die socket for the first
time in a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific
overloaded netperf test which is currently investigated, but the
rest is either positive or neutral performance wise and positive on
the power management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware
timers and interpolate them back to clock MONOTONIC. The changes
address a few corner cases in the interpolation code which got the
math and logic wrong.
- Simplification of the clocksource watchdog retry logic to
automatically adjust to handle larger systems correctly instead of
having more incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place"
* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
timer/migration: Fix quick check reporting late expiry
tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
vdso/datapage: Quick fix - use asm/page-def.h for ARM64
timers: Assert no next dyntick timer look-up while CPU is offline
tick: Assume timekeeping is correctly handed over upon last offline idle call
tick: Shut down low-res tick from dying CPU
tick: Split nohz and highres features from nohz_mode
tick: Move individual bit features to debuggable mask accesses
tick: Move got_idle_tick away from common flags
tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
tick: Start centralizing tick related CPU hotplug operations
tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
tick: Use IS_ENABLED() whenever possible
tick/sched: Remove useless oneshot ifdeffery
tick/nohz: Remove duplicate between lowres and highres handlers
tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
hrtimer: Select housekeeping CPU during migration
...
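On the "groups of eight" sizing above: a hypothetical helper illustrating how the number of hierarchy levels grows as ceil(log8(NR_CPUS)); this is an illustration, not the kernel's actual sizing code:

    static unsigned int tmigr_levels(unsigned int cpus)
    {
            unsigned int levels = 1, capacity = 8;

            /* Each additional level multiplies the reachable CPUs by 8. */
            while (capacity < cpus) {
                    capacity *= 8;
                    levels++;
            }
            return levels;  /* 8 CPUs -> 1, 64 -> 2, 512 -> 3, 4096 -> 4 */
    }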
|
||
|
|
4c8a498541 |
smp: Avoid 'setup_max_cpus' namespace collision/shadowing
bringup_nonboot_cpus() gets passed the 'setup_max_cpus' variable in init/main.c - which is also the name of its parameter, so the parameter shadows the global. To reduce confusion and to allow the 'setup_max_cpus' value to be #defined in the <linux/smp.h> header, use 'max_cpus' as the function parameter name. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org
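The change is essentially a parameter rename; roughly:

    /* Before: the parameter shadows the global of the same name. */
    void __init bringup_nonboot_cpus(unsigned int setup_max_cpus);

    /*
     * After: distinct name, so <linux/smp.h> can later #define
     * setup_max_cpus for the UP case without clashing.
     */
    void __init bringup_nonboot_cpus(unsigned int max_cpus);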
||
|
|
500f8f9bce |
tick: Assume timekeeping is correctly handed over upon last offline idle call
The timekeeping duty is handed over from the outgoing CPU on stop machine, then the oneshot tick is stopped right after. Therefore it's guaranteed that the current CPU isn't the timekeeper upon its last call to idle. Besides, calling tick_nohz_idle_stop_tick() while the dying CPU goes into idle suggests that the tick is going to be stopped while it is actually stopped already from the appropriate CPU hotplug state. Remove the confusing call and the obsolete case handling and convert it to a sanity check that verifies the above assumption. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-16-frederic@kernel.org |
||
|
|
ef8969bb55 |
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
The broadcast shutdown code is executed through a random explicit call within stop machine from the outgoing CPU. However, the tick broadcast is a middle layer between the tick callback and the clocksource, therefore it makes more sense to shut it down after the tick callback and before the clocksource drivers. Move it instead to the common tick shutdown CPU hotplug state where related operations can be ordered from highest to lowest level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-10-frederic@kernel.org
||
|
|
3ad6eb0683 |
tick: Start centralizing tick related CPU hotplug operations
During the CPU offlining process, the various timer tick features are
shut down from scattered places, sometimes from teardown callbacks on
stop machine, sometimes through explicit calls, sometimes from the
control CPU after the CPU died. The reason why these shutdown operations
are spread around is not always clear and it makes the tick lifecycle
hard to follow.
The tick should be shut down in order from highest to lowest level:
On stop machine from the dying CPU (high-level):
1) Hand-over the timekeeping duty (tick_handover_do_timer())
2) Cancel the tick implementation called by the clockevent callback
(tick_cancel_sched_timer())
3) Shutdown broadcasting (tick_offline_cpu() / tick_broadcast_offline())
On stop machine from the dying CPU (low-level):
4) Shutdown clockevents drivers (CPUHP_AP_*_TIMER_STARTING states)
From the control CPU after the CPU died (low-level):
5) Shutdown/unregister/cleanup clockevents for the dead CPU
(tick_cleanup_dead_cpu())
Instead the current order is 2, 4 (both from CPU hotplug states), then
1 and 3 through direct calls. This layout and order don't make much
sense. The operations 1, 2, 3 should be gathered together and in order.
Sort out this situation by creating a new TICK shutdown CPU hotplug state
and start by introducing the timekeeping duty hand-over there. The
state must precede hrtimers migration because the tick hrtimer will be
stopped from it in a further patch.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240225225508.11587-8-frederic@kernel.org
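A sketch of the new state, assuming it slots in next to CPUHP_AP_HRTIMERS_DYING; teardown callbacks run from the highest state downwards, so a higher-numbered state is shut down earlier on the dying CPU. The handler name is illustrative:

    enum cpuhp_state {
            /* ... */
            CPUHP_AP_HRTIMERS_DYING,
            CPUHP_AP_TICK_DYING,    /* new: torn down before hrtimers move */
            /* ... */
    };

    /* Registered in the cpuhp_hp_states[] table, teardown only: */
    [CPUHP_AP_TICK_DYING] = {
            .name                   = "tick:dying",
            .startup.single         = NULL,
            .teardown.single        = tick_cpu_dying,   /* hands over timekeeping */
    },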
|
||
|
|
266e957864 |
cpu: Remove stray semicolon
This syntax error was introduced by commit |
||
|
|
da92df490e |
cpu: Mark cpu_possible_mask as __ro_after_init
cpu_possible_mask is by definition "cpus which could be hotplugged without reboot". It's a property which is fixed once the kernel has enumerated the hardware configuration. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/41cd78af-92a3-4f23-8c7a-4316a04a66d8@p183
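The change itself is a one-annotation diff, roughly:

    /*
     * Writable during boot only; the containing section is remapped
     * read-only once init completes, so any later modification faults
     * instead of silently corrupting the mask.
     */
    struct cpumask __cpu_possible_mask __ro_after_init;
    EXPORT_SYMBOL(__cpu_possible_mask);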
||
|
|
effe6d278e |
kernel/cpu: Convert snprintf() to sysfs_emit()
Per filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Coccinelle complains that there are still a couple of functions that use snprintf(). Convert them to sysfs_emit(). No functional change intended. Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240116045151.3940401-40-lizhijian@fujitsu.com
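The conversion pattern, shown on a hypothetical show() function:

    static ssize_t demo_show(struct device *dev,
                             struct device_attribute *attr, char *buf)
    {
            int val = 0;    /* whatever the attribute reports */

            /* Before: return snprintf(buf, PAGE_SIZE, "%d\n", val); */

            /*
             * After: sysfs_emit() knows buf is a full PAGE_SIZE buffer
             * and validates it, per Documentation/filesystems/sysfs.rst.
             */
            return sysfs_emit(buf, "%d\n", val);
    }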
||
|
|
ef7e585bf4 |
cpu/hotplug: Delete an extraneous kernel-doc description
struct cpuhp_cpu_state has an extraneous kernel-doc comment for @cpu. There is no struct member by that name, so remove the comment to prevent the kernel-doc warning: kernel/cpu.c:85: warning: Excess struct member 'cpu' description in 'cpuhp_cpu_state' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240114030615.30441-1-rdunlap@infradead.org |
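For context, kernel-doc requires every @name line in a struct comment to match an actual member; a minimal example of the convention:

    /**
     * struct cpuhp_cpu_state - Per cpu hotplug state storage
     * @state:      The current cpu state
     * @target:     The target state
     *
     * Any @name without a matching member (such as the removed @cpu)
     * makes scripts/kernel-doc warn about an "Excess struct member".
     */
    struct cpuhp_cpu_state {
            enum cpuhp_state        state;
            enum cpuhp_state        target;
    };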
||
|
|
d30e51aa7b |
slab updates for 6.8
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmWWu9EACgkQu+CwddJF
iJpXvQf/aGL7uEY57VpTm0t4gPwoZ9r2P89HxI/nQs9XgVzDcBmVp/cC0LDvSdcm
t91kJO538KeGjMgvlhLMTEuoShH5FlPs6cOwrGAYUoAGa4NwiOpGvliGky+nNHqY
w887ZgSzVLq0UOuSvn86N6enumMvewt4V+872+OWo6O1HWOJhC0SgHTIa8QPQtwb
yZ9BghO5IqMRXiZEsSIwyO+tQHcaU6l2G5huFXzgMFUhkQqAB9KTFc3h6rYI+i80
L4ppNXo2KNPGTDRb9dA8LNMWgvmfjhCb7chs8o1zSY2PwZlkzOix7EUBLCAIbc/2
EIaFC8AsZjfT47D1t72r8QpHB+C14Q==
=J+E7
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- SLUB: delayed freezing of CPU partial slabs (Chengming Zhou)
Freezing is an operation involving double_cmpxchg() that makes a slab
exclusive for a particular CPU. Chengming noticed that we use it also
in situations where we are not yet installing the slab as the CPU
slab, because freezing also indicates that the slab is not on the
shared list. This results in redundant freeze/unfreeze operation and
can be avoided by marking separately the shared list presence by
reusing the PG_workingset flag.
This approach neatly avoids the issues described in
|
||
|
|
70da1d01ed |
cpu/hotplug: remove CPUHP_SLAB_PREPARE hooks
The CPUHP_SLAB_PREPARE hooks are only used by SLAB which is removed. SLUB defines them as NULL, so we can remove those altogether. Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |