linux/kernel
Andrea Righi 0b47b6c354 Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.

In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.

However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.

As an example, let's consider the following scenario:

 $ schedtool -a 0,1, -e yes > /dev/null
 $ sudo schedtool -F -p 99 -a 0, -e \
   stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[                        0.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
 1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
  975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[||||||||||||||||||||||100.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
  577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
  555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.

This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.

Fixes: 97e13ecb02 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 10:15:23 -10:00
..
bpf bpf: Fix memory leak of bpf_scc_info objects 2025-08-02 09:04:57 -07:00
cgroup cgroup: Changes for v6.17 2025-07-31 16:04:19 -07:00
configs configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON 2025-07-21 21:41:57 -07:00
debug TTY/Serial driver updates for 6.15-rc1 2025-04-02 18:17:33 -07:00
dma dma-contiguous: hornor the cma address limit setup by user 2025-06-12 08:38:40 +02:00
entry ARM: 2025-07-30 17:14:01 -07:00
events perf/core: Prevent VMA split of buffer mappings 2025-08-05 21:55:29 +02:00
futex futex: Remove support for IMMUTABLE 2025-07-11 16:02:01 +02:00
gcov kbuild: require gcc-8 and binutils-2.30 2025-04-30 21:53:35 +02:00
irq genirq/test: Resolve irq lock inversion warnings 2025-08-06 10:29:48 +02:00
kcsan kcsan: test: Initialize dummy variable 2025-07-23 08:51:32 +02:00
livepatch sched,livepatch: Untangle cond_resched() and live-patching 2025-05-14 13:16:24 +02:00
locking Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
module Significant patch series in this pull request: 2025-08-05 16:02:07 +03:00
power drm for 6.17-rc1 2025-07-30 19:26:49 -07:00
printk printk changes for 6.17 2025-08-04 10:54:36 -07:00
rcu sched_ext: Changes for v6.17 2025-07-31 16:29:46 -07:00
sched Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()" 2025-09-16 10:15:23 -10:00
time bitmap-for-6.17 2025-07-31 16:52:32 -07:00
trace Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
unwind unwind: Finish up unwind when a task exits 2025-07-31 10:20:11 -04:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-06-24 20:30:37 +09:00
acct.c acct: block access to kernel internal filesystems 2025-02-12 12:24:16 +01:00
async.c
audit.c audit: record AUDIT_ANOM_* events regardless of presence of rules 2025-04-11 14:14:41 -04:00
audit.h audit,module: restore audit logging in load failure case 2025-06-16 17:00:06 -04:00
audit_fsnotify.c
audit_tree.c replace collect_mounts()/drop_collected_mounts() with a safer variant 2025-06-23 14:01:49 -04:00
audit_watch.c fs: add kern_path_locked_negative() 2025-04-15 11:32:34 +02:00
auditfilter.c audit: fix suffixed '/' filename matching 2024-12-05 19:22:38 -05:00
auditsc.c audit,module: restore audit logging in load failure case 2025-06-16 17:00:06 -04:00
backtracetest.c
bounds.c
capability.c capability: Remove unused has_capability 2025-03-07 22:03:09 -06:00
cfi.c cfi: Move BPF CFI types and helpers to generic code 2025-07-31 18:23:53 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Make RCU watch ct_kernel_exit_state() warning 2025-03-04 18:44:29 -08:00
cpu.c cpu: Remove obsolete comment from takedown_cpu() 2025-08-06 22:48:12 +02:00
cpu_pm.c
crash_core.c kdump: wait for DMA to finish when using CMA 2025-07-19 19:08:23 -07:00
crash_dump_dm_crypt.c crash_dump: retrieve dm crypt keys in kdump kernel 2025-05-21 10:48:21 -07:00
crash_reserve.c kdump: implement reserve_crashkernel_cma 2025-07-19 19:08:23 -07:00
cred.c cred: remove old {override,revert}_creds() helpers 2024-12-02 11:25:09 +01:00
delayacct.c delayacct: remove redundant code and adjust indentation 2025-05-27 19:40:33 -07:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
exit.h
extable.c
fail_function.c
fork.c - Prevent a futex hash leak due to different mm lifetimes 2025-08-10 08:11:39 +03:00
freezer.c sched,freezer: Remove unnecessary warning in __thaw_task 2025-07-17 07:56:50 -10:00
gen_kheaders.sh kheaders: make it possible to override TAR 2025-08-06 10:23:36 +09:00
groups.c
hung_task.c hung_task: extend hung task blocker tracking to rwsems 2025-07-19 19:08:26 -07:00
iomem.c mm/memremap: Pass down MEMREMAP_* flags to arch_memremap_wb() 2025-02-21 15:05:38 +01:00
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Use RCU in all users of __module_text_address(). 2025-03-10 11:54:46 +01:00
kallsyms.c bpf: Clean up individual BTF_ID code 2025-07-16 18:34:42 -07:00
kallsyms_internal.h
kallsyms_selftest.c kallsyms: Use kthread_run_on_cpu() 2025-01-02 22:12:12 +01:00
kallsyms_selftest.h
kcmp.c kcmp: improve performance adding an unlikely hint to task comparisons 2025-02-21 10:25:33 +01:00
Kconfig.freezer
Kconfig.hz kernel: Fix "select" wording on HZ_250 description 2025-02-21 09:20:30 +01:00
Kconfig.kexec crashdump: add CONFIG_KEYS dependency 2025-06-25 15:55:04 -07:00
Kconfig.locks
Kconfig.preempt sched: No PREEMPT_RT=y for all{yes,mod}config 2024-11-07 15:25:05 +01:00
kcov.c kcov: fix typo in comment of kcov_fault_in_area 2025-07-09 22:57:52 -07:00
kexec.c kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kexec_core.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
kexec_elf.c kexec: initialize ELF lowest address to ULONG_MAX 2025-03-16 22:30:47 -07:00
kexec_file.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
kexec_handover.c Summary of significant series in this pull request: 2025-07-31 14:57:54 -07:00
kexec_internal.h kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kheaders.c kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() 2024-12-24 09:46:49 +01:00
kprobes.c kprobes: Add missing kerneldoc for __get_insn_slot 2025-07-15 18:45:34 +09:00
kstack_erase.c stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth 2025-07-21 21:40:39 -07:00
ksyms_common.c
ksysfs.c kernel/ksysfs.c: simplify bin_attribute definition 2025-01-07 16:59:15 +01:00
kthread.c kthread: update comment for __to_kthread 2025-07-09 22:57:55 -07:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Makefile Kbuild updates for v6.17 2025-08-06 07:32:52 +03:00
module_signature.c
notifier.c
nsproxy.c kernel/nsproxy: remove unnecessary guards 2025-05-09 13:13:54 +02:00
padata.c padata: use cpumask_nth() 2025-06-13 17:26:17 +08:00
panic.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
params.c module: ensure that kobject_put() is safe for module type kobjects 2025-05-07 20:24:59 +02:00
pid.c Summary 2025-07-29 21:43:08 -07:00
pid_namespace.c pid: Do not set pid_max in new pid namespaces 2025-03-06 10:18:36 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
profile.c
ptrace.c ptrace: introduce PTRACE_SET_SYSCALL_INFO request 2025-05-11 17:48:15 -07:00
range.c
reboot.c - The 7 patch series "powerpc/crash: use generic crashkernel 2025-04-01 10:06:52 -07:00
regset.c
relay.c relayfs: support a counter tracking if data is too big to write 2025-07-09 22:57:52 -07:00
resource.c resource: fix false warning in __request_region() 2025-07-24 17:57:59 -07:00
resource_kunit.c
rseq.c rseq: Fix segfault on registration when rseq_cs is non-zero 2025-03-06 22:26:49 +01:00
scftorture.c scftorture: Handle NULL argument passed to scf_add_to_free_list(). 2024-11-14 16:09:51 -08:00
scs.c
seccomp.c seccomp: avoid the lock trip seccomp_filter_release in common case 2025-02-24 11:17:10 -08:00
signal.c coredump: rename do_coredump() to vfs_coredump() 2025-06-16 17:01:22 +02:00
smp.c smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment 2025-08-02 14:24:50 +02:00
smpboot.c sched/smp: Use the SMP version of idle_thread_set_boot_cpu() 2025-06-13 08:47:20 +02:00
smpboot.h
softirq.c lockdep: Fix wait context check on softirq for PREEMPT_RT 2025-03-25 10:46:44 +01:00
stacktrace.c
static_call.c
static_call_inline.c Modules changes for 6.15-rc1 2025-03-30 15:44:36 -07:00
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys.c Summary of significant series in this pull request: 2025-07-31 14:57:54 -07:00
sys_ni.c
sysctl-test.c sysctl: move u8 register test to lib/test_sysctl.c 2025-04-14 14:13:41 +02:00
sysctl.c sysctl: rename kern_table -> sysctl_subsys_table 2025-07-23 11:56:02 +02:00
task_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
taskstats.c
torture.c torture: Add get_torture_init_jiffies() for test-start time 2025-02-05 07:14:24 -08:00
tracepoint.c tracepoint: Print the function symbol when tracepoint_debug is set 2025-03-21 15:30:10 -04:00
tsacct.c
ucount.c ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() 2025-08-02 12:01:38 -07:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user-return-notifier.c
user.c
user_namespace.c uidgid: add map_id_range_up() 2025-02-12 12:12:27 +01:00
utsname.c
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
vhost_task.c vhost: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)) 2025-08-01 09:11:08 -04:00
vmcore_info.c crash: export PAGE_UNACCEPTED_MAPCOUNT_VALUE to vmcoreinfo 2025-05-11 17:54:04 -07:00
watch_queue.c vfs-6.15-rc1.pipe 2025-03-24 09:52:37 -07:00
watchdog.c kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count 2025-05-21 10:48:22 -07:00
watchdog_buddy.c watchdog: fix opencoded cpumask_next_wrap() in watchdog_next_cpu() 2025-07-31 11:28:03 -04:00
watchdog_perf.c watchdog/perf: Provide function for adjusting the event period 2025-07-04 13:17:30 +01:00
workqueue.c workqueue: Changes for v6.17 2025-07-31 15:40:22 -07:00
workqueue_internal.h