linux/kernel
Linus Torvalds d5048d1176 Updates for the core time/timer subsystem:
- Fix a memory ordering issue in posix-timers
 
     Posix-timer lookup is lockless and reevaluates the timer validity under
     the timer lock, but the update which validates the timer is not
     protected by the timer lock. That allows the store to be reordered
     against the initialization stores, so that the lookup side can observe
     a partially initialized timer. That's mostly a theoretical problem, but
     incorrect nevertheless.
 
   - Fix a long standing inconsistency of the coarse time getters
 
     The coarse time getters read the base time of the current update cycle
     without reading the actual hardware clock. NTP frequency adjustment can
     set the base time backwards. The fine grained interfaces compensate
     this by reading the clock and applying the new conversion factor, but
     the coarse grained time getters use the base time directly. That allows
     the user to observe time going backwards.
 
     Cure it by always forwarding base time, when NTP changes the frequency
     with an immediate step.
 
   - Rework of posix-timer hashing
 
     The posix-timer hash is not scalable and due to the CRIU timer restore
     mechanism prone to massive contention on the global hash bucket lock.
 
     Replace the global hash lock with a fine grained per bucket locking
     scheme to address that.
 
   - Rework the proc/$PID/timers interface.
 
     /proc/$PID/timers is provided for CRIU to be able to restore a
     timer. The printout happens with sighand lock held and interrupts
     disabled. That's not required as this can be done with RCU protection
     as well.
 
   - Provide a sane mechanism for CRIU to restore a timer ID
 
     CRIU restores timers by creating and deleting them until the kernel
     internal per process ID counter reached the requested ID. That's
     horribly slow for sparse timer IDs.
 
     Provide a prctl() which allows CRIU to restore a timer with a given
     ID. When enabled the ID pointer is used as input pointer to read the
     requested ID from user space. When disabled, the normal allocation
     scheme (next ID) is active as before. This is backwards compatible for
     both kernel and user space.
 
   - Make hrtimer_update_function() less expensive.
 
     The sanity checks are valuable, but expensive for high frequency usage
     in io/uring. Make the debug checks conditional and enable them only
     when lockdep is enabled.
 
   - Small updates, cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfgQ6wTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeQzD/9p+EuUGrMbSNaLVMCYFULBbR0lersJ
 hrGGoKUsNt5T+f6hEEbSLBnkjZcMIj0J+mdIEUiRa73ryw1KmwLk/8MBu0c6u6q3
 musDvJqt3dLTG98yN0YeWK3tJDxhSjxIpwcAXusPQ04j16I2fVXFzDQ/kGPq6MTI
 tdMYzsS3wjuWpi+CbgRSP2HEwu08fIDVsQ7Grynh4Kmd31apne4ZgF2UVp6UiZyp
 8yJHZgVzJcFs7Y3MS6XTgezHnuADxMY1irzbXmok19941X8mZz2QRIpGQX+oMh6o
 g7SG2lj9i8YbLqU9/5RbC5ppjRcWfogDpW0Lk+OmdOpr0RiXTmx5Lz8Egxex9wG5
 pUJszeTY+bLw7mmYmkGZyBz+PNoGgVM5KFZRe5ENvYM8Gy8LUW5DA9zvxeHqDDz1
 FiMmKdYrwr8VCKqx+8hJQdzlzRbepxq9sNzDdMKVOUcFdGUVWekfG6ZFkfLKxwzA
 XDTKJilzXbAAj4r57vEvOCYLUZH/ZsFK4yyg0O53fEg6fj87EbTDb5+YUGazb3+C
 yNTEOQIT8LtutzLR9+xeLi92k+6zlJ4c1PfqBx5Kv/TwBrIfV1P8N2c6TCOWDoRM
 AOvo2SXEA/jEPix2GjT5jalSV1mROEXo2T9/G7kz4H7K+DkI/dGgS9mXyUDO2mMd
 ouOxYN0GohVqTQ==
 =XUGH
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - Fix a memory ordering issue in posix-timers

   Posix-timer lookup is lockless and reevaluates the timer validity
   under the timer lock, but the update which validates the timer is not
   protected by the timer lock. That allows the store to be reordered
   against the initialization stores, so that the lookup side can
   observe a partially initialized timer. That's mostly a theoretical
   problem, but incorrect nevertheless.

 - Fix a long standing inconsistency of the coarse time getters

   The coarse time getters read the base time of the current update
   cycle without reading the actual hardware clock. NTP frequency
   adjustment can set the base time backwards. The fine grained
   interfaces compensate this by reading the clock and applying the new
   conversion factor, but the coarse grained time getters use the base
   time directly. That allows the user to observe time going backwards.

   Cure it by always forwarding base time, when NTP changes the
   frequency with an immediate step.

 - Rework of posix-timer hashing

   The posix-timer hash is not scalable and due to the CRIU timer
   restore mechanism prone to massive contention on the global hash
   bucket lock.

   Replace the global hash lock with a fine grained per bucket locking
   scheme to address that.

 - Rework the proc/$PID/timers interface.

   /proc/$PID/timers is provided for CRIU to be able to restore a timer.
   The printout happens with sighand lock held and interrupts disabled.
   That's not required as this can be done with RCU protection as well.

 - Provide a sane mechanism for CRIU to restore a timer ID

   CRIU restores timers by creating and deleting them until the kernel
   internal per process ID counter reached the requested ID. That's
   horribly slow for sparse timer IDs.

   Provide a prctl() which allows CRIU to restore a timer with a given
   ID. When enabled the ID pointer is used as input pointer to read the
   requested ID from user space. When disabled, the normal allocation
   scheme (next ID) is active as before. This is backwards compatible
   for both kernel and user space.

 - Make hrtimer_update_function() less expensive.

   The sanity checks are valuable, but expensive for high frequency
   usage in io/uring. Make the debug checks conditional and enable them
   only when lockdep is enabled.

 - Small updates, cleanups and improvements

* tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  selftests/timers: Improve skew_consistency by testing with other clockids
  timekeeping: Fix possible inconsistencies in _COARSE clockids
  posix-timers: Drop redundant memset() invocation
  selftests/timers/posix-timers: Add a test for exact allocation mode
  posix-timers: Provide a mechanism to allocate a given timer ID
  posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
  posix-timers: Make per process list RCU safe
  posix-timers: Avoid false cacheline sharing
  posix-timers: Switch to jhash32()
  posix-timers: Improve hash table performance
  posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
  posix-timers: Make lock_timer() use guard()
  posix-timers: Rework timer removal
  posix-timers: Simplify lock/unlock_timer()
  posix-timers: Use guards in a few places
  posix-timers: Remove SLAB_PANIC from kmem cache
  posix-timers: Remove a few paranoid warnings
  posix-timers: Cleanup includes
  posix-timers: Add cond_resched() to posix_timer_add() search loop
  posix-timers: Initialise timer before adding it to the hash table
  ...
2025-03-25 10:33:23 -07:00
..
bpf [ Merge note: this pull request depends on you having merged 2025-03-24 22:06:11 -07:00
cgroup Scheduler updates for v6.15: 2025-03-24 21:28:12 -07:00
configs ubsan/overflow: Rework integer overflow sanitizer option to turn on everything 2025-03-07 19:58:05 -08:00
debug kdb: Remove unused flags stack 2025-01-25 08:22:26 +00:00
dma dma-mapping: fix missing clear bdr in check_ram_in_range_map() 2025-03-12 13:41:44 +01:00
entry Objtool changes for v6.15: 2025-03-24 21:18:05 -07:00
events perf: Clean up pmu specific data 2025-03-17 11:23:38 +01:00
futex futex: Use a hashmask instead of hashsize 2025-02-26 16:07:59 +01:00
gcov gcov: clang: use correct function param names 2025-01-24 22:47:27 -08:00
irq Updates for interrupt chip drivers: 2025-03-25 09:54:36 -07:00
kcsan kcsan: Remove redundant call of kallsyms_lookup_name() 2024-10-14 16:44:56 +02:00
livepatch livepatch: Add stack_order sysfs attribute 2024-12-09 11:44:03 +01:00
locking locking/lockdep: Add kasan_check_byte() check in lock_acquire() 2025-03-08 00:55:04 +01:00
module module: don't annotate ROX memory as kmemleak_not_leak() 2025-02-14 10:32:02 +01:00
power More power management updates for 6.14-rc1 2025-01-30 15:10:34 -08:00
printk Flush console log from kernel_power_off() 2025-03-04 18:44:29 -08:00
rcu RCU pull request for v6.15 2025-03-24 19:41:37 -07:00
sched Scheduler updates for v6.15: 2025-03-24 21:28:12 -07:00
time Updates for the core time/timer subsystem: 2025-03-25 10:33:23 -07:00
trace [ Merge note: this pull request depends on you having merged 2025-03-24 22:06:11 -07:00
.gitignore
acct.c acct: block access to kernel internal filesystems 2025-02-12 12:24:16 +01:00
async.c
audit.c audit: Initialize lsmctx to avoid memory allocation error 2025-01-29 20:02:04 -05:00
audit.h audit: change context data from secid to lsm_prop 2024-10-11 14:34:16 -04:00
audit_fsnotify.c
audit_tree.c
audit_watch.c VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry 2025-02-19 14:08:41 +01:00
auditfilter.c audit: fix suffixed '/' filename matching 2024-12-05 19:22:38 -05:00
auditsc.c fs: dedup handling of struct filename init and refcounts bumps 2025-03-18 15:34:27 +01:00
backtracetest.c
bounds.c
capability.c kernel: remove get_task_comm() and print task comm directly 2025-01-12 20:21:15 -08:00
cfi.c x86/cfi: Add 'cfi=warn' boot option 2025-02-26 12:10:48 +01:00
compat.c
configs.c
context_tracking.c context_tracking: Make RCU watch ct_kernel_exit_state() warning 2025-03-04 18:44:29 -08:00
cpu.c watchdog/hardlockup/perf: Fix perf_event memory leak 2025-03-06 12:05:33 +01:00
cpu_pm.c
crash_core.c crash: Use note name macros 2025-02-10 16:56:58 -08:00
crash_reserve.c crash: fix crash memory reserve exceed system memory bug 2024-09-01 20:43:30 -07:00
cred.c cred: remove old {override,revert}_creds() helpers 2024-12-02 11:25:09 +01:00
delayacct.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c kernel-6.15-rc1.tasklist_lock 2025-03-24 13:39:27 -07:00
exit.h
extable.c
fail_function.c
fork.c pidfs: ensure that PIDFS_INFO_EXIT is available 2025-03-19 14:40:18 +01:00
freezer.c sched/fair: Fix external p->on_rq users 2024-10-14 09:14:35 +02:00
gen_kheaders.sh Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
groups.c
hung_task.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
iomem.c mm/memremap: Pass down MEMREMAP_* flags to arch_memremap_wb() 2025-02-21 15:05:38 +01:00
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Fix static_key_slow_dec() yet again 2024-09-10 11:57:27 +02:00
kallsyms.c kallsyms: Remove KALLSYMS_ABSOLUTE_PERCPU 2025-02-18 10:16:04 +01:00
kallsyms_internal.h
kallsyms_selftest.c kallsyms: Use kthread_run_on_cpu() 2025-01-02 22:12:12 +01:00
kallsyms_selftest.h
kcmp.c kcmp: improve performance adding an unlikely hint to task comparisons 2025-02-21 10:25:33 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.kexec crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 2024-11-14 22:43:48 -08:00
Kconfig.locks
Kconfig.preempt sched: No PREEMPT_RT=y for all{yes,mod}config 2024-11-07 15:25:05 +01:00
kcov.c kcov: mark in_softirq_really() as __always_inline 2024-12-30 17:59:08 -08:00
kexec.c
kexec_core.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
kexec_elf.c
kexec_file.c kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y 2024-09-01 17:59:01 -07:00
kexec_internal.h kexec: use atomic_try_cmpxchg_acquire() in kexec_trylock() 2024-09-01 20:43:23 -07:00
kheaders.c kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() 2024-12-24 09:46:49 +01:00
kprobes.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
ksyms_common.c
ksysfs.c kernel/ksysfs.c: simplify bin_attribute definition 2025-01-07 16:59:15 +01:00
kthread.c kthread: Fix return value on kzalloc() failure in kthread_affine_preferred() 2025-02-04 01:42:27 +01:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Makefile tracing: Disable branch profiling in noinstr code 2025-03-22 09:49:26 +01:00
module_signature.c
notifier.c reboot: move reboot_notifier_list to kernel/reboot.c 2024-11-05 17:12:31 -08:00
nsproxy.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
padata.c padata: switch padata_find_next() to using cpumask_next_wrap() 2025-02-24 16:37:23 -05:00
panic.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
params.c module: Constify 'struct module_attribute' 2025-01-26 13:05:23 +01:00
pid.c kernel-6.15-rc1.tasklist_lock 2025-03-24 13:39:27 -07:00
pid_namespace.c pid: Do not set pid_max in new pid namespaces 2025-03-06 10:18:36 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
profile.c
ptrace.c
range.c
reboot.c Flush console log from kernel_power_off() 2025-03-04 18:44:29 -08:00
regset.c
relay.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
resource.c kernel/resource: simplify API __devm_release_region() implementation 2025-01-12 20:20:58 -08:00
resource_kunit.c resource, kunit: fix user-after-free in resource_test_region_intersects() 2024-10-09 12:47:19 -07:00
rseq.c rseq: Fix segfault on registration when rseq_cs is non-zero 2025-03-06 22:26:49 +01:00
scftorture.c scftorture: Handle NULL argument passed to scf_add_to_free_list(). 2024-11-14 16:09:51 -08:00
scs.c
seccomp.c seccomp: avoid the lock trip seccomp_filter_release in common case 2025-02-24 11:17:10 -08:00
signal.c Updates for the core time/timer subsystem: 2025-03-25 10:33:23 -07:00
smp.c CSD-lock pull request for v6.14 2025-01-28 11:34:03 -08:00
smpboot.c
smpboot.h
softirq.c softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel 2024-12-02 12:01:27 +01:00
stackleak.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
stacktrace.c
static_call.c
static_call_inline.c x86/static-call: provide a way to do very early static-call updates 2024-12-13 09:28:32 +01:00
stop_machine.c stop-machine: Add comment for rcu_momentary_eqs() 2025-03-11 10:15:52 -07:00
sys.c Updates for the core time/timer subsystem: 2025-03-25 10:33:23 -07:00
sys_ni.c
sysctl-test.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
sysctl.c [ Merge note: this pull request depends on you having merged 2025-03-24 22:06:11 -07:00
task_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
taskstats.c fdget(), more trivial conversions 2024-11-03 01:28:06 -05:00
torture.c torture: Add get_torture_init_jiffies() for test-start time 2025-02-05 07:14:24 -08:00
tracepoint.c tracing: Fix syscall tracepoint use-after-free 2024-11-01 14:37:31 -04:00
tsacct.c
ucount.c ucounts: move kfree() out of critical zone protected by ucounts_lock 2025-01-12 20:21:00 -08:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user-return-notifier.c
user.c uidgid: make sure we fit into one cacheline 2024-09-12 12:16:09 +02:00
user_namespace.c uidgid: add map_id_range_up() 2025-02-12 12:12:27 +01:00
usermode_driver.c
utsname.c
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
vhost_task.c vhost: return task creation error instead of NULL 2025-03-01 02:52:52 -05:00
vmcore_info.c mm: support only one page_type per page 2024-09-03 21:15:43 -07:00
watch_queue.c vfs-6.15-rc1.pipe 2025-03-24 09:52:37 -07:00
watchdog.c watchdog/hardlockup/perf: Fix perf_event memory leak 2025-03-06 12:05:33 +01:00
watchdog_buddy.c
watchdog_perf.c watchdog/hardlockup/perf: Warn if watchdog_ev is leaked 2025-03-06 12:07:39 +01:00
workqueue.c workqueue: An update for v6.14-rc4 2025-02-26 14:22:47 -08:00
workqueue_internal.h