Commit graph

1174 commits

Author SHA1 Message Date
Kan Liang
e800ac5120 perf: Only dump the throttle log for the leader
The PERF_RECORD_THROTTLE records are dumped for all throttled events.
It's not necessary for group events, which are throttled altogether.

Optimize it by only dump the throttle log for the leader.

The sample right after the THROTTLE record must be generated by the
actual target event. It is good enough for the perf tool to locate the
actual target event.

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
2025-05-21 13:57:43 +02:00
Kan Liang
9734e25fbf perf: Fix the throttle logic for a group
The current throttle logic doesn't work well with a group, e.g., the
following sampling-read case.

$ perf record -e "{cycles,cycles}:S" ...

$ perf report -D | grep THROTTLE | tail -2
            THROTTLE events:        426  ( 9.0%)
          UNTHROTTLE events:        425  ( 9.0%)

$ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
... sample_read:
.... group nr 2
..... id 0000000000000327, value 000000000cbb993a, lost 0
..... id 0000000000000328, value 00000002211c26df, lost 0

The second cycles event has a much larger value than the first cycles
event in the same group.

The current throttle logic in the generic code only logs the THROTTLE
event. It relies on the specific driver implementation to disable
events. For all ARCHs, the implementation is similar. Only the event is
disabled, rather than the group.

The logic to disable the group should be generic for all ARCHs. Add the
logic in the generic code. The following patch will remove the buggy
driver-specific implementation.

The throttle only happens when an event is overflowed. Stop the entire
group when any event in the group triggers the throttle.
The MAX_INTERRUPTS is set to all throttle events.

The unthrottled could happen in 3 places.
- event/group sched. All events in the group are scheduled one by one.
  All of them will be unthrottled eventually. Nothing needs to be
  changed.
- The perf_adjust_freq_unthr_events for each tick. Needs to restart the
  group altogether.
- The __perf_event_period(). The whole group needs to be restarted
  altogether as well.

With the fix,
$ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
... sample_read:
.... group nr 2
..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0

Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
2025-05-21 13:57:42 +02:00
Kan Liang
ca559503b8 perf/core: Add the is_event_in_freq_mode() helper to simplify the code
Add a helper to check if an event is in freq mode to improve readability.

No functional changes.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250516182853.2610284-2-kan.liang@linux.intel.com
2025-05-17 10:02:27 +02:00
Frederic Weisbecker
881097c054 perf: Fix confusing aux iteration
While an event tears down all links to it as an aux, the iteration
happens on the event's group leader instead of the group itself.

If the event is a group leader, it has no effect because the event is
also its own group leader. But otherwise there would be a risk to detach
all the siblings events from the wrong group leader.

It just happens to work because each sibling's aux link is tested
against the right event before proceeding. Also the ctx lock is the same
for the events and their group leader so the iteration is safe.

Yet the iteration is confusing. Clarify the actual intent.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-5-frederic@kernel.org
2025-05-08 21:50:19 +02:00
Frederic Weisbecker
d20eb2d5fe perf: Fix irq work dereferencing garbage
The following commit:

	da916e96e2 ("perf: Make perf_pmu_unregister() useable")

has introduced two significant event's parent lifecycle changes:

1) An event that has exited now has EVENT_TOMBSTONE as a parent.
   This can result in a situation where the delayed wakeup irq_work can
   accidentally dereference EVENT_TOMBSTONE on:

CPU 0                                          CPU 1
-----                                          -----

__schedule()
    local_irq_disable()
    rq_lock()
    <NMI>
    perf_event_overflow()
        irq_work_queue(&child->pending_irq)
    </NMI>
    perf_event_task_sched_out()
        raw_spin_lock(&ctx->lock)
        ctx_sched_out()
        ctx->is_active = 0
        event_sched_out(child)
        raw_spin_unlock(&ctx->lock)
                                              perf_event_release_kernel(parent)
                                                  perf_remove_from_context(child)
                                                  raw_spin_lock_irq(&ctx->lock)
                                                  // Sees !ctx->is_active
                                                  // Removes from context inline
                                                  __perf_remove_from_context(child)
                                                      perf_child_detach(child)
                                                          event->parent = EVENT_TOMBSTONE
    raw_spin_rq_unlock_irq(rq);
    <IRQ>
    perf_pending_irq()
        perf_event_wakeup(child)
            ring_buffer_wakeup(child)
                rcu_dereference(child->parent->rb) <--- CRASH

This also concerns the call to kill_fasync() on parent->fasync.

2) The final parent reference count decrement can now happen before the
   the final child reference count decrement. ie: the parent can now
   be freed before its child. On PREEMPT_RT, this can result in a
   situation where the delayed wakeup irq_work can accidentally
   dereference a freed parent:

CPU 0                                          CPU 1                              CPU 2
-----                                          -----                              ------

perf_pmu_unregister()
    pmu_detach_events()
       pmu_get_event()
           atomic_long_inc_not_zero(&child->refcount)

                                               <NMI>
                                               perf_event_overflow()
                                                   irq_work_queue(&child->pending_irq);
                                               </NMI>
                                               <IRQ>
                                               irq_work_run()
                                                   wake_irq_workd()
                                               </IRQ>
                                               preempt_schedule_irq()
                                               =========> SWITCH to workd
                                               irq_work_run_list()
                                                   perf_pending_irq()
                                                       perf_event_wakeup(child)
                                                           ring_buffer_wakeup(child)
                                                               event = child->parent

                                                                                  perf_event_release_kernel(parent)
                                                                                      // Not last ref, PMU holds it
                                                                                      put_event(child)
                                                                                      // Last ref
                                                                                      put_event(parent)
                                                                                          free_event()
                                                                                              call_rcu(...)
                                                                                  rcu_core()
                                                                                      free_event_rcu()

                                                               rcu_dereference(event->rb) <--- CRASH

This also concerns the call to kill_fasync() on parent->fasync.

The "easy" solution to 1) is to check that event->parent is not
EVENT_TOMBSTONE on perf_event_wakeup() (including both ring buffer
and fasync uses).

The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
run under rcu_read_lock().

However because of 2), sanity would prescribe to make event::parent
an __rcu pointer and annotate each and every users to prove they are
reliable.

Propose an alternate solution and restore the stable pointer to the
parent until all its children have called _free_event() themselves to
avoid any further accident. Also revert the EVENT_TOMBSTONE design
that is mostly here to determine which caller of perf_event_exit_event()
must perform the refcount decrement on a child event matching the
increment in inherit_event().

Arrange instead for checking the attach state of an event prior to its
removal and decrement the refcount of the child accordingly.

Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-05-08 21:50:19 +02:00
Frederic Weisbecker
22d38babb3 perf: Fix failing inherit_event() doing extra refcount decrement on parent
When inherit_event() fails after the child allocation but before the
parent refcount has been incremented, calling put_event() wrongly
decrements the reference to the parent, risking to free it too early.

Also pmu_get_event() can't be holding a reference to the child
concurrently at this point since it is under pmus_srcu critical section.

Fix it with restoring the deleted free_event() function and call it on
the failing child in order to free it directly under the verified
assumption that its refcount is only 1. The refcount to the parent is
then voluntarily omitted.

Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-2-frederic@kernel.org
2025-05-08 21:50:18 +02:00
Qing Wang
f51972e6f8 perf/core: Fix broken throttling when max_samples_per_tick=1
According to the throttling mechanism, the pmu interrupts number can not
exceed the max_samples_per_tick in one tick. But this mechanism is
ineffective when max_samples_per_tick=1, because the throttling check is
skipped during the first interrupt and only performed when the second
interrupt arrives.

Perhaps this bug may cause little influence in one tick, but if in a
larger time scale, the problem can not be underestimated.

When max_samples_per_tick = 1:
Allowed-interrupts-per-second max-samples-per-second  default-HZ  ARCH
200                           100                     100         X86
500                           250                     250         ARM64
...
Obviously, the pmu interrupt number far exceed the user's expect.

Fixes: e050e3f0a7 ("perf: Fix broken interrupt rate throttling")
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com
2025-04-25 14:55:22 +02:00
Peter Zijlstra
1caafd919e Merge branch 'perf/urgent'
Merge urgent fixes for dependencies.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-04-25 14:55:20 +02:00
Namhyung Kim
0db61388b3 perf/core: Change to POLLERR for pinned events with error
Commit:

  f4b07fd62d ("perf/core: Use POLLHUP for pinned events in error")

started to emit POLLHUP for pinned events in an error state.

But the POLLHUP is also used to signal events that the attached task is
terminated.  To distinguish pinned per-task events in the error state
it would need to check if the task is live.

Change it to POLLERR to make it clear.

Suggested-by: Gabriel Marin <gmx@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250422223318.180343-1-namhyung@kernel.org
2025-04-23 09:39:06 +02:00
Peter Zijlstra
b02b41c827 perf/core: Fix event timekeeping merge
Due to an oversight in merging:

  da916e96e2 ("perf: Make perf_pmu_unregister() useable")

on top of:

  a3c3c66670 ("perf/core: Fix child_total_time_enabled accounting bug at task exit")

the timekeeping fix from this latter patch got undone.

Redo it.

Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250417080815.GI38216@noisy.programming.kicks-ass.net
2025-04-17 14:21:15 +02:00
Peter Zijlstra
162c9e3faf perf/core: Fix event->parent life-time issue
Due to an oversight in merging:

  da916e96e2 ("perf: Make perf_pmu_unregister() useable")

on top of:

  56799bc035 ("perf: Fix hang while freeing sigtrap event")

.. it is now possible to hit put_event(EVENT_TOMBSTONE), which makes
the computer sad.

This also means that for the event->parent == EVENT_TOMBSTONE, the
put_event() matching inherit_event() has gone missing.

Previously this was done in perf_event_release_kernel() after calling
perf_remove_from_context(), but with it delegated to put_event(), this
case is now entirely missed, leading to leaks.

Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Tested-by: James Clark <james.clark@linaro.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Closes: https://lore.kernel.org/oe-lkp/202504131701.941039cd-lkp@intel.com
Link: https://lkml.kernel.org/r/20250415131446.GN5600@noisy.programming.kicks-ass.net
2025-04-17 14:21:15 +02:00
Frederic Weisbecker
2839f393c6 perf/core: Fix put_ctx() ordering
So there are three situations:

* If perf_event_free_task() has removed all the children from the parent list
  before perf_event_release_kernel() got a chance to even iterate them, then
  it's all good as there is no get_ctx() pending.

* If perf_event_release_kernel() iterates a child event, but it gets freed
  meanwhile by perf_event_free_task() while the mutexes are temporarily
  unlocked, it's all good because while locking again the ctx mutex,
  perf_event_release_kernel() observes TASK_TOMBSTONE.

* But if perf_event_release_kernel() frees the child event before
  perf_event_free_task() got a chance, we may face this scenario:

    perf_event_release_kernel()                                  perf_event_free_task()
    --------------------------                                   ------------------------
    mutex_lock(&event->child_mutex)
    get_ctx(child->ctx)
    mutex_unlock(&event->child_mutex)

    mutex_lock(ctx->mutex)
    mutex_lock(&event->child_mutex)
    perf_remove_from_context(child)
    mutex_unlock(&event->child_mutex)
    mutex_unlock(ctx->mutex)

                                                                 // This lock acquires ctx->refcount == 2
                                                                 // visibility
                                                                 mutex_lock(ctx->mutex)
                                                                 ctx->task = TASK_TOMBSTONE
                                                                 mutex_unlock(ctx->mutex)

                                                                 wait_var_event()
                                                                     // enters prepare_to_wait() since
                                                                     // ctx->refcount == 2
                                                                     // is guaranteed to be seen
                                                                     set_current_state(TASK_INTERRUPTIBLE)
                                                                     smp_mb()
                                                                     if (ctx->refcount != 1)
                                                                         schedule()
    put_ctx()
       // NOT fully ordered! Only RELEASE semantics
       refcount_dec_and_test()
           atomic_fetch_sub_release()
       // So TASK_TOMBSTONE is not guaranteed to be seen
       if (ctx->task == TASK_TOMBSTONE)
           wake_up_var()

Basically it's a broken store buffer:

    perf_event_release_kernel()                                  perf_event_free_task()
    --------------------------                                   ------------------------
    ctx->task = TASK_TOMBSTONE                                   smp_store_release(&ctx->refcount, ctx->refcount - 1)
    smp_mb()
    READ_ONCE(ctx->refcount)                                     READ_ONCE(ctx->task)

So we need a smp_mb__after_atomic() before looking at ctx->task.

Fixes: 59f3aa4a3e ("perf: Simplify perf_event_free_task() wait")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/Z_ZvmEhjkAhplCBE@localhost.localdomain
2025-04-17 14:21:15 +02:00
Peter Zijlstra
f6938a562a perf/core: Fix perf-stat / read()
In the zeal to adjust all event->state checks to include the new
REVOKED state, one adjustment was made in error. Notably it resulted
in read() on the perf filedesc to stop working for any state lower
than ERROR, specifically EXIT.

This leads to problems with (among others) perf-stat, which wants to
read the counts after a program has finished execution.

Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
Reported-by: James Clark <james.clark@linaro.org>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/77036114-8723-4af9-a068-1d535f4e2e81@linaro.org
Link: https://lore.kernel.org/r/20250417080725.GH38216@noisy.programming.kicks-ass.net
2025-04-17 14:21:15 +02:00
Peter Zijlstra
da916e96e2 perf: Make perf_pmu_unregister() useable
Previously it was only safe to call perf_pmu_unregister() if there
were no active events of that pmu around -- which was impossible to
guarantee since it races all sorts against perf_init_event().

Rework the whole thing by:

 - keeping track of all events for a given pmu

 - 'hiding' the pmu from perf_init_event()

 - waiting for the appropriate (s)rcu grace periods such that all
   prior references to the PMU will be completed

 - detaching all still existing events of that pmu (see first point)
   and moving them to a new REVOKED state.

 - actually freeing the pmu data.

Where notably the new REVOKED state must inhibit all event actions
from reaching code that wants to use event->pmu.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.525402029@infradead.org
2025-04-08 20:55:48 +02:00
Peter Zijlstra
4da0600eda perf: Rename perf_event_exit_task(.child)
The task passed to perf_event_exit_task() is not a child, it is
current. Fix this confusing naming, since much of the rest of the code
also relies on it being current.

Specifically, both exec() and exit() callers use it with current as
the argument.

Notably, task_ctx_sched_out() doesn't make much sense outside of
current.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
2025-04-08 20:55:48 +02:00
Peter Zijlstra
9066136502 perf: Unify perf_event_free_task() / perf_event_exit_task_context()
Both perf_event_free_task() and perf_event_exit_task_context() are
very similar, except perf_event_exit_task_context() is a little more
generic / makes less assumptions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.274039710@infradead.org
2025-04-08 20:55:47 +02:00
Peter Zijlstra
3e8671e00e perf: Simplify perf_event_release_kernel()
There is no good reason to have the free list anymore. It is possible
to call free_event() after the locks have been dropped in the main
loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.151721102@infradead.org
2025-04-08 20:55:47 +02:00
Peter Zijlstra
59f3aa4a3e perf: Simplify perf_event_free_task() wait
Simplify the code by moving the duplicated wakeup condition into
put_ctx().

Notably, wait_var_event() is in perf_event_free_task() and will have
set ctx->task = TASK_TOMBSTONE.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.044499344@infradead.org
2025-04-08 20:55:46 +02:00
Peter Zijlstra
0a00a43b8c perf: Simplify child event tear-down
Currently perf_event_release_kernel() will iterate the child events and attempt
tear-down. However, it removes them from the child_list using list_move(),
notably skipping the state management done by perf_child_detach().

Crucially, it fails to clear PERF_ATTACH_CHILD, which opens the door for a
concurrent perf_remove_from_context() to race.

This way child_list management stays fully serialized using child_mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
2025-04-08 20:55:46 +02:00
Peter Zijlstra
7ed9138a72 perf: Ensure bpf_perf_link path is properly serialized
Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
2025-04-08 20:55:46 +02:00
Frederic Weisbecker
56799bc035 perf: Fix hang while freeing sigtrap event
Perf can hang while freeing a sigtrap event if a related deferred
signal hadn't managed to be sent before the file got closed:

perf_event_overflow()
   task_work_add(perf_pending_task)

fput()
   task_work_add(____fput())

task_work_run()
    ____fput()
        perf_release()
            perf_event_release_kernel()
                _free_event()
                    perf_pending_task_sync()
                        task_work_cancel() -> FAILED
                        rcuwait_wait_event()

Once task_work_run() is running, the list of pending callbacks is
removed from the task_struct and from this point on task_work_cancel()
can't remove any pending and not yet started work items, hence the
task_work_cancel() failure and the hang on rcuwait_wait_event().

Task work could be changed to remove one work at a time, so a work
running on the current task can always cancel a pending one, however
the wait / wake design is still subject to inverted dependencies when
remote targets are involved, as pictured by Oleg:

T1                                                      T2

fd = perf_event_open(pid => T2->pid);                  fd = perf_event_open(pid => T1->pid);
close(fd)                                              close(fd)
    <IRQ>                                                  <IRQ>
    perf_event_overflow()                                  perf_event_overflow()
       task_work_add(perf_pending_task)                        task_work_add(perf_pending_task)
    </IRQ>                                                 </IRQ>
    fput()                                                 fput()
        task_work_add(____fput())                              task_work_add(____fput())

    task_work_run()                                        task_work_run()
        ____fput()                                             ____fput()
            perf_release()                                         perf_release()
                perf_event_release_kernel()                            perf_event_release_kernel()
                    _free_event()                                          _free_event()
                        perf_pending_task_sync()                               perf_pending_task_sync()
                            rcuwait_wait_event()                                   rcuwait_wait_event()

Therefore the only option left is to acquire the event reference count
upon queueing the perf task work and release it from the task work, just
like it was done before 3a5465418f ("perf: Fix event leak upon exec and file release")
but without the leaks it fixed.

Some adjustments are necessary to make it work:

* A child event might dereference its parent upon freeing. Care must be
  taken to release the parent last.

* Some places assuming the event doesn't have any reference held and
  therefore can be freed right away must instead put the reference and
  let the reference counting to its job.

Reported-by: "Yi Lai" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/Zx9Losv4YcJowaP%2F@ly-workstation/
Reported-by: syzbot+3c4321e10eea460eb606@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/673adf75.050a0220.87769.0024.GAE@google.com/
Fixes: 3a5465418f ("perf: Fix event leak upon exec and file release")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250304135446.18905-1-frederic@kernel.org
2025-04-08 20:55:43 +02:00
Gabriel Shahrouzi
0ba3a4ab76 perf/core: Fix WARN_ON(!ctx) in __free_event() for partial init
Move the get_ctx(child_ctx) call and the child_event->ctx assignment to
occur immediately after the child event is allocated. Ensure that
child_event->ctx is non-NULL before any subsequent error path within
inherit_event calls free_event(), satisfying the assumptions of the
cleanup code.

Details:

There's no clear Fixes tag, because this bug is a side-effect of
multiple interacting commits over time (up to 15 years old), not
a single regression.

The code initially incremented refcount then assigned context
immediately after the child_event was created. Later, an early
validity check for child_event was added before the
refcount/assignment. Even later, a WARN_ON_ONCE() cleanup check was
added, assuming event->ctx is valid if the pmu_ctx is valid.
The problem is that the WARN_ON_ONCE() could trigger after the initial
check passed but before child_event->ctx was assigned, violating its
precondition. The solution is to assign child_event->ctx right after
its initial validation. This ensures the context exists for any
subsequent checks or cleanup routines, resolving the WARN_ON_ONCE().

To resolve it, defer the refcount update and child_event->ctx assignment
directly after child_event->pmu_ctx is set but before checking if the
parent event is orphaned. The cleanup routine depends on
event->pmu_ctx being non-NULL before it verifies event->ctx is
non-NULL. This also maintains the author's original intent of passing
in child_ctx to find_get_pmu_context before its refcount/assignment.

[ mingo: Expanded the changelog from another email by Gabriel Shahrouzi. ]

Reported-by: syzbot+ff3aa851d46ab82953a3@syzkaller.appspotmail.com
Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lore.kernel.org/r/20250405203036.582721-1-gshahrouzi@gmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ff3aa851d46ab82953a3
2025-04-06 20:30:28 +02:00
Yeoreum Yun
a3c3c66670 perf/core: Fix child_total_time_enabled accounting bug at task exit
The perf events code fails to account for total_time_enabled of
inactive events.

Here is a failure case for accounting total_time_enabled for
CPU PMU events:

  sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
  ...

  armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
  armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
                             `          `          `
                             `          `          > total_time_running with child
                             `          > total_time_enabled with child
                             > count with child

  Performance counter stats for 'stress-ng --pthread=2 -t 2s':

       1,138,698,008      armv8_pmuv3_0/event=0x08/                                               (94.99%)
       1,826,791,390      armv8_pmuv3_1/event=0x08/                                               (43.47%)

The two events above are opened on two different CPU PMUs, for example,
each event is opened for a cluster in an Arm big.LITTLE system, they
will never run on the same CPU.  In theory, the total enabled time should
be same for both events, as two events are opened and closed together.

As the result show, the two events' total enabled time including
child event is different (2289429840 vs 1950025700).

This is because child events are not accounted properly
if a event is INACTIVE state when the task exits:

  perf_event_exit_event()
   `> perf_remove_from_context()
     `> __perf_remove_from_context()
       `> perf_child_detach()   -> Accumulate child_total_time_enabled
         `> list_del_event()    -> Update child event's time

The problem is the time accumulation happens prior to child event's
time updating. Thus, it misses to account the last period's time when
the event exits.

The perf core layer follows the rule that timekeeping is tied to state
change. To address the issue, make __perf_remove_from_context()
handle the task exit case by passing 'DETACH_EXIT' to it and
invoke perf_event_state() for state alongside with accounting the time.

Then, perf_child_detach() populates the time into the parent's time metrics.

After this patch, the bug is fixed:

  sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 10s
  ...
  armv8_pmuv3_0/event=0x08/: 15396770398 32157963940 21898169000
  armv8_pmuv3_1/event=0x08/: 22428964974 32157963940 10259794940

   Performance counter stats for 'stress-ng --pthread=2 -t 10s':

      15,396,770,398      armv8_pmuv3_0/event=0x08/                                               (68.10%)
      22,428,964,974      armv8_pmuv3_1/event=0x08/                                               (31.90%)

[ mingo: Clarified the changelog. ]

Fixes: ef54c1a476 ("perf: Rework perf_event_exit_event()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20250326082003.1630986-1-yeoreum.yun@arm.com
2025-03-31 12:57:38 +02:00
Linus Torvalds
054570267d lsm/stable-6.15 PR 20250323
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmfgWgMUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNW5RAAvCDq5gBtY0aTNlULe637EVLSh+t8
 PkSzHzu/NlzU6BfjtwSm2fuML8welTGxSwUPxUzMCI91gPdkGeFktefavT3xa+QI
 BHWROn7fEJ/KmRZvngPeIkgLr5xhF5nBJmc/Jw71qem20zRzNgJnpzMX16d10Phx
 dxd2xOO1qM3bv6Z9RcIssZRGaN+PHngpWWg+0B69XuaBUso87S6NDyKNn1XPmvoz
 as96k+Wk/xAZGVEeCbs/+H5rBx6DLg+FfTRa06Oh4BFsqedpkDPxLrTgCJGJkA0H
 dsK6O/993zvjx0Jn4ZPoJ9n35S82BmkCsz4bGq1xVl6FYUiMcm3/8yO41wllS+w4
 j+RlTU/RIdB7n8EKyMMl1hj1stTvt3Bi9F5Cbf7ZEv0snfR00K4KVpi17jnFjUHv
 kpOiEtXZb/NGQip7UAuUq0PisfqbiO4jJurYHRetDgv1WCy6+C8ufM5t6I+cnvmG
 VG+dlxcW+rDIn6bLRVuGi9TJRsQ6eox9ipa+qEKNNiOXgftELcgT7m74nAS5m0uv
 n5rDa221nPXecEB0X7d6YUFk711lly90dbelNeLrmv1w6jl8L1PpS1oBaW+UzGu9
 46eGBd6pzu9otvK9WVyDEdotDOCrgH0sd7pTetqDhLJZ7KrGwyyqO2gD/JroUKcC
 lnxBQwPnat86iI8=
 =oxfV
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings

   Changes include marking trivial Rust bindings as inlines and comment
   tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed()

   Similar to the io_uring_disabled sysctl, add a LSM hook to
   io_uring_allowed() to enable LSMs a simple way to enforce security
   policy on the use of io_uring. This pull request includes SELinux
   support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook

   The perf_event_attr struct parameter was not used by any currently
   supported LSMs, remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code

   We've seen problems in the past where patches to the credentials code
   sent by non-maintainers would often languish on the lists for
   multiple months as there was no one explicitly tasked with the
   responsibility of reviewing and/or merging credentials related code.

   Considering that most of the code under security/ has a vested
   interest in ensuring that the credentials code is well maintained,
   I'm volunteering to look after the credentials code and Serge Hallyn
   has also volunteered to step up as an official reviewer. I posted the
   MAINTAINERS update as a RFC to LKML in hopes that someone else would
   jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion

   This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()
2025-03-25 15:44:19 -07:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Kan Liang
bd2da08d93 perf: Clean up pmu specific data
The pmu specific data is saved in task_struct now. Remove it from event
context structure.

Remove swap_task_ctx() as well.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-7-kan.liang@linux.intel.com
2025-03-17 11:23:38 +01:00
Kan Liang
d57e94f5b8 perf: Supply task information to sched_task()
To save/restore LBR call stack data in system-wide mode, the task_struct
information is required.

Extend the parameters of sched_task() to supply task_struct information.

When schedule in, the LBR call stack data for new task will be restored.
When schedule out, the LBR call stack data for old task will be saved.
Only need to pass the required task_struct information.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
2025-03-17 11:23:37 +01:00
Kan Liang
506e64e710 perf: attach/detach PMU specific data
The LBR call stack data has to be saved/restored during context switch
to fix the shorter LBRs call stacks issue in the  system-wide mode.
Allocate PMU specific data and attach them to the corresponding
task_struct during LBR call stack monitoring.

When a LBR call stack event is accounted, the perf_ctx_data for the
related tasks will be allocated/attached by attach_perf_ctx_data().
When a LBR call stack event is unaccounted, the perf_ctx_data for
related tasks will be detached/freed by detach_perf_ctx_data().

The LBR call stack event could be a per-task event or a system-wide
event.
- For a per-task event, perf only allocates the perf_ctx_data for the
  current task. If the allocation fails, perf will error out.
- For a system-wide event, perf has to allocate the perf_ctx_data for
  both the existing tasks and the upcoming tasks.
  The allocation for the existing tasks is done in perf_event_alloc().
  If any allocation fails, perf will error out.
  The allocation for the new tasks will be done in perf_event_fork().
  A global reader/writer semaphore, global_ctx_data_rwsem, is added to
  address the global race.
- The perf_ctx_data only be freed by the last LBR call stack event.
  The number of the per-task events is tracked by refcount of each task.
  Since the system-wide events impact all tasks, it's not practical to
  go through the whole task list to update the refcount for each
  system-wide event. The number of system-wide events is tracked by a
  global variable global_ctx_data_ref.

Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-3-kan.liang@linux.intel.com
2025-03-17 11:23:37 +01:00
Kan Liang
cb43691293 perf: Save PMU specific data in task_struct
Some PMU specific data has to be saved/restored during context switch,
e.g. LBR call stack data. Currently, the data is saved in event context
structure, but only for per-process event. For system-wide event,
because of missing the LBR call stack data after context switch, LBR
callstacks are always shorter in comparison to per-process mode.

For example,
  Per-process mode:
  $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

  -   99.90%    99.86%  tchain_edit  tchain_edit       [.] f3
       99.86% _start
          __libc_start_main
          generic_start_main
          main
          f1
        - f2
             f3

  System-wide mode:
  $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

  -   99.88%    99.82%  tchain_edit  tchain_edit        [.] f3
   - 62.02% main
        f1
        f2
        f3
   - 28.83% f1
      - f2
        f3
   - 28.83% f1
      - f2
           f3
   - 8.88% generic_start_main
        main
        f1
        f2
        f3

It isn't practical to simply allocate the data for system-wide event in
CPU context structure for all tasks. We have no idea which CPU a task
will be scheduled to. The duplicated LBR data has to be maintained on
every CPU context structure. That's a huge waste. Otherwise, the LBR
data still lost if the task is scheduled to another CPU.

Save the pmu specific data in task_struct. The size of pmu specific data
is 788 bytes for LBR call stack. Usually, the overall amount of threads
doesn't exceed a few thousands. For 10K threads, keeping LBR data would
consume additional ~8MB. The additional space will only be allocated
during LBR call stack monitoring. It will be released when the
monitoring is finished.

Furthermore, moving task_ctx_data from perf_event_context to task_struct
can reduce complexity and make things clearer. E.g. perf doesn't need to
swap task_ctx_data on optimized context switch path.
This patch set is just the first step. There could be other
optimization/extension on top of this patch set. E.g. for cgroup
profiling, perf just needs to save/store the LBR call stack information
for tasks in specific cgroup. That could reduce the additional space.
Also, the LBR call stack can be available for software events, or allow
even debugging use cases, like LBRs on crash later.

Because of the alignment requirement of Intel Arch LBR, the Kmem cache
is used to allocate the PMU specific data. It's required when child task
allocates the space. Save it in struct perf_ctx_data.
The refcount in struct perf_ctx_data is used to track the users of pmu
specific data.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Link: https://lore.kernel.org/r/20250314172700.438923-1-kan.liang@linux.intel.com
2025-03-17 11:23:36 +01:00
Namhyung Kim
f4b07fd62d perf/core: Use POLLHUP for pinned events in error
Pinned performance events can enter an error state when they fail to be
scheduled in the context due to a failed constraint or some other conflict
or condition.

In error state these events won't generate any samples anymore and are
silently ignored until they are recovered by PERF_EVENT_IOC_ENABLE,
or the condition can also change so that they can be scheduled in.

Tooling should be allowed to know about the state change, but
currently there's no mechanism to notify tooling when events enter
an error state.

One way to do this is to issue a POLLHUP event to poll(2) to handle this.
Reading events in an error state would return 0 (EOF) and it matches to
the behavior of POLLHUP according to the man page.

Tooling should remove the fd of the event from pollfd after getting
POLLHUP, otherwise it'll be returned repeatedly.

[ mingo: Clarified the changelog ]

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317061745.1777584-1-namhyung@kernel.org
2025-03-17 08:31:03 +01:00
XieLudan
b6ecb57f1f perf/core: Use sysfs_emit() instead of scnprintf()
Follow the advice in Documentation/filesystems/sysfs.rst:

  "- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
     the value to be returned to user space."

No change in functionality intended.

[ mingo: Updated the changelog ]

Signed-off-by: XieLudan <xie.ludan@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250315141738452lXIH39UJAXlCmcATCzcBv@zte.com.cn
2025-03-16 12:38:27 +01:00
Thorsten Blum
fd3f5d385a perf/core: Remove optional 'size' arguments from strscpy() calls
The 'size' parameter is optional and strscpy() automatically determines
the length of the destination buffer using sizeof() if the argument is
omitted. This makes the explicit sizeof() calls unnecessary.

Furthermore, KSYM_NAME_LEN is equal to sizeof(name) and can also be
removed. Remove them to shorten and simplify the code.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250310192336.442994-1-thorsten.blum@linux.dev
2025-03-10 20:50:30 +01:00
Peter Zijlstra
da02f54e81 perf/core: Clean up perf_try_init_event()
Make sure that perf_try_init_event() doesn't leave event->pmu nor
event->destroy set on failure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250205102449.110145835@infradead.org
2025-03-05 12:13:59 +01:00
Peter Zijlstra
66477c7230 perf/core: Fix perf_mmap() failure path
When f_ops->mmap() returns failure, m_ops->close() is *not* called.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.248358497@infradead.org
2025-03-04 09:43:26 +01:00
Peter Zijlstra
4eabf533fb perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
In prepration for being able to unregister a PMU with existing events,
it becomes important to detach struct perf_cpu_pmu_context lifetimes
from that of struct pmu.

Notably struct perf_cpu_pmu_context embeds a struct perf_event_pmu_context
that can stay referenced until the last event goes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.760214287@infradead.org
2025-03-04 09:43:22 +01:00
Peter Zijlstra
0983593f32 perf/core: Lift event->mmap_mutex in perf_mmap()
This puts 'all' of perf_mmap() under single event->mmap_mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.582252957@infradead.org
2025-03-04 09:43:19 +01:00
Peter Zijlstra
8eaec7bb72 perf/core: Remove retry loop from perf_mmap()
AFAICT there is no actual benefit from the mutex drop on re-try. The
'worst' case scenario is that we instantly re-gain the mutex without
perf_mmap_close() getting it. So might as well make that the normal
case.

Reflow the code to make the ring buffer detach case naturally flow
into the no ring buffer case.

[ mingo: Forward ported it ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.463607258@infradead.org
2025-03-04 09:43:15 +01:00
Peter Zijlstra
0c8a4e4139 perf/core: Further simplify perf_mmap()
Perform CSE and such.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.354909594@infradead.org
2025-03-04 09:43:10 +01:00
Peter Zijlstra
954878377b perf/core: Simplify the perf_mmap() control flow
Identity-transform:

	if (c) {
		X1;
	} else {
		Y;
		goto l;
	}

	X2;
  l:

into the simpler:

	if (c) {
		X1;
		X2;
	} else {
		Y;
	}

[ mingo: Forward ported it ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.095904637@infradead.org
2025-03-04 09:43:05 +01:00
Peter Zijlstra
c5b9678957 perf/bpf: Robustify perf_event_free_bpf_prog()
Ensure perf_event_free_bpf_prog() is safe to call a second time;
notably without making any references to event->pmu when there is no
prog left.

Note: perf_event_detach_bpf_prog() might leave a stale event->prog

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241104135518.978956692@infradead.org
2025-03-04 09:42:59 +01:00
Peter Zijlstra
adc38b4ca1 perf/core: Introduce perf_free_addr_filters()
Replace _free_event()'s use of perf_addr_filters_splice()s use with an
explicit perf_free_addr_filters() with the explicit propery that it is
able to be called a second time without ill effect.

Most notable, referencing event->pmu must be avoided when there are no
filters left (from eg a previous call).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.868460518@infradead.org
2025-03-04 09:42:55 +01:00
Peter Zijlstra
b2996f5655 perf/core: Add this_cpc() helper
As a preparation for adding yet another indirection.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.650051565@infradead.org
2025-03-04 09:42:51 +01:00
Peter Zijlstra
4baeb0687a perf/core: Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count
Because it makes no sense to have two per-cpu allocations per pmu.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.518730578@infradead.org
2025-03-04 09:42:47 +01:00
Peter Zijlstra
8f2221f52e perf/core: Simplify perf_event_alloc()
Using the previous simplifications, transition perf_event_alloc() to
the cleanup way of things -- reducing error path magic.

[ mingo: Ported it to recent kernels. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.410755241@infradead.org
2025-03-04 09:42:40 +01:00
Peter Zijlstra
caf8b765d4 perf/core: Simplify perf_init_event()
Use the <linux/cleanup.h> guard() and scoped_guard() infrastructure
to simplify the control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241104135518.302444446@infradead.org
2025-03-04 09:42:32 +01:00
Peter Zijlstra
6c8b0b835f perf/core: Simplify perf_pmu_register()
Using the previously introduced perf_pmu_free() and a new IDR helper,
simplify the perf_pmu_register error paths.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.198937277@infradead.org
2025-03-04 09:42:29 +01:00
Peter Zijlstra
8f4c4963d2 perf/core: Simplify the perf_pmu_register() error path
The error path of perf_pmu_register() is of course very similar to a
subset of perf_pmu_unregister(). Extract this common part in
perf_pmu_free() and simplify things.

[ mingo: Forward ported it ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.090915501@infradead.org
2025-03-04 09:42:26 +01:00
Peter Zijlstra
c70ca29803 perf/core: Simplify the perf_event_alloc() error path
The error cleanup sequence in perf_event_alloc() is a subset of the
existing _free_event() function (it must of course be).

Split this out into __free_event() and simplify the error path.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135517.967889521@infradead.org
2025-03-04 09:42:14 +01:00
Ingo Molnar
ef2f798600 Merge branch 'perf/urgent' into perf/core, to pick up dependent patches and fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-01 19:41:14 +01:00
Peter Zijlstra
003659fec9 perf/core: Fix perf_pmu_register() vs. perf_init_event()
There is a fairly obvious race between perf_init_event() doing
idr_find() and perf_pmu_register() doing idr_alloc() with an
incompletely initialized PMU pointer.

Avoid by doing idr_alloc() on a NULL pointer to register the id, and
swizzling the real struct pmu pointer at the end using idr_replace().

Also making sure to not set struct pmu members after publishing
the struct pmu, duh.

[ introduce idr_cmpxchg() in order to better handle the idr_replace()
  error case -- if it were to return an unexpected pointer, it will
  already have replaced the value and there is no going back. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241104135517.858805880@infradead.org
2025-03-01 19:38:42 +01:00