Commit graph

Uros Bizjak
c02d5546ea sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
Use try_cmpxchg() instead of cmpxchg(*ptr, old, new) != old in
set_nr_{and_not,if}_polling(). x86 cmpxchg returns success in the ZF flag,
so this change saves a compare after cmpxchg.

The definition of the cmpxchg-based fetch_or() was changed in the
same way as the atomic_fetch_##op definitions were changed
in e6790e4b5d.

Also declare these two functions as inline to ensure inlining. In the
case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart
itself by constructing the boolean return value with logic operations
on the fetched value, and these extra operations enlarge the function
over the inlining threshold value.
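
As an illustration of the pattern (a minimal sketch with made-up 'ptr' and
'val' names, not the exact kernel diff), the conversion removes the extra
compare because try_cmpxchg() returns a bool and refreshes 'old' on failure:

	/* before: compare the returned old value explicitly */
	for (;;) {
		typeof(*ptr) old = READ_ONCE(*ptr);

		if (cmpxchg(ptr, old, old | val) == old)
			break;
	}

	/* after: no re-read needed, try_cmpxchg() updates 'old' in place */
	typeof(*ptr) old = READ_ONCE(*ptr);

	do {
	} while (!try_cmpxchg(ptr, &old, old | val));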

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
2022-07-04 09:23:08 +02:00
Josh Don
1fcf54deb7 sched/core: add forced idle accounting for cgroups
4feee7d126 previously added per-task forced idle accounting. This patch
extends this to also include cgroups.

rstat is used for cgroup accounting, except for the root, which uses
kcpustat in order to bypass the need for doing an rstat flush when
reading root stats.

Only cgroup v2 is supported. Similar to the task accounting, the cgroup
accounting requires that schedstats is enabled.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
2022-07-04 09:23:07 +02:00
Vincent Donnefort
b812fc9768 sched/fair: Remove the energy margin in feec()
find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth between CPUs. This margin is set to 6% of the
total current energy estimated on the system. This, however, does not
work, for two reasons:

1. The energy estimation is not a good absolute value:

compute_energy() used in feec() is a good estimation for task placement as
it allows to compare the energy with and without a task. The computed
delta will give a good overview of the cost for a certain task placement.
It, however, doesn't work as an absolute estimation for the total energy
of the system. First, it adds the contribution of idle CPUs into the
energy; second, it mixes util_avg with util_est values. util_avg contains
the recent history of a CPU's usage; it doesn't tell what the current
utilization is. A system that has been quite busy in the recent past will
hold a very high energy and thus a high margin, preventing any task
migration to a lower-capacity CPU and wasting energy. It even creates a
negative feedback loop: by holding the tasks on a less efficient CPU, the
margin contributes to keeping the energy high.

2. The margin handicaps small tasks:

On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin none of those tasks can cross. On a Pixel4, a small
utilization of 5% on all the CPUs creates a global estimated energy of 140
joules, as per the Energy Model declaration of that same device. This
means that, after applying the 6% margin, any migration must save more
than 8 joules to happen. No task with a utilization lower than 40 would
then be able to migrate away from the biggest CPU of the system.

The 6% of the overall system energy was brought by the following patch:

 (eb92692b25 sched/fair: Speed-up energy-aware wake-ups)

It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:

 (8d4c97c105 sched/fair: Only compute base_energy_pd if necessary)

We could simply revert that margin change to what it was, but the original
version didn't have strong grounds either, and as demonstrated in (1.) the
estimated energy isn't a good absolute value. Instead, remove it
completely. This is made possible by recent changes that improved the
fairness of energy estimation comparisons (sched/fair: Remove task_util
from effective utilization in feec()), the energy calculation precision
(PM: EM: Increase energy calculation precision) and task utilization
stabilization (sched/fair: Decay task util_avg during migration).

Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue.

Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
2022-06-28 09:17:48 +02:00
Vincent Donnefort
3e8c6c9aac sched/fair: Remove task_util from effective utilization in feec()
The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this PD. The function effective_cpu_util(), which gives this value,
scales the utilization relative to IRQ pressure on the CPU to take into
account that the IRQ time is hidden from the task clock. The IRQ scaling is
as follows:

   effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.

If now we take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:

  delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                             effective_cpu_util(cpu_util))
        = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util

We end up with an energy delta that depends on the IRQ avg time, which is a
problem: first, the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task; second, we don't want
to favour a CPU with a higher IRQ avg time value.

Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.

No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.
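
As a hedged sketch of the scaling described above (the helper name and
variables are illustrative, not the actual kernel code), the task
contribution is kept out of effective_cpu_util() and scaled by the share
of prev_cpu's time that is not consumed by IRQs:

	/*
	 * Sketch: task contribution to the PD busy time, scaled by
	 * prev_cpu's non-IRQ share of time, per the formula above.
	 */
	static unsigned long scaled_task_busy_time(unsigned long task_util,
						   unsigned long irq,
						   unsigned long cpu_cap)
	{
		if (irq >= cpu_cap)
			return 0;	/* CPU time fully consumed by IRQs */

		return task_util * (cpu_cap - irq) / cpu_cap;
	}

	/* total busy time ~= sum of per-CPU contributions + scaled task part */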

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
2022-06-28 09:17:47 +02:00
Dietmar Eggemann
9b340131a4 sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.

That's why the PD mask is used in conjunction with the cpu_online_mask
(or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
multiple times (in compute_energy()) during a run-queue selection
for a task.

cpu_online_mask may change during this time which can lead to wrong
energy calculations.

To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.

The PD cpumask for the max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in sync with the one used for the energy computation
(compute_energy()).

An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.
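
A hedged sketch of the idea (not the exact diff; the loop structure is
abridged): the PD's currently online CPUs are captured once into the
per-cpu scratch cpumask and that snapshot is handed down the EAS path:

	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);

	for (; pd; pd = pd->next) {
		/* snapshot the PD's online CPUs once per wakeup */
		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
		if (cpumask_empty(cpus))
			continue;

		/*
		 * ... iterate 'cpus' for max_spare_cap_cpu/compute_prev_delta
		 * and pass the same mask to compute_energy() ...
		 */
	}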

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
2022-06-28 09:17:47 +02:00
Dietmar Eggemann
ec4fc801a0 sched/fair: Rename select_idle_mask to select_rq_mask
On 21/06/2022 11:04, Vincent Donnefort wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>

https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered
that this patch doesn't build anymore (on tip sched/core or linux-next)
because of commit f5b2eeb499 ("sched/fair: Consider CPU affinity when
allowing NUMA imbalance in find_idlest_group()").

New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to
select_rq_mask below.

-- >8 --

Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).

This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
2022-06-28 09:17:47 +02:00
Dietmar Eggemann
bb44799949 sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
effective_cpu_util() already has an `int cpu' parameter, which allows the
CPU capacity scale factor (or maximum CPU capacity) to be retrieved inside
this function via arch_scale_cpu_capacity(cpu).

A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of the arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read via
per_cpu(cpu_scale, cpu) on such systems.
On all other architectures it is a compile-time constant
(SCHED_CAPACITY_SCALE).
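
A hedged sketch of the resulting interface change (parameter lists
abridged and possibly inexact):

	/* before: callers had to pass the maximum capacity explicitly */
	unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
					 unsigned long max,
					 enum cpu_util_type type,
					 struct task_struct *p);

	/* after: the maximum capacity is derived inside the function */
	unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
					 enum cpu_util_type type,
					 struct task_struct *p)
	{
		unsigned long max = arch_scale_cpu_capacity(cpu);
		/* ... */
	}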

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
2022-06-28 09:17:46 +02:00
Vincent Donnefort
e2f3e35f1f sched/fair: Decay task PELT values during wakeup migration
Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling: it currently takes one tick after the CPU goes idle
for clock_pelt to catch up with clock_task.

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value.

To that end, we need sched_clock_cpu() but it is a costly function. Limit
the usage to the case where the source CPU is idle as we know this is when
the clock is having the biggest risk of being outdated.

See comment in migrate_se_pelt_lag() for more details about how the PELT
value is estimated. Notice though this estimation doesn't take into account
IRQ and Paravirt time.

[1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com
2022-06-28 09:17:46 +02:00
Vincent Donnefort
d05b43059d sched/fair: Provide u64 read for 32-bits arch helper
Introduce macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables on 32-bit architectures.

The users for now are cfs_rq.min_vruntime and sched_avg.last_update_time. To
accommodate the latter, where the copy lies outside of the structure
(cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.

Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
therefore have a small penalty on 32-bit machines in set_task_rq_fair()
and init_cfs_rq().
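
A simplified sketch of the pattern behind these helpers (not the exact
kernel macros): the writer publishes the value and its copy around an
smp_wmb(), and the reader retries until both reads agree:

	/* writer side */
	#define u64_u32_store_copy(var, copy, val)		\
	do {							\
		u64 __val = (val);				\
		var = __val;					\
		smp_wmb();	/* order var before copy */	\
		copy = __val;					\
	} while (0)

	/* reader side */
	#define u64_u32_load_copy(var, copy)			\
	({							\
		u64 __var, __copy;				\
		do {						\
			__copy = copy;				\
			smp_rmb();	/* pairs with smp_wmb() */	\
			__var = var;				\
		} while (__var != __copy);			\
		__var;						\
	})

On 64-bit these helpers can simply collapse to a plain load/store.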

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
2022-06-28 09:17:46 +02:00
Chen Yu
70fb5ccf2e sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, if the system becomes busy, the
SIS Search Efficiency (se_eff%) drops significantly. Although some
benchmarks do eventually find an idle CPU (success_rate% = 100%), it is
doubtful whether it is worth searching the whole LLC domain.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
how many CPUs must a wakeup path walk down before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
helps predict the load of the LLC domain more precisely than SIS_PROP,
which uses one CPU's idle time to estimate the idle time of the whole
LLC domain.

In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for an idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this corresponds to the
imbalance_pct (117) when an LLC sched group is overloaded.

Introduce the quadratic function:

y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:

nr_scan = llc_weight * y'

Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
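
A self-contained sketch of this mapping (the constant p and the
fixed-point handling are illustrative assumptions, so boundary values may
differ from the tables above by one due to rounding):

	#define SCHED_CAPACITY_SCALE	1024

	/* nr_scan = llc_weight * y', y' = (SCHED_CAPACITY_SCALE - p*x^2) / SCHED_CAPACITY_SCALE */
	static int sis_util_nr_scan(unsigned long sum_util, int llc_weight)
	{
		/* x: ratio of sum_util to LLC capacity, scaled to [0, 1024] */
		unsigned long x = sum_util / llc_weight;
		/* p chosen so that y reaches 0 once x crosses ~85% */
		unsigned long p = SCHED_CAPACITY_SCALE * 10000 / (85 * 85);
		long y;

		if (x > SCHED_CAPACITY_SCALE)
			x = SCHED_CAPACITY_SCALE;

		y = SCHED_CAPACITY_SCALE -
		    (long)(p * x * x / SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE);
		if (y < 0)
			y = 0;

		return llc_weight * (int)y / SCHED_CAPACITY_SCALE;
	}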

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs writing/reading the per-LLC shared
variable introduces cache contention. Tim also mentioned that
it is allowed to be non-optimal in terms of scheduling for
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.

This patch is based on util_avg, which is very sensitive to CPU frequency
invariance. There is an issue that, when the max frequency has been
clamped, the util_avg decays insanely fast when the CPU is idle.
Commit addca28512 ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.

[Test result]

netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%,
175% and 200% of the CPU number respectively. Hackbench and schbench were
launched with 1, 2, 4 and 8 groups. Each test lasts 100 seconds and repeats
3 times.

The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)

Taking the 224 threads case as an example, the SIS search metric changes
are illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There are 87.9% fewer CPU scans after the patch, which indicates lower
overhead. Besides, with this patch applied, there is 13% less rq lock
contention in
perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement, because this patch
allows the waking task to remain on the previous CPU rather than grabbing
other CPUs' locks.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not be valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, and thus CPU7 goes undetected during the
scan. But according to Mel, it is unlikely that CPU7 will be idle all the
time, because CPU7 could pull some tasks via CPU_NEWLY_IDLE balancing.

lkp (kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminating the scan and choosing a previous CPU earlier, and this
might introduce more context switches, especially involuntary preemption, which
impacts a busy stress-ng. This regression has shown that not all benchmarks
in every scenario benefit from the idle CPU scan limit, and it needs further
investigation.

Besides, there is a slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs, because the cost of searching for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan will be doubled. The fewer CPUs
the LLC domain has, the more nr_scan will be expanded. This needs further
investigation.

There is also ongoing work [2] from Abel to filter out busy CPUs during
wakeup, to further speed up the idle CPU scan. It could be a follow-up
optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
2022-06-28 09:08:30 +02:00
Christian Göttsche
700a78335f sched: only perform capability check on privileged operation
sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler()
a CAP_SYS_NICE audit event unconditionally, even when the requested
operation does not require that capability / is unprivileged, i.e. for
reducing niceness.
This is relevant in connection with SELinux, where a capability check
results in a policy decision and by default a denial message on
insufficient permission is issued.
It can lead to three undesired cases:
  1. A denial message is generated, even in case the operation was an
     unprivileged one and thus the syscall succeeded, creating noise.
  2. To avoid the noise from 1. the policy writer adds a rule to ignore
     those denial messages, hiding future syscalls, where the task
     performs an actual privileged operation, leading to hidden limited
     functionality of that task.
  3. To avoid the noise from 1. the policy writer adds a rule to allow
     the task the capability CAP_SYS_NICE, while it does not need it,
     violating the principle of least privilege.

Conduct the privileged/unprivileged categorization first and perform the
capable() test (at most once) only if needed.
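
A hedged sketch of the reordering (the classification helper and label are
illustrative, not the actual __sched_setscheduler() code):

	/*
	 * Classify the request first; only privileged requests reach the
	 * audited capability check, so e.g. lowering the priority of an
	 * own task no longer emits a CAP_SYS_NICE denial.
	 */
	if (request_is_unprivileged(p, attr))
		goto change;		/* no capable() call, no audit event */

	if (!capable(CAP_SYS_NICE))
		return -EPERM;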

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
2022-06-28 09:08:29 +02:00
Zhang Qiao
c64b551f6a sched: Remove unused function group_first_cpu()
As of commit afe06efdf0 ("sched: Extend scheduler's asym packing"),
group_first_cpu() became an unused function, so remove it.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220617181151.29980-3-zhangqiao22@huawei.com
2022-06-28 09:08:29 +02:00
Zhang Qiao
fb95a5a04d sched/fair: Remove redundant word " *"
" *" is redundant. so remove it.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220617181151.29980-2-zhangqiao22@huawei.com
2022-06-28 09:08:29 +02:00
Tianchen Ding
f3dd3f6745 sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
The wakelist can help avoid cache bouncing and offload overhead from the
waker cpu. So far, using the wakelist within the same llc only happens on
WF_ON_CPU, and this limitation could be removed to further improve
wakeup performance.

Commit 518cd62341 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on the wakelist when
the cpus share an llc. This is because, at that time, the scheduler had to
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is no longer a problem when the wakee cpu is
idle polling.

Benefits:
  Queuing the task on the idle cpu can help improve performance on the waker
  cpu and utilization on the wakee cpu, and further improve locality because
  the wakee cpu can handle its own rq. This patch helps improve rt on
  our real java workloads where wakeups happen frequently.

  Consider the normal condition (CPU0 and CPU1 share same llc)
  Before this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    resched()

  After this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    rq_lock(CPU1->rq)
                                                    enqueue_task(CPU1->rq)
                                                    resched()

  We see that CPU0 can finish its work earlier: it only needs to put the
  task on the wakelist and return, while CPU1 is idle anyway and can handle
  its own runqueue data.

This patch brings no difference in the number of IPIs.
  It only takes effect when the wakee cpu is:
  1) idle polling
  2) idle not polling

  For 1), there will be no IPI with or without this patch.

  For 2), there will always be an IPI before or after this patch.
  Before this patch: waker cpu will enqueue task and check preempt. Since
  "idle" will be sure to be preempted, waker cpu must send a resched IPI.
  After this patch: waker cpu will put the task to the wakelist of wakee
  cpu, and send an IPI.

Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                             8            6
        75.0000th:                            10            7
        90.0000th:                            11            8
        95.0000th:                            12            8
        *99.0000th:                           13           10
        99.5000th:                            15           11
        99.9000th:                            18           14

  Unixbench with full threads (104)
                                            before        after
    Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
    Double-Precision Whetstone              617119.3      617298.5   0.03%
    Execl Throughput                         27667.3       27627.3  -0.14%
    File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
    File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
    File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
    Pipe Throughput                      145535622.8   145323033.2  -0.15%
    Pipe-based Context Switching           3221686.4     3583975.4  11.25%
    Process Creation                        101347.1      103345.4   1.97%
    Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
    Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
    System Call Overhead                   5300604.8     5312213.6   0.22%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     3.246        2.251

On arm64 (Ampere Altra):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                            14           10
        75.0000th:                            19           14
        90.0000th:                            22           16
        95.0000th:                            23           16
        *99.0000th:                           24           17
        99.5000th:                            24           17
        99.9000th:                            28           25

  Unixbench with full threads (80)
                                            before        after
    Dhrystone 2 using register variables  3536194249    3537019613   0.02%
    Double-Precision Whetstone              629383.6      629431.6   0.01%
    Execl Throughput                         65920.5       65846.2  -0.11%
    File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
    File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
    File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
    Pipe Throughput                      133542875.3   131619389.8  -1.44%
    Pipe-based Context Switching           3215356.1     3576945.1  11.25%
    Process Creation                        108520.5      120184.6  10.75%
    Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
    Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
    System Call Overhead                   4429998.9     4435006.7   0.11%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     4.217        2.916

Our patch shows improvement on schbench, hackbench and the Pipe-based
Context Switching test of unixbench when idle cpus exist, and no obvious
regression on the other unixbench tests. This can help improve rt in
scenarios where wakeups happen frequently.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
2022-06-13 10:30:01 +02:00
Tianchen Ding
28156108fe sched: Fix the check of nr_running at queue wakelist
The commit 2ebb177175 ("sched/core: Offload wakee task activation if it
the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
stacking when WF_ON_CPU.

Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
(WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
the deactivate_task() in __schedule(), thus p has been accounted out of
rq->nr_running. As such, the task being the only runnable task on the rq
implies reading rq->nr_running == 0 at that point.

The benchmark result is in [1].

[1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
2022-06-13 10:30:01 +02:00
Josh Don
792b9f65a5 sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.

However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.

This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
  affined to two cores
- a large number of threads with 100% utilization are pinned to all
  other cores

Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.
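
A hedged sketch of the bail-out (the exact condition in the patch may
differ):

	static int should_we_balance(struct lb_env *env)
	{
		/* ... */

		/*
		 * A newly idle CPU should abort the balance attempt as soon as
		 * a task shows up (e.g. a pending wakeup): handling that task
		 * matters more than finishing this load_balance() pass.
		 */
		if (env->idle == CPU_NEWLY_IDLE &&
		    (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending))
			return 0;

		/* ... */
	}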

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
2022-06-13 10:30:01 +02:00
Yajun Deng
2ed81e7654 sched/deadline: Use proc_douintvec_minmax() limit minimum value
sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned
integers, but proc_dointvec() wouldn't return an error even if we set a
negative number.

Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for
sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min.

It's just an optimization to match the data and proc_handler in struct
ctl_table. The 'if (period < min || period > max)' check in
__checkparam_dl() will work fine even without this patch.
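
A hedged sketch of the resulting ctl_table entries (table and field values
are illustrative, not necessarily the exact in-tree definitions):

	static struct ctl_table sched_dl_sysctls[] = {
		{
			.procname	= "sched_deadline_period_max_us",
			.data		= &sysctl_sched_dl_period_max,
			.maxlen		= sizeof(unsigned int),
			.mode		= 0644,
			.proc_handler	= proc_douintvec_minmax,
			.extra1		= (void *)&sysctl_sched_dl_period_min,
		},
		{
			.procname	= "sched_deadline_period_min_us",
			.data		= &sysctl_sched_dl_period_min,
			.maxlen		= sizeof(unsigned int),
			.mode		= 0644,
			.proc_handler	= proc_douintvec_minmax,
			.extra2		= (void *)&sysctl_sched_dl_period_max,
		},
		{}
	};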

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev
2022-06-13 10:30:00 +02:00
Chengming Zhou
51bf903b64 sched/fair: Optimize and simplify rq leaf_cfs_rq_list
We noticed that the rq leaf_cfs_rq_list has two problems when doing bugfix
backports and some test profiling.

1. cfs_rqs under a throttled subtree could be added to the list, and
   put their fully decayed ancestors on the list, even though not needed.

2. #1 also makes the leaf_cfs_rq_list management complex and error prone;
   this is the list of related bugfixes so far:

   commit 31bc6aeaab ("sched/fair: Optimize update_blocked_averages()")
   commit fe61468b2c ("sched/fair: Fix enqueue_task_fair warning")
   commit b34cb07dde ("sched/fair: Fix enqueue_task_fair() warning some more")
   commit 39f23ce07b ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
   commit 0258bdfaff ("sched/fair: Fix unfairness caused by missing load decay")
   commit a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
   commit fdaba61ef8 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
   commit 2630cde267 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq")

commit 31bc6aeaab ("sched/fair: Optimize update_blocked_averages()")
delete every cfs_rq under throttled subtree from rq->leaf_cfs_rq_list,
and delete the throttled_hierarchy() test in update_blocked_averages(),
which optimized update_blocked_averages().

But those later bugfixes add cfs_rqs under a throttled subtree back to
rq->leaf_cfs_rq_list again, with their fully decayed ancestors, for
the integrity of rq->leaf_cfs_rq_list.

This patch takes another approach: skip all cfs_rqs under a throttled
hierarchy in list_add_leaf_cfs_rq(), to keep cfs_rqs under a throttled
subtree completely off the leaf_cfs_rq_list.

So we don't need to consider throttling-related things in
enqueue_entity(), unthrottle_cfs_rq() and enqueue_task_fair(),
which simplifies the code a lot. This also optimizes
update_blocked_averages(), since cfs_rqs under a throttled hierarchy and
their ancestors won't be on the leaf_cfs_rq_list.
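
Conceptually (a hedged sketch, not the actual diff), the list helper now
simply refuses cfs_rqs that sit under a throttled hierarchy:

	static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
	{
		/*
		 * Never put a cfs_rq under a throttled hierarchy on
		 * rq->leaf_cfs_rq_list; it will be (re)added on unthrottle.
		 */
		if (throttled_hierarchy(cfs_rq))
			return false;

		/* ... existing tmp_alone_branch list management ... */
	}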

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com
2022-06-13 10:30:00 +02:00
K Prateek Nayak
f5b2eeb499 sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a specific subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs from the local group (0,16,32,48)
would have been taken.

However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32) but after the v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
Commit: e496132ebe ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
in the NUMA domain, for processors with multiple LLCs.
(in this case threshold = 8).

For this example, the number of tasks will always be within threshold
and thus all the 8 stream threads will be woken up on the first socket
thereby resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5313    [016] d..2.   627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
stream-5313    [016] d..2.   627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
stream-5313    [016] d..2.   627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
stream-5313    [016] d..2.   627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
stream-5313    [016] d..2.   627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
stream-5313    [016] d..2.   627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
stream-5313    [016] d..2.   627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling on these same CPUs
when clearly there are CPUs on the second socket that can be used.

Following the initial pile up on a small number of CPUs, though the
load-balancer eventually kicks in, it takes a while to get to {4}{4}
and even {4}{4} isn't stable, as we observe a bunch of ping-ponging
between {4}{4} and {5}{3} and back before a stable state is reached
much later (1 Stream thread per allowed CPU) and no more migration is
required.

We can detect this piling and avoid it by checking whether the number of
allowed CPUs in the local group is smaller than the number of tasks
running in the local group, and use this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).
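
A hedged sketch of that check (the scratch-mask name is an assumption and
the surrounding find_idlest_group() logic is abridged):

	if (p->nr_cpus_allowed != NR_CPUS) {
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);

		/* CPUs of the local group the task is actually allowed to use */
		cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
		imb_numa_nr = min_t(unsigned int, cpumask_weight(cpus),
				    sd->imb_numa_nr);
	}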

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4485    [016] d..2.   230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
stream-4485    [016] d..2.   230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
stream-4485    [016] d..2.   230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
stream-4485    [016] d..2.   230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
stream-4485    [016] d..2.   230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
stream-4485    [016] d..2.   230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
stream-4485    [016] d..2.   230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

During the testing of this patch, the tip sched/core was at
commit: 089c02ae27 "ftrace: Use preemption model accessors for trace
header printout"

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

	           5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
               tip sched/core           tip sched/core            tip sched/core
                 (no pinning)                + pinning              + this-patch
								       + pinning

 Copy:   109364.74 (0.00 pct)     94220.50 (-13.84 pct)    158301.28 (44.74 pct)
Scale:   109670.26 (0.00 pct)     90210.59 (-17.74 pct)    149525.64 (36.34 pct)
  Add:   129029.01 (0.00 pct)    101906.00 (-21.02 pct)    186658.17 (44.66 pct)
Triad:   127260.05 (0.00 pct)    106051.36 (-16.66 pct)    184327.30 (44.84 pct)

Pinning currently hurts the performance compared to unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket IceLake Machine (2 x 32C/64T):

NUMA Topology of the Intel IceLake machine:
Node 1: 0,2,4,6 ... 126 (Even numbers)
Node 2: 1,3,5,7 ... 127 (Odd numbers)

Pinning is done using: numactl -C 0-15 ./stream16

	           5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
               tip sched/core           tip sched/core            tip sched/core
                 (no pinning)                 +pinning              + this-patch
								       + pinning

 Copy:    85815.31 (0.00 pct)     149819.21 (74.58 pct)    156807.48 (82.72 pct)
Scale:    64795.60 (0.00 pct)      97595.07 (50.61 pct)     99871.96 (54.13 pct)
  Add:    71340.68 (0.00 pct)     111549.10 (56.36 pct)    114598.33 (60.63 pct)
Triad:    68890.97 (0.00 pct)     111635.16 (62.04 pct)    114589.24 (66.33 pct)

In the case of the IceLake machine, with a single LLC per socket, pinning
across the two sockets reduces cache contention, thus showing great
improvement in the pinned case, which is further benefited by this patch.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
2022-06-13 10:30:00 +02:00
Mel Gorman
026b98a93b sched/numa: Adjust imb_numa_nr to a better approximation of memory channels
For a single LLC per node, a NUMA imbalance is allowed up until 25%
of CPUs sharing a node could be active. One intent of the cut-off is
to avoid an imbalance of memory channels but there is no topological
information based on active memory channels. Furthermore, there can
be differences between nodes depending on the number of populated
DIMMs.

A cut-off of 25% was arbitrary but generally worked. It does have severe
corner cases though, when a parallel workload using 25% of all available
CPUs over-saturates the memory channels. This can happen due to the initial
forking of tasks that get pulled more to one node after early wakeups
(e.g. a barrier synchronisation) that is not quickly corrected by the
load balancer. The LB may fail to act quickly as the parallel tasks are
considered to be poor migrate candidates due to locality or cache hotness.

On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
assuming all memory channels are populated and is used as the new cut-off
point. A minimum of 1 is specified to allow a communicating pair to
remain local even for CPUs with low numbers of cores. For modern AMDs,
there are multiple LLCs and are not affected.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
2022-06-13 10:30:00 +02:00
Mel Gorman
cb29a5c19d sched/numa: Apply imbalance limitations consistently
The imbalance limitations are applied inconsistently at fork time
and at runtime. At fork, a new task can remain local until there are
too many running tasks even if the degree of imbalance is larger than
NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance
figure used during load balancing is different to the one used at NUMA
placement. Load balancing uses the number of tasks that must move to
restore balance, whereas NUMA balancing uses the total imbalance.

In combination, it is possible for a parallel workload that uses a small
number of CPUs without applying scheduler policies to have very variable
run-to-run performance.
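
A minimal sketch of the direction of the fix, assuming a hypothetical
helper name: fork-time placement and the load balancer consult one shared
rule instead of two subtly different ones.

    /* Hypothetical helper; NUMA_IMBALANCE_MIN is the small imbalance the
     * scheduler already tolerates in kernel/sched/fair.c. */
    static bool numa_imbalance_allowed(unsigned int dst_running,
                                       unsigned int imbalance,
                                       unsigned int imb_numa_nr)
    {
            if (dst_running >= imb_numa_nr)         /* node busy enough already */
                    return false;
            return imbalance <= NUMA_IMBALANCE_MIN; /* tolerate tiny imbalances */
    }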

[lkp@intel.com: Fix build breakage for arc-allyesconfig]

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net
2022-06-13 10:29:59 +02:00
Mel Gorman
13ede33150 sched/numa: Do not swap tasks between nodes when spare capacity is available
If a destination node has spare capacity but there is an imbalance then
two tasks are selected for swapping. If the tasks have no numa group
or are within the same NUMA group, it's simply shuffling tasks around
without having any impact on the compute imbalance. Instead, it's just
punishing one task to help another.
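
A hedged sketch of the resulting rule (not the literal task_numa_compare()
diff; the helper and its arguments are illustrative): when the destination
has spare capacity, only keep the swap if the two tasks belong to
different NUMA groups, i.e. the swap actually changes placement.

    static bool swap_candidate_useful(bool dst_has_spare,
                                      struct numa_group *p_ng,
                                      struct numa_group *cur_ng)
    {
            if (!dst_has_spare)
                    return true;    /* no spare capacity: swap as before */

            /* with spare capacity, a swap only helps across NUMA groups */
            return p_ng && cur_ng && p_ng != cur_ng;
    }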

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net
2022-06-13 10:29:59 +02:00
Mel Gorman
70ce3ea9aa sched/numa: Initialise numa_migrate_retry
On clone, numa_migrate_retry is inherited from the parent which means
that the first NUMA placement of a task is non-deterministic. This
affects when load balancing recognises numa tasks and whether to
migrate "regular", "remote" or "all" tasks between NUMA scheduler
domains.
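
The shape of the fix is small; a hedged sketch, assuming the assignment
lands in the NUMA-balancing init path run at fork:

    /* reset instead of inheriting the parent's retry timestamp on clone */
    p->numa_migrate_retry = 0;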

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
2022-06-13 10:29:59 +02:00
Linus Torvalds
b0cb8db396 Workqueue fixes for v5.19-rc1
Tetsuo's patch to trigger build warnings if system-wide wq's are flushed
 along with a TP type update and trivial comment update.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYqUyqQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGQPtAQCQZuNFoWhCtdpjW/MWuGdY1pGGPMVl+60xwvew
 Ad8gegD/eoAsXP1XAzJ9Z1BPqr/IxncfOgGGDGHbR1Ll39qLlwE=
 =i3Sx
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:
 "Tetsuo's patch to trigger build warnings if system-wide wq's are
  flushed along with a TP type update and trivial comment update"

* tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Switch to new kerneldoc syntax for named variable macro argument
  workqueue: Fix type of cpu in trace event
  workqueue: Wrap flush_workqueue() using a macro
2022-06-12 11:16:00 -07:00
Linus Torvalds
1bc27dec7e Power management fixes for 5.19-rc2
- Fix CPUIDLE_FLAG_IRQ_ENABLE handling in intel_idle (Peter Zijlstra).
 
  - Allow all platforms to use the global poweroff handler and make
    non-syscall poweroff code paths work again (Dmitry Osipenko).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmKjkJUSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxugkQAKjDzcrVBffv4zOjijJS+ojnctRZShIw
 tWw/CUjm+VVAASSQ9XsaWFP/Eu3sA2tLOEaZddoAMCxzlwCgCVVZUgwChGmytiqr
 s4hJHyLkVAP8TszQoXvoyy5AIJI7Un33moXveGH9ZIbIAjtYckfInQJbyIsmVDF6
 VmOryqDxt+LS7hiXaAhEkExiy6HdfbmTt5f+tBO76p+eXtyj1hvkbUYrk7yX+Rv4
 6V+wDxZxy4LbhQDZQRzGYL3OBG33t9ThmVKi0Km2Aq6DXd5XIUMXZ9tYJ6+TYfe2
 scn6ZpTT87HLb8j96ZQUHU+MESvlBlpGdPGHwA2WORRqBL1sR0tsGqcfvHt5c57u
 mTrDFXaVs1vG4Zaa9JpxbrKwrNyi3xspG7EVKKtPfFHjREKhcLEh5k5F/x4ZQMY/
 U45hbEoLEWrP4Z68th7vL0T7E85tCW3+wp3FfzVkJ36H5cCy7CWAG5qpto0d3uo7
 6M3Xe+ms547R2zWWlGcdxynlhIfWK3lzYc1YZb6eoeb67m7Iyav2aehnOmko9wIO
 iNONbaB4SrwPdBi7BqaAX30G+IX/MDSu6EotI6FTJEWlJmBZPXryPBXaIgluhZWL
 8CurDLxNPFyXR3tj7BiLAYbIxGJwIqFHODS6cm6kioM8VA+WskcvLfE9+QbEXCEh
 tH+l1dzk+NYt
 =2OMR
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix an intel_idle issue introduced during the 5.16 development
  cycle and two recent regressions in the system reboot/poweroff code.

  Specifics:

   - Fix CPUIDLE_FLAG_IRQ_ENABLE handling in intel_idle (Peter Zijlstra)

   - Allow all platforms to use the global poweroff handler and make
     non-syscall poweroff code paths work again (Dmitry Osipenko)"

* tag 'pm-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
  kernel/reboot: Fix powering off using a non-syscall code paths
  kernel/reboot: Use static handler for register_platform_power_off()
2022-06-10 11:49:27 -07:00
Rafael J. Wysocki
67e59f8d01 Merge branch 'pm-sysoff'
Merge fixes for regressions introduced by the recent rework of the
system reboot/poweroff code.

* pm-sysoff:
  kernel/reboot: Fix powering off using a non-syscall code paths
  kernel/reboot: Use static handler for register_platform_power_off()
2022-06-10 20:24:10 +02:00
Linus Torvalds
f2ecc964b9 xen: branch for v5.19-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYqLLcQAKCRCAXGG7T9hj
 vggVAP0Wjf81TxvIGrwtcn1q2LW7xyNp5TQk1T6GYx40sHXWlQD/SCKl5iorSGn+
 +g1eXYmOpnSIAcMX3B3T4Ra8uo3mnA8=
 =PnB/
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - a small cleanup removing "export" of an __init function

 - a small series adding a new infrastructure for platform flags

 - a series adding generic virtio support for Xen guests (frontend side)

* tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: unexport __init-annotated xen_xlate_map_ballooned_pages()
  arm/xen: Assign xen-grant DMA ops for xen-grant DMA devices
  xen/grant-dma-ops: Retrieve the ID of backend's domain for DT devices
  xen/grant-dma-iommu: Introduce stub IOMMU driver
  dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
  xen/virtio: Enable restricted memory access using Xen grant mappings
  xen/grant-dma-ops: Add option to restrict memory access under Xen
  xen/grants: support allocating consecutive grants
  arm/xen: Introduce xen_setup_dma_ops()
  virtio: replace arch_has_restricted_virtio_memory_access()
  kernel: add platform_has() infrastructure
2022-06-10 09:57:11 -07:00
Linus Torvalds
825464e79d Networking fixes for 5.19-rc2, including fixes from bpf and netfilter.
Current release - regressions:
   - eth: amt: fix possible null-ptr-deref in amt_rcv()
 
 Previous releases - regressions:
   - tcp: use alloc_large_system_hash() to allocate table_perturb
 
   - af_unix: fix a data-race in unix_dgram_peer_wake_me()
 
   - nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling
 
   - eth: ixgbe: fix unexpected VLAN rx in promisc mode on VF
 
 Previous releases - always broken:
   - ipv6: fix signed integer overflow in __ip6_append_data
 
   - netfilter:
     - nat: really support inet nat without l3 address
     - nf_tables: memleak flow rule from commit path
 
   - bpf: fix calling global functions from BPF_PROG_TYPE_EXT programs
 
   - openvswitch: fix misuse of the cached connection on tuple changes
 
   - nfc: nfcmrvl: fix memory leak in nfcmrvl_play_deferred
 
   - eth: altera: fix refcount leak in altera_tse_mdio_create
 
 Misc:
   - add Quentin Monnet to bpftool maintainers
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmKhykgSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkN7sQAIn+ZmzQqTm5MVWnlvt/GcRGjjMP2VQY
 60oS2re8QC773yWoP6PvXqxCSFc99paDCC5BmCK6DMLbp9yuVSp5W8iAPuFuyjXE
 /Nur4Ti57LcGJ8ZpcJheBD4cRFbf+xtsGzx9a1WhUDrCYASo7vqRes5Eos2dT7P7
 qjgTduhUtaj6S1CfenfTnYqemZPzSGa+1euDuQ/Bu4mjCPUTrNZZQVYjmfTYM9p1
 UzwfCQr9TtmRKo8wLFHnYDLoWHNpfp55SNL0ShAwIQqgldiJ2OdMje+a2Sa4m6uF
 etRz8H0WrGVqfneD424tdyZv4nwhHw5dnaSrGe8DGq98c4/lIIcVyC38oDAbfWqI
 l8p7ZmtvNid7rpgoQFcxKpb2TAYAI+jaFq5GySEhvj5ZAblNQgFyghfMGPoncXCO
 XW6va8TtP2lmHFScAljQiQb6GNwDO52x77/q14Jkwvr+DILRKXMZZ3hCGrKUn5JM
 lafGkdL5ufm+E9C9RlaWN3imb2KoRj+wdThgV79efEPGG1py7yLOPVMoOCP3qmLq
 torcGcfDi1LGb7ohQxN6tCMv0JgXjS5nd1i+qJnImpkhRrUmahOfmpnElHoPuzs3
 6FU8HR77Eo15x70Jt+WOMy4oXrNh2MeEm8/Fhpj84MEhKpxVn+2o/53M+++5h+ru
 YtiLwEri0dCA
 =rdoB
 -----END PGP SIGNATURE-----

Merge tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - eth: amt: fix possible null-ptr-deref in amt_rcv()

  Previous releases - regressions:

   - tcp: use alloc_large_system_hash() to allocate table_perturb

   - af_unix: fix a data-race in unix_dgram_peer_wake_me()

   - nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling

   - eth: ixgbe: fix unexpected VLAN rx in promisc mode on VF

  Previous releases - always broken:

   - ipv6: fix signed integer overflow in __ip6_append_data

   - netfilter:
       - nat: really support inet nat without l3 address
       - nf_tables: memleak flow rule from commit path

   - bpf: fix calling global functions from BPF_PROG_TYPE_EXT programs

   - openvswitch: fix misuse of the cached connection on tuple changes

   - nfc: nfcmrvl: fix memory leak in nfcmrvl_play_deferred

   - eth: altera: fix refcount leak in altera_tse_mdio_create

  Misc:

   - add Quentin Monnet to bpftool maintainers"

* tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  net: amd-xgbe: fix clang -Wformat warning
  tcp: use alloc_large_system_hash() to allocate table_perturb
  net: dsa: realtek: rtl8365mb: fix GMII caps for ports with internal PHY
  net: dsa: mv88e6xxx: correctly report serdes link failure
  net: dsa: mv88e6xxx: fix BMSR error to be consistent with others
  net: dsa: mv88e6xxx: use BMSR_ANEGCOMPLETE bit for filling an_complete
  net: altera: Fix refcount leak in altera_tse_mdio_create
  net: openvswitch: fix misuse of the cached connection on tuple changes
  net: ethernet: mtk_eth_soc: fix misuse of mem alloc interface netdev[napi]_alloc_frag
  ip_gre: test csum_start instead of transport header
  au1000_eth: stop using virt_to_bus()
  ipv6: Fix signed integer overflow in l2tp_ip6_sendmsg
  ipv6: Fix signed integer overflow in __ip6_append_data
  nfc: nfcmrvl: Fix memory leak in nfcmrvl_play_deferred
  nfc: st21nfca: fix incorrect sizing calculations in EVT_TRANSACTION
  nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling
  nfc: st21nfca: fix incorrect validating logic in EVT_TRANSACTION
  net: ipv6: unexport __init-annotated seg6_hmac_init()
  net: xfrm: unexport __init-annotated xfrm4_protocol_init()
  net: mdio: unexport __init-annotated mdio_bus_init()
  ...
2022-06-09 12:06:52 -07:00
Linus Torvalds
34f4335c16 * Fix syzkaller NULL pointer dereference
* Fix TDP MMU performance issue with disabling dirty logging
 * Fix 5.14 regression with SVM TSC scaling
 * Fix indefinite stall on applying live patches
 * Fix unstable selftest
 * Fix memory leak from wrong copy-and-paste
 * Fix missed PV TLB flush when racing with emulation
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKglysUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOJDAgArpPcAnJbeT2VQTQcp94e4tp9k1Sf
 gmUewajco4zFVB/sldE0fIporETkaX+FYYPiaNDdNgJ2lUw/HUJBN7KoFEYTZ37N
 Xx/qXiIXQYFw1bmxTnacLzIQtD3luMCzOs/6/Q7CAFZIBpUtUEjkMlQOBuxoKeG0
 B0iLCTJSw0taWcN170aN8G6T+5+bdR3AJW1k2wkgfESfYF9NfJoTUHQj9WTMzM2R
 aBRuXvUI/rWKvQY3DfoRmgg9Ig/SirSC+abbKIs4H08vZIEUlPk3WOZSKpsN/Wzh
 3XDnVRxgnaRLx6NI/ouI2UYJCmjPKbNcueGCf5IfUcHvngHjAEG/xxe4Qw==
 =zQ9u
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - syzkaller NULL pointer dereference

 - TDP MMU performance issue with disabling dirty logging

 - 5.14 regression with SVM TSC scaling

 - indefinite stall on applying live patches

 - unstable selftest

 - memory leak from wrong copy-and-paste

 - missed PV TLB flush when racing with emulation

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  KVM: x86: do not set st->preempted when going back to user space
  KVM: SVM: fix tsc scaling cache logic
  KVM: selftests: Make hyperv_clock selftest more stable
  KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
  x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
  KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
  entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  KVM: Don't null dereference ops->destroy
2022-06-08 09:16:31 -07:00
Dmitry Osipenko
2b8c612c61 kernel/reboot: Fix powering off using a non-syscall code paths
There are other methods of powering off a machine than the reboot syscall.
Previously we failed to cover those methods, which created a power-off
regression for some machines, like the PowerPC e500.

Fix this problem by moving the legacy sys-off handler registration to
the latest phase of the power-off process and making kernel_can_power_off()
check for the presence of the legacy pm_power_off handler.
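
A hedged sketch of the second half of that idea;
have_registered_sys_off_handlers() is a hypothetical stand-in for the real
list check, but the intent is that the capability test also honours the
legacy hook:

    bool kernel_can_power_off(void)
    {
            /* any new-style sys-off handlers? (hypothetical helper) */
            if (have_registered_sys_off_handlers())
                    return true;

            /* the legacy handler is now registered only at the very last
             * step, so it has to be checked for directly here */
            return pm_power_off != NULL;
    }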

Tested-by: Michael Ellerman <mpe@ellerman.id.au> # ppce500
Reported-by: Michael Ellerman <mpe@ellerman.id.au> # ppce500
Fixes: da007f171f ("kernel/reboot: Change registration order of legacy power-off handler")
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-07 19:42:31 +02:00
Toke Høiland-Jørgensen
f858c2b2ca bpf: Fix calling global functions from BPF_PROG_TYPE_EXT programs
The verifier allows programs to call global functions as long as their
argument types match, using BTF to check the function arguments. One of the
allowed argument types to such global functions is PTR_TO_CTX; however the
check for this fails on BPF_PROG_TYPE_EXT functions because the verifier
uses the wrong type to fetch the vmlinux BTF ID for the program context
type. This failure is seen when an XDP program is loaded using
libxdp (which loads it as BPF_PROG_TYPE_EXT and attaches it to a global XDP
type program).

Fix the issue by passing in the target program type instead of the
BPF_PROG_TYPE_EXT type to bpf_prog_get_ctx() when checking function
argument compatibility.
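
A hedged one-line sketch of the fix's shape (argument plumbing elided):
resolve_prog_type() yields the attach target's program type for
BPF_PROG_TYPE_EXT programs, and that is what gets passed to the
context-type lookup instead of prog->type.

    /* for BPF_PROG_TYPE_EXT, this resolves to the extended program's type */
    enum bpf_prog_type prog_type = resolve_prog_type(prog);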

The first Fixes tag refers to the latest commit that touched the code in
question, while the second one points to the code that first introduced
the global function call verification.

v2:
- Use resolve_prog_type()

Fixes: 3363bd0cfb ("bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support")
Fixes: 51c39bb1d5 ("bpf: Introduce function-by-function verification")
Reported-by: Simon Sundberg <simon.sundberg@kau.se>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20220606075253.28422-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:41:20 -07:00
Dan Carpenter
fd58f7df24 bpf: Use safer kvmalloc_array() where possible
The kvmalloc_array() function is safer because it has a check for
integer overflows.  These sizes come from the user and I was not
able to see any bounds checking so an integer overflow seems like a
realistic concern.
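
A hedged illustration of the pattern (the variable name is illustrative,
not a verbatim quote of the multi-kprobe link code):

    /* before: the multiplication can overflow for a huge user-supplied cnt */
    addrs = kvmalloc(cnt * sizeof(*addrs), GFP_KERNEL);

    /* after: kvmalloc_array() rejects the allocation on overflow instead */
    addrs = kvmalloc_array(cnt, sizeof(*addrs), GFP_KERNEL);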

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/Yo9VRVMeHbALyjUH@kili
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:40:53 -07:00
Tetsuo Handa
c4f135d643 workqueue: Wrap flush_workqueue() using a macro
Since flush operation synchronously waits for completion, flushing
system-wide WQs (e.g. system_wq) might introduce possibility of deadlock
due to unexpected locking dependency. Tejun Heo commented at [1] that it
makes no sense at all to call flush_workqueue() on the shared WQs as the
caller has no idea what it's gonna end up waiting for.

Although there is flush_scheduled_work() which flushes system_wq WQ with
"Think twice before calling this function! It's very easy to get into
trouble if you don't take great care." warning message, syzbot found a
circular locking dependency caused by flushing system_wq WQ [2].

Therefore, let's change direction so that developers use their own local
WQs whenever flush_scheduled_work()/flush_workqueue(system_*_wq) is
unavoidable.

Steps for converting system-wide WQs into local WQs are explained at [3],
and a conversion to stop flushing system-wide WQs is in progress. Now we
want some mechanism for preventing developers who are not aware of this
conversion from starting to flush system-wide WQs again.

Since I found WARN_ON() to be a complete but awkward approach for teaching
developers about this problem, let's use __compiletime_warning() as an
incomplete but handy approach. For completeness, we will also insert
WARN_ON() into __flush_workqueue() once all in-tree users have stopped
calling flush_scheduled_work().
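
A hedged sketch of the mechanism, reconstructed from memory rather than
quoted from workqueue.h: a declaration-only stub carries
__compiletime_warning(), and the wrapper macro calls it whenever the
compiler can prove a system-wide workqueue is being flushed.

    extern void __warn_flushing_systemwide_wq(void)
            __compiletime_warning("Please avoid flushing system-wide workqueues.");

    #define flush_workqueue(wq)                                             \
    ({                                                                      \
            struct workqueue_struct *_wq = (wq);                            \
                                                                            \
            if (__builtin_constant_p(_wq == system_wq) && _wq == system_wq) \
                    __warn_flushing_systemwide_wq();                        \
            __flush_workqueue(_wq);                                         \
    })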

Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-07 07:07:14 -10:00
Seth Forshee
3e684903a8 entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
A livepatch transition may stall indefinitely when a kvm vCPU is heavily
loaded. To the host, the vCPU task is a user thread which is spending a
very long time in the ioctl(KVM_RUN) syscall. During livepatch
transition, set_notify_signal() will be called on such tasks to
interrupt the syscall so that the task can be transitioned. This
interrupts guest execution, but when xfer_to_guest_mode_work() sees that
TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
exit to user mode is unnecessary, and guest execution is resumed without
transitioning the task for the livepatch.

This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
is expected to break tasks out of interruptible kernel loops and cause
them to return to userspace. Change xfer_to_guest_mode_work() to handle
TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
loop that an exit to userspace is needed. Any pending task_work will be
run when get_signal() is called from exit_to_user_mode_loop(), so there
is no longer any need to run task work from xfer_to_guest_mode_work().
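
A hedged sketch of the behavioural core of the change in
xfer_to_guest_mode_work(); the shape is recalled from memory, not a
verbatim diff:

    /* treat notified signals like pending ones: bounce back to userspace */
    if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
            return -EINTR;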

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:19:00 -04:00
Linus Torvalds
e71e60cd74 dma-mapping fixes for Linux 5.19
- fix a regression in setting swiotlb ->force_bounce (me)
  - make dma-debug less chatty (Rob Clark)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKdoHMLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOzIw/9FHEZpid2sn1eL9mkJIwl+OlM3lDXg2VmJ4xUto7V
 5ha17ftGO5P005P7YuPHsGgtdAA6Lg8Q3LtbZ+HkUnb04ApOhnDhtqo9KqT/ylGL
 PTAzOcQPdTFJTElU9ijDCH2LlzgOWK1qcsnlnDRV8X9iCh1SFoU0ByGtjZC6e3sG
 g5cXHkdT91ONiD3vdh/puzZ43VpWIW2YUoumtmbf5WtmjEKSyfPjkks4Wb/KSPxQ
 wJdLkgzBdGzMg9Sr6bH2JUMjXnKRkWvHxa7m47ybmeC1s7HTVxlRh9lUEEC/suIo
 1+i/zsV6Tbif3ixu2Anxtk+x3aY2NwW5FDTU86wNHw5vYyO/FCaBCOBvo+d+GZB7
 xNARWh9V6r/HwTYflzgBlwUCfsnIWFZVZ+Q8xm1zVuYuoVVnzjl6yidHL1sVWf3H
 0w+aA63ivMTtsQOfzG9rEbYvtPOFU+Z84SgHqHgq7GxaxGy3xyiVXYdKznKt4QY7
 HmKdzh0XSOu6roj2pNlT3F1xlp47J3DOtFBZ2kQ3yw4bhLHGHSr4+GX2FPnxdSTC
 zrfG0ODDLfRTi/TkfjjDM2/f7HrQPWmutfBwyNwj/0vG/UT4XWSjo1HEcdghR5Hm
 J0/zD2elkNHX/ysHs52oLoKuLB1EzqG9Fmwl7RV8bNMhf4X0QtAGfLon+rc0SuuS
 P4I=
 =JdkI
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.19-2022-06-06' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix a regression in setting swiotlb ->force_bounce (me)

 - make dma-debug less chatty (Rob Clark)

* tag 'dma-mapping-5.19-2022-06-06' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix setting ->force_bounce
  dma-debug: make things less spammy under memory pressure
2022-06-06 17:56:18 -07:00
Juergen Gross
2130a790ca kernel: add platform_has() infrastructure
Add a simple infrastructure for setting, resetting and querying
platform feature flags.

Flags can be either global or architecture specific.
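
A hedged usage sketch: the helper names follow the platform_has() name in
the subject and are recalled from memory, and setup_restricted_virtio() is
a purely hypothetical caller.

    platform_set(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS);     /* set a flag */

    if (platform_has(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS)) /* query it   */
            setup_restricted_virtio();                        /* hypothetical */

    platform_clear(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS);   /* reset it   */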

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-06-06 08:06:00 +02:00
Linus Torvalds
e17fee8976 A single featurelet for delay accounting. Delayed a bit
because, unusually, it has dependencies on both the mm-stable and
 mm-nonmm-stable queues.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYpz3SAAKCRDdBJ7gKXxA
 jou6AP9bY89NifR7Tc8U59Xu4c9amphXS9rTJv7Ysj3GxBMoRwEAuXvvJTet6mEn
 UdmytDdb4BtAlx7Itd7IKu4S9JD6mQw=
 =bAU1
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull delay-accounting update from Andrew Morton:
 "A single featurette for delay accounting.

  Delayed a bit because, unusually, it had dependencies on both the
  mm-stable and mm-nonmm-stable queues"

* tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  delayacct: track delays from write-protect copy
2022-06-05 16:58:27 -07:00
Linus Torvalds
bc1e02c3e5 Fix the fallout of sysctl code move which placed the init function wrong.
-----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcc+ATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobuAD/i9gZZ1b1Qtr9vCeioy8hXkP1QltXC5
 sBK/l+LH8kWMi2eQtqhr5T8Xhca0K84Ml2gCLimnhX5PZxUJ7VTL9DVhhU0vPbFx
 7dKVAF6xU6B0gV92Ojn4ztHXM/qulqbOExESaxOXpIQ8Rh5QUqokVSeTFVzQmxve
 GykHUGK6DFi3+W1aID4oW5BsGgG1i+5Qn4HuSP35vfvS4e0K056s5ANfJdzt4+Sz
 q4APbOrB5Dgbg351vRk+ms49TariVrdWOBc4ujPAYJms+oCrDHpTbHtTAKt/n1UF
 z5AbfFkebSK7hf2nplo1Qr+QZ0vLlWchbc+IdjhPvkXrZwqbgdJKJUznieHyUEOO
 xUUYKErEAcDNuF6wkhum5hhRARuHpHOdd6qAIFGsexTN7gOIgFF3Mdq2GD4YBebF
 O86tinGdOD1hzzxzW+r+AIgZgCUILmP0+lRasEZuJMT/iTWAiU1MjrNNPL45ZX1e
 Ldie5b7rGubwKaCwCv1R//D9NI6OTalVnmCyf1NRIeb9py4bG3SgCQecHP0pZeh5
 +xmsO8rrMlkq2fOe9K6N57vWeRcvQzTMGb17wZh/fAWCa2Ny7Si52NFgRwPNExCu
 YKR2bptFSZmKe0cZ8Q6mdw7J1tCJm+4Htgw9LSgaf/jvmvHyTV261ZZTHeglsdTQ
 JDGbad9Vt6h1
 =arj3
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "Fix the fallout of sysctl code move which placed the init function
  wrong"

* tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Fix sysctl move
2022-06-05 10:42:40 -07:00
Linus Torvalds
fa11c28046 Two small perf updates:
- Make the ICL event constraints match reality
 
   - Remove an unused local variable
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcc6gTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZA7EAC6zZqvX+yJn3c5C3rQDOyCJ74Zvawc
 IG3t35ADS7IbZm0rOjSdZz89V5Gz8bzMsjYHdADQw1URajuXvVwLkCFG5oY0owyo
 XSlAYBnMOYQ0nnZ/PqKO91oieyU0z3dkP5BHpFF0ZcqD4fSZaFvwtx2kneXDJBlw
 WIQqRYgRaKBRt+FTwmcm2ow3+afO9TY0oleOSbc0dokbGhwCO9hsuP0LwyMtA374
 zGPzsHdHizvDUq+gEW5ZnmjAhrXqNnK6QOLZQukRlP4I6bnQlCkB9mtu/k6jEaAo
 xl4o0VvB0y4NoSNPBSvrIHF360nIsSMqIBV0TPv4lI1B+p4/Dr9cQTn3J+J43x+c
 SqpDs2JH22S6zhjgDPKt+hAyaKkPJan+E4dy33sbT8zlrwODGPcso6WdpdWQshQg
 WJU1HhHq6PU1vcKJEoQmagipFN70O3elufVsW5+qddJ3tmPM2ghET5Ib1E2+Odrk
 NICXW2aduogQL+9H62SOdXwRcuF6kUrO0J8I7l61eyGxr/XpOlg0vljgNgS7HxF8
 IyrrfuNsj+p5SpVmg9LRwE9+vG5xi9lpGWoYmlp2RPTzOn0OfLtYI2rto70s6i6F
 xjvIaer1ulW4OP4cvHZd6VTVUl8ZE5lRM4VqdTFgzFpR7AuBMW+PVXcF5l4ubtZp
 x2NC+i0/3sQ7JQ==
 =ZK+2
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:

  - Make the ICL event constraints match reality

  - Remove an unused local variable

* tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Remove unused local variable
  perf/x86/intel: Fix event constraints for ICL
2022-06-05 10:40:31 -07:00
Linus Torvalds
cbd76edeab Cleanups (and one fix) around struct mount handling.
The fix is the usermode_driver.c one - once you've done kern_mount(), you
 must kern_unmount(); simple mntput() will end up with a leak.  Several
 failure exits in there messed up that way...  In practice you won't
 hit those particular failure exits without fault injection, though.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYpvrWQAKCRBZ7Krx/gZQ
 6z29AP9EZVSyIvnwXleehpa2mEZhsp+KAKgV/ENaKHMn7jiH0wD/bfgnhxIDNuc5
 108E2R5RWEYTynW5k7nnP5PsTsMq5Qc=
 =b3Wc
 -----END PGP SIGNATURE-----

Merge tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull mount handling updates from Al Viro:
 "Cleanups (and one fix) around struct mount handling.

  The fix is the usermode_driver.c one - once you've done kern_mount(), you
  must kern_unmount(); simple mntput() will end up with a leak. Several
  failure exits in there messed up that way... In practice you won't hit
  those particular failure exits without fault injection, though"

* tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move mount-related externs from fs.h to mount.h
  blob_to_mnt(): kern_unmount() is needed to undo kern_mount()
  m->mnt_root->d_inode->i_sb is a weird way to spell m->mnt_sb...
  linux/mount.h: trim includes
  uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
2022-06-04 19:00:05 -07:00
Linus Torvalds
67850b7bdc While looking at the ptrace problems with PREEMPT_RT and the problems
Peter Zijlstra was encountering with ptrace in his freezer rewrite
 I identified some cleanups to ptrace_stop that make sense on their own
 and make resolving the other problems much simpler.
 
 The biggest issue is the habit of the ptrace code to change task->__state
 from the tracer to suppress TASK_WAKEKILL from waking up the tracee.  No
 other code in the kernel does that and it is straight forward to update
 signal_wake_up and friends to make that unnecessary.
 
 Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
 then it restores them by calling "wake_up_state(t, TASK_FROZEN)" relying
 on the fact that all stopped states except the special stop states can
 tolerate spurious wake up and recover their state.
 
 The state of stopped and traced tasks is changed to be stored in
 task->jobctl as well as in task->__state.  This makes it possible for
 the freezer to recover tasks in these special states, as well as
 serving as a general cleanup.  With a little more work in that
 direction I believe TASK_STOPPED can learn to tolerate spurious wake
 ups and become an ordinary stop state.
 
 The TASK_TRACED state has to remain a special state as the registers for
 a process are only reliably available when the process is stopped in
 the scheduler.  Fundamentally ptrace needs access to the saved
 register values of a task.
 
 There are a bunch of semi-random ptrace-related cleanups that were found
 while looking at these issues.
 
 One cleanup that deserves to be called out is from commit 57b6de08b5
 ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs").  This
 makes a change that is technically user space visible, in the handling
 of what happens to a tracee when a tracer dies unexpectedly.
 According to our testing and our understanding of userspace nothing
 cares that spurious SIGTRAPs can be generated in that case.
 
 The entire discussion can be found at:
   https://lkml.kernel.org/r/87a6bv6dl6.fsf_-_@email.froward.int.ebiederm.org
 
 Eric W. Biederman (11):
       signal: Rename send_signal send_signal_locked
       signal: Replace __group_send_sig_info with send_signal_locked
       ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
       ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
       ptrace: Remove arch_ptrace_attach
       signal: Use lockdep_assert_held instead of assert_spin_locked
       ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
       ptrace: Document that wait_task_inactive can't fail
       ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
       ptrace: Don't change __state
       ptrace: Always take siglock in ptrace_resume
 
 Peter Zijlstra (1):
       sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
 
  arch/ia64/include/asm/ptrace.h    |   4 --
  arch/ia64/kernel/ptrace.c         |  57 ----------------
  arch/um/include/asm/thread_info.h |   2 +
  arch/um/kernel/exec.c             |   2 +-
  arch/um/kernel/process.c          |   2 +-
  arch/um/kernel/ptrace.c           |   8 +--
  arch/um/kernel/signal.c           |   4 +-
  arch/x86/kernel/step.c            |   3 +-
  arch/xtensa/kernel/ptrace.c       |   4 +-
  arch/xtensa/kernel/signal.c       |   4 +-
  drivers/tty/tty_jobctrl.c         |   4 +-
  include/linux/ptrace.h            |   7 --
  include/linux/sched.h             |  10 ++-
  include/linux/sched/jobctl.h      |   8 +++
  include/linux/sched/signal.h      |  20 ++++--
  include/linux/signal.h            |   3 +-
  kernel/ptrace.c                   |  87 ++++++++---------------
  kernel/sched/core.c               |   5 +-
  kernel/signal.c                   | 140 +++++++++++++++++---------------------
  kernel/time/posix-cpu-timers.c    |   6 +-
  20 files changed, 140 insertions(+), 240 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaXaYACgkQC/v6Eiaj
 j0CgoA/+JncSQ6PY2D5Jh1apvHzmnRsFXzr3DRvtv/CVx4oIebOXRQFyVDeD5tRn
 TmMgB29HpBlHRDLojlmlZRGAld1HR/aPEW9j8W1D3Sy/ZFO5L8lQitv9aDHO9Ntw
 4lZvlhS1M0KhATudVVBqSPixiG6CnV5SsGmixqdOyg7xcXSY6G1l2nB7Zk9I3Tat
 ZlmhuZ6R5Z5qsm4MEq0vUSrnsHiGxYrpk6uQOaVz8Wkv8ZFmbutt6XgxF0tsyZNn
 mHSmWSiZzIgBjTlaibEmxi8urYJTPj3vGBeJQVYHblFwLFi6+Oy7bDxQbWjQvaZh
 DsgWPScfBF4Jm0+8hhCiSYpvPp8XnZuklb4LNCeok/VFr+KfSmpJTIhn00kagQ1u
 vxQDqLws8YLW4qsfGydfx9uUIFCbQE/V2VDYk5J3Re3gkUNDOOR1A56hPniKv6VB
 2aqGO2Fl0RdBbUa3JF+XI5Pwq5y1WrqR93EUvj+5+u5W9rZL/8WLBHBMEz6gbmfD
 DhwFE0y8TG2WRlWJVEDRId+5zo3di/YvasH0vJZ5HbrxhS2RE/yIGAd+kKGx/lZO
 qWDJC7IHvFJ7Mw5KugacyF0SHeNdloyBM7KZW6HeXmgKn9IMJBpmwib92uUkRZJx
 D8j/bHHqD/zsgQ39nO+c4M0MmhO/DsPLG/dnGKrRCu7v1tmEnkY=
 =ZUuO
 -----END PGP SIGNATURE-----

Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ptrace_stop cleanups from Eric Biederman:
 "While looking at the ptrace problems with PREEMPT_RT and the problems
  Peter Zijlstra was encountering with ptrace in his freezer rewrite I
  identified some cleanups to ptrace_stop that make sense on their own
  and make resolving the other problems much simpler.

  The biggest issue is the habit of the ptrace code to change
  task->__state from the tracer to suppress TASK_WAKEKILL from waking up
  the tracee. No other code in the kernel does that and it is straight
  forward to update signal_wake_up and friends to make that unnecessary.

  Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
  then it restores them by calling "wake_up_state(t, TASK_FROZEN)" relying
  on the fact that all stopped states except the special stop states can
  tolerate spurious wake up and recover their state.

  The state of stopped and traced tasks is changed to be stored in
  task->jobctl as well as in task->__state. This makes it possible for
  the freezer to recover tasks in these special states, as well as
  serving as a general cleanup. With a little more work in that
  direction I believe TASK_STOPPED can learn to tolerate spurious wake
  ups and become an ordinary stop state.

  The TASK_TRACED state has to remain a special state as the registers
  for a process are only reliably available when the process is stopped
  in the scheduler. Fundamentally ptrace needs access to the saved
  register values of a task.

  There are a bunch of semi-random ptrace-related cleanups that were found
  while looking at these issues.

  One cleanup that deserves to be called out is from commit 57b6de08b5
  ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
  makes a change that is technically user space visible, in the handling
  of what happens to a tracee when a tracer dies unexpectedly. According
  to our testing and our understanding of userspace nothing cares that
  spurious SIGTRAPs can be generated in that case"

* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  ptrace: Always take siglock in ptrace_resume
  ptrace: Don't change __state
  ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
  ptrace: Document that wait_task_inactive can't fail
  ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
  signal: Use lockdep_assert_held instead of assert_spin_locked
  ptrace: Remove arch_ptrace_attach
  ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
  ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
  signal: Replace __group_send_sig_info with send_signal_locked
  signal: Rename send_signal send_signal_locked
2022-06-03 16:13:25 -07:00
Linus Torvalds
1ec6574a3c This set of changes updates init and user mode helper tasks to be
ordinary user mode tasks.
 
 Commit 40966e316f ("kthread: Ensure struct kthread is present for
 all kthreads") caused init and the user mode helper threads that call
 kernel_execve to have struct kthread allocated for them.  This struct
 kthread going away during execve in turn made a use after free of
 struct kthread possible.
 
 The commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
 init and umh") is enough to fix the use after free and is simple enough
 to be backportable.
 
 The rest of the changes pass struct kernel_clone_args to clean things
 up and cause the code to make sense.
 
 In making init and the user mode helper tasks purely user mode tasks
 I ran into two complications.  The function task_tick_numa was
 detecting tasks without an mm by testing for the presence of
 PF_KTHREAD.  The initramfs code in populate_initrd_image was using
 flush_delayed_fput to ensure the closing of all its file descriptors
 was complete, and flush_delayed_fput does not work in a userspace thread.
 
 I have looked and looked for more complications and in my code review
 I have not found any, and neither has anyone else with the code sitting
 in linux-next.
 
 Link: https://lkml.kernel.org/r/87mtfu4up3.fsf@email.froward.int.ebiederm.org
 
 Eric W. Biederman (8):
       kthread: Don't allocate kthread_struct for init and umh
       fork: Pass struct kernel_clone_args into copy_thread
       fork: Explicity test for idle tasks in copy_thread
       fork: Generalize PF_IO_WORKER handling
       init: Deal with the init process being a user mode process
       fork: Explicitly set PF_KTHREAD
       fork: Stop allowing kthreads to call execve
       sched: Update task_tick_numa to ignore tasks without an mm
 
  arch/alpha/kernel/process.c      | 13 ++++++------
  arch/arc/kernel/process.c        | 13 ++++++------
  arch/arm/kernel/process.c        | 12 ++++++-----
  arch/arm64/kernel/process.c      | 12 ++++++-----
  arch/csky/kernel/process.c       | 15 ++++++-------
  arch/h8300/kernel/process.c      | 10 ++++-----
  arch/hexagon/kernel/process.c    | 12 ++++++-----
  arch/ia64/kernel/process.c       | 15 +++++++------
  arch/m68k/kernel/process.c       | 12 ++++++-----
  arch/microblaze/kernel/process.c | 12 ++++++-----
  arch/mips/kernel/process.c       | 13 ++++++------
  arch/nios2/kernel/process.c      | 12 ++++++-----
  arch/openrisc/kernel/process.c   | 12 ++++++-----
  arch/parisc/kernel/process.c     | 18 +++++++++-------
  arch/powerpc/kernel/process.c    | 15 +++++++------
  arch/riscv/kernel/process.c      | 12 ++++++-----
  arch/s390/kernel/process.c       | 12 ++++++-----
  arch/sh/kernel/process_32.c      | 12 ++++++-----
  arch/sparc/kernel/process_32.c   | 12 ++++++-----
  arch/sparc/kernel/process_64.c   | 12 ++++++-----
  arch/um/kernel/process.c         | 15 +++++++------
  arch/x86/include/asm/fpu/sched.h |  2 +-
  arch/x86/include/asm/switch_to.h |  8 +++----
  arch/x86/kernel/fpu/core.c       |  4 ++--
  arch/x86/kernel/process.c        | 18 +++++++++-------
  arch/xtensa/kernel/process.c     | 17 ++++++++-------
  fs/exec.c                        |  8 ++++---
  include/linux/sched/task.h       |  8 +++++--
  init/initramfs.c                 |  2 ++
  init/main.c                      |  2 +-
  kernel/fork.c                    | 46 +++++++++++++++++++++++++++++++++-------
  kernel/sched/fair.c              |  2 +-
  kernel/umh.c                     |  6 +++---
  33 files changed, 234 insertions(+), 160 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaR/MACgkQC/v6Eiaj
 j0Aayg/7Bx66872d9c6igkJ+MPCTuh+v9QKCGwiYEmiU4Q5sVAFB0HPJO27qC14u
 630X0RFNZTkPzNNEJNIW4kw6Dj8s8YRKf+FgQAVt4SzdRwT7eIPDjk1nGraopPJ3
 O04pjvuTmUyidyViRyFcf2ptx/pnkrwP8jUSc+bGTgfASAKAgAokqKE5ecjewbBc
 Y/EAkQ6QW7KxPjeSmpAHwI+t3BpBev9WEC4PbhRhsBCQFO2+PJiklvqdhVNBnIjv
 qUezll/1xv9UYgniB15Q4Nb722SmnWSU3r8as1eFPugzTHizKhufrrpyP+KMK1A0
 tdtEJNs5t2DZF7ZbGTFSPqJWmyTYLrghZdO+lOmnaSjHxK4Nda1d4NzbefJ0u+FE
 tutewowvHtBX6AFIbx+H3O+DOJM2IgNMf+ReQDU/TyNyVf3wBrTbsr9cLxypIJIp
 zze8npoLMlB7B4yxVo5ES5e63EXfi3iHl0L3/1EhoGwriRz1kWgVLUX/VZOUpscL
 RkJHsW6bT8sqxPWAA5kyWjEN+wNR2PxbXi8OE4arT0uJrEBMUgDCzydzOv5tJB00
 mSQdytxH9LVdsmxBKAOBp5X6WOLGA4yb1cZ6E/mEhlqXMpBDF1DaMfwbWqxSYi4q
 sp5zU3SBAW0qceiZSsWZXInfbjrcQXNV/DkDRDO9OmzEZP4m1j0=
 =x6fy
 -----END PGP SIGNATURE-----

Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull kthread updates from Eric Biederman:
 "This updates init and user mode helper tasks to be ordinary user mode
  tasks.

  Commit 40966e316f ("kthread: Ensure struct kthread is present for
  all kthreads") caused init and the user mode helper threads that call
  kernel_execve to have struct kthread allocated for them. This struct
  kthread going away during execve in turn made a use after free of
  struct kthread possible.

  Here, commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
  init and umh") is enough to fix the use after free and is simple
  enough to be backportable.

  The rest of the changes pass struct kernel_clone_args to clean things
  up and cause the code to make sense.

  In making init and the user mode helper tasks purely user mode tasks
  I ran into two complications. The function task_tick_numa was
  detecting tasks without an mm by testing for the presence of
  PF_KTHREAD. The initramfs code in populate_initrd_image was using
  flush_delayed_fput to ensure the closing of all its file descriptors
  was complete, and flush_delayed_fput does not work in a userspace
  thread.

  I have looked and looked for more complications and in my code review
  I have not found any, and neither has anyone else with the code
  sitting in linux-next"

* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched: Update task_tick_numa to ignore tasks without an mm
  fork: Stop allowing kthreads to call execve
  fork: Explicitly set PF_KTHREAD
  init: Deal with the init process being a user mode process
  fork: Generalize PF_IO_WORKER handling
  fork: Explicity test for idle tasks in copy_thread
  fork: Pass struct kernel_clone_args into copy_thread
  kthread: Don't allocate kthread_struct for init and umh
2022-06-03 16:03:05 -07:00
Linus Torvalds
21873bd66b arm64 fixes for 5.19-rc1:
- Initialise jump labels before setup_machine_fdt(), needed by commit
   f5bda35fba ("random: use static branch for crng_ready()").
 
 - Sparse warnings: missing prototype, incorrect __user annotation.
 
 - Skip SVE kselftest if not sufficient vector lengths supported.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmKaQ/8ACgkQa9axLQDI
 XvE+eRAAjVQr4ehVgT75SIYq9cv/6A8cAlYzK4hi7cwhFLTii/9flZLmqTfjUwbY
 Ifnm1/c/ka11vAC4JfzhEcAjIm5jJ7Wlw4XqMhGC2woTjmCrDxSlG1qrdt1zZlgI
 H7RXCjCpfstCWFoZy4uDaaWZYgN6c91/IPuuamBUfADXu5NbeD9yfZALYUhiE9FP
 yZ61BLDP0f8l0x5PMmDATidZ/vtIp+Hfiq1iezX0VugIY4+v6nbi6/Ba20Wc419v
 cmDgjRc1ODaWnuWLUTiriUlt2VEtBs3LrGXa81WSrTKR9+DXih4Gwj4orpW7Rqx3
 CFeSgVCDd5V9/XGEExN5GtMtiolDWNG8dA58jwwaWHWgp8AveSGY8AxPT+8/30/S
 FQyVGjxrX1JUkIya3lbVuDLKHdzCKEX0O83dFQSm3bP6lUMZUCYCcFNpVpSYPsQr
 ZZea9dNWhwiI82z19p7igfWKwerBMdUeptxMm247jKmdz4H/i6xGtK6nJc731zCZ
 kjxoI6HCYAMH/r9kgWYldgeov/UNQXi5Q2ftsXkUepPPjpEd5YBJfkL6CzKgQ6u5
 wtOTI8fainjPVWPGbcpHnEjlwG5UMY80jcJpfkD54g9N5WxunCWf7L3OZj77toTp
 Z2kU2b189Di4bNOV9bCKFqrecqajNychMhXcc5anMfdN+284mdg=
 =25eW
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:
 "Most of issues addressed were introduced during this merging window.

   - Initialise jump labels before setup_machine_fdt(), needed by commit
     f5bda35fba ("random: use static branch for crng_ready()").

   - Sparse warnings: missing prototype, incorrect __user annotation.

   - Skip SVE kselftest if not sufficient vector lengths supported"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  kselftest/arm64: signal: Skip SVE signal test if not enough VLs supported
  arm64: Initialize jump labels before setup_machine_fdt()
  arm64: hibernate: Fix syntax errors in comments
  arm64: Remove the __user annotation for the restore_za_context() argument
  ftrace/fgraph: fix increased missing-prototypes warnings
2022-06-03 14:05:34 -07:00
Linus Torvalds
58f9d52ff6 Networking fixes for 5.19-rc1, including fixes from bpf, and netfilter.
Current release - new code bugs:
 
  - af_packet: make sure to pull the MAC header, avoid skb panic in GSO
 
  - ptp_clockmatrix: fix inverted logic in is_single_shot()
 
  - netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag
 
  - dt-bindings: net: adin: fix adi,phy-output-clock description syntax
 
  - wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning
 
 Previous releases - regressions:
 
  - Revert "net: af_key: add check for pfkey_broadcast in function
    pfkey_process"
 
  - tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd
 
  - nf_tables: disallow non-stateful expression in sets earlier
 
  - nft_limit: clone packet limits' cost value
 
  - nf_tables: double hook unregistration in netns path
 
  - ping6: fix ping -6 with interface name
 
 Previous releases - always broken:
 
  - sched: fix memory barriers to prevent skbs from getting stuck
    in lockless qdiscs
 
  - neigh: set lower cap for neigh_managed_work rearming, avoid
    constantly scheduling the probe work
 
  - bpf: fix probe read error on big endian in ___bpf_prog_run()
 
  - amt: memory leak and error handling fixes
 
 Misc:
 
  - ipv6: expand & rename accept_unsolicited_na to accept_untracked_na
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKY9lMACgkQMUZtbf5S
 IrtNvA//WCpG53NwSy8aV2X/0vkprVEuO8EQeIYhaw1R4KlVcqrITQcaLqkq/xL/
 RUq6F/plftMSiuGRhTp/Sgbl0o0XgJkf769m4zQxz9NqWqgcw5kwJPu4Xq1nSM9t
 /2qAFNDnXShxRiSYrI0qxQrmd0OUjtsibxsKRTSrrlvcd6zYfrx/7+QK5qpLMF9E
 zJpBSYQm2R0RLGRith99G8w3WauhlprPaxyQ71ogQtBhTF+Eg7K+xEm2D5DKtyvj
 7CLyrQtR0jyDBAt2ZPCh5D/yVPkNI1rigQ8m4uiW9DE6mk1DsxxY+DIOt5vQPBdR
 x9Pq0qG54KS5sP18ABeNRQn4NWdkhVf/CcPkaRxHJdRs13mpQUATJRpZ3Ytd9Nt0
 HW6Kby+zY6bdpUX8+UYdhcG6wbt0Lw8B+bSCjiqfE/CBbfUFA3L9/q/5Hk8Xbnxn
 lCIk4asxQgpNhcZ+PAkZfFgE0GNDKnXDu1thO+q7/N9srZrrh9WQW5qoq5lexo8V
 c01jRbPTKa64Gbvm+xDDGEwSl2uIRITtea284bL3q6lnI50n50dlLOAW0z5tmbEg
 X9OHae5bMAdtvS5A1ForJaWA/Mj35ZqtGG5oj0WcGcLupVyec3rgaYaJtNvwgoDx
 ptCQVIMLTAHXtZMohm0YrBizg0qbqmCd2c0/LB+3odX328YStJU=
 =bWkn
 -----END PGP SIGNATURE-----

Merge tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Current release - new code bugs:

   - af_packet: make sure to pull the MAC header, avoid skb panic in GSO

   - ptp_clockmatrix: fix inverted logic in is_single_shot()

   - netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag

   - dt-bindings: net: adin: fix adi,phy-output-clock description syntax

   - wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning

  Previous releases - regressions:

   - Revert "net: af_key: add check for pfkey_broadcast in function
     pfkey_process"

   - tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd

   - nf_tables: disallow non-stateful expression in sets earlier

   - nft_limit: clone packet limits' cost value

   - nf_tables: double hook unregistration in netns path

   - ping6: fix ping -6 with interface name

  Previous releases - always broken:

   - sched: fix memory barriers to prevent skbs from getting stuck in
     lockless qdiscs

   - neigh: set lower cap for neigh_managed_work rearming, avoid
     constantly scheduling the probe work

   - bpf: fix probe read error on big endian in ___bpf_prog_run()

   - amt: memory leak and error handling fixes

  Misc:

   - ipv6: expand & rename accept_unsolicited_na to accept_untracked_na"

* tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits)
  net/af_packet: make sure to pull mac header
  net: add debug info to __skb_pull()
  net: CONFIG_DEBUG_NET depends on CONFIG_NET
  stmmac: intel: Add RPL-P PCI ID
  net: stmmac: use dev_err_probe() for reporting mdio bus registration failure
  tipc: check attribute length for bearer name
  ice: fix access-beyond-end in the switch code
  nfp: remove padding in nfp_nfdk_tx_desc
  ax25: Fix ax25 session cleanup problems
  net: usb: qmi_wwan: Add support for Cinterion MV31 with new baseline
  sfc/siena: fix wrong tx channel offset with efx_separate_tx_channels
  sfc/siena: fix considering that all channels have TX queues
  socket: Don't use u8 type in uapi socket.h
  net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()
  net: ping6: Fix ping -6 with interface name
  macsec: fix UAF bug for real_dev
  octeontx2-af: fix error code in is_valid_offset()
  wifi: mac80211: fix use-after-free in chanctx code
  bonding: guard ns_targets by CONFIG_IPV6
  tcp: tcp_rtx_synack() can be called from process context
  ...
2022-06-02 12:50:16 -07:00
Saravana Kannan
73503963b7 module: Fix prefix for module.sig_enforce module param
Commit cfc1d27789 ("module: Move all into module/") changed the prefix
of the module param by moving/renaming files.  A later commit also moves
the module_param() into a different file, thereby changing the prefix
yet again.

This would break kernel cmdline compatibility and also userspace
compatibility at /sys/module/module/parameters/sig_enforce.

So, set the prefix back to "module.".
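
A hedged sketch of how such a fix typically looks (the exact placement
inside kernel/module/ is recalled from memory): pin the prefix explicitly
before declaring the parameter so it stays "module." regardless of which
object file hosts it.

    /* keep /sys/module/module/parameters/sig_enforce and the
     * "module.sig_enforce=" cmdline spelling stable */
    #undef MODULE_PARAM_PREFIX
    #define MODULE_PARAM_PREFIX "module."

    module_param(sig_enforce, bool_enable_only, 0644);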

Fixes: cfc1d27789 ("module: Move all into module/")
Link: https://lore.kernel.org/lkml/20220602034111.4163292-1-saravanak@google.com/
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-02 12:44:33 -07:00
Dmitry Osipenko
587b9bfe06 kernel/reboot: Use static handler for register_platform_power_off()
register_platform_power_off() fails on the m68k platform due to a memory
allocation error that happens at very early boot time, when the memory
allocator isn't available yet. Fix it by using a static sys-off
handler for the platform-level power-off handlers.
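
A hedged sketch of the idea only (the object name is illustrative, not the
literal kernel/reboot.c diff): back the platform-level registration with a
statically allocated handler so nothing needs kzalloc() before the
allocator is up.

    /* filled in and handed out by register_platform_power_off()
     * instead of a kzalloc()'d object */
    static struct sys_off_handler platform_sys_off_handler;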

Fixes: f0f7e5265b ("m68k: Switch to new sys-off handler API")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-02 20:36:13 +02:00
Linus Torvalds
7c9e960c63 Livepatching changes for 5.19
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmKYlREACgkQUqAMR0iA
 lPLHUQ/9EdD1kQsxuvZRTBAr79WmAI9vS41R3owXRUhmpFf0pSjIkD23gRni1Y+a
 f2NVxQY8y9mLUEp/YXeDUX5lqdCDD2iJtin+ZlUGv8+F7rRlMyIEDoudSUKsTRlF
 ufkzKE00tqUkz7J/U1KEbD2u/nWFS5q8DyHc81pC0bHpkiBT8y+wgEtIQy1oGVqV
 OilcHPXpQUV/sw1RsaRGdwJAgJSJcoHk57JelWWlV5fo0ogK1gjIl/kUadAChNzT
 2o8zoWu6fQlqCAI1AJkKLngdcybbXPwWKclcGUze1sfR9+fGGRzqorwAupm5bmkp
 1oZsODqxDdVfxz2/VMtdjjacm3ECJmznqBumdMrdM3WVjh944xkb7xVOp/1xJWfc
 wrmQy4dshXa4OTnvAivBYlgbaUzld5HPQD/v89KBLP3SJkd6p9PVT+qa87hjJ2uH
 sRDjtZxTkcKKoYU8CnASFBkaOHYHOWbRHbEjCkIgo2nHXE4W3fOnaQKBF3jUiZJr
 OimQMWjAqtXh8Gb0B7IIPxuGFjAjKepTEdN+Jwk8/eqXNTFqgGPbu1hOf+zpq/oj
 ekMIy3CUIsempqZfdrJlMOAGIEsx5JGBrdC5KwAEsHymW1XNJVrkpLtTFrtsF1tW
 X6qWzhkNMkOv9/0yoR2GmXiBR10pcXGMsd+EvNewpTIDCcjkYxI=
 =X4cs
 -----END PGP SIGNATURE-----

Merge tag 'livepatching-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching cleanup from Petr Mladek:

 - Remove duplicated livepatch code [Christophe]

* tag 'livepatching-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Remove klp_arch_set_pc() and asm/livepatch.h
2022-06-02 08:55:01 -07:00
Linus Torvalds
12831f6486 printk fixup for 5.19
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmKYkm0ACgkQUqAMR0iA
 lPJMnA/9F/YWXU9DRFW3kuQlBmm4fLf21VbPvIFDRE7obSKur9p0LbKG+nA2b75s
 u65nvILVY1iQKsp+DxW7xgS0pQZbUgPd4pisBCsKOZ7UMjTOJrms5vB/xBTE41Ak
 x3KLL1E0BTOOVM8ehcPgYE2nsS43LUxasi2Uh2ja9uT3oxHoR4o4znGtRi6v56rX
 rdh6sQhOHBqikLtZxiMSoHIrnwiBwSGzc+LjvrbGjri20pogqPI/Vc+lRplmsI2M
 GMAoTuIk56nR398jE18OpaRvIzjMQ4j8hh78QyrzhuSe5jRDzVOuLa4khyo/RP4B
 2H0ayundeUlsDRAcsBJZc/e4tel/nQvAppYCbtWUVTZTPo+jcclXJK/N5VKdaPUL
 8vcYb18t1xUWjucglSbxhnd7sM6eAP8rwB/2iJajrzctASCNK+mD2daa52mGOBXB
 SUJS5839y5LdXbx+wR8BvbStHzJDJjRYHSisNDgV0e5JSQAnqVUYhUJCj+aTICua
 CJXjqm68cQzJ68ZZjbAdZf1FWlJMearVwzQzTrOU4CgbWZIt+K7TaE3ur3MpA8M7
 OaY6s2t6nUS4ZcMWB/vtExod0aCwv8yq2GUN5NeGcYAn9wy4v9Zp9tJoqSSSoNnS
 iETRDSqT4fRvK27WVgWyaWay6UyWaa1jy9AYaVVoztof7pBIdTE=
 =UYhA
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fixup from Petr Mladek:

 - Revert inappropriate use of wake_up_interruptible_all() in printk()

* tag 'printk-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  Revert "printk: wake up all waiters"
2022-06-02 08:49:54 -07:00
Christoph Hellwig
e15db62bc5 swiotlb: fix setting ->force_bounce
The swiotlb_init refactor messed up assigning ->force_bounce by doing
it in different places based on what caused the setting of the flag.

Fix this by passing the SWIOTLB_* flags to swiotlb_init_io_tlb_mem
and just setting it there.
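
A hedged one-line sketch of the intended shape (flag handling abridged and
recalled from memory): swiotlb_init_io_tlb_mem() receives the SWIOTLB_*
flags and derives the field in exactly one place.

    /* inside swiotlb_init_io_tlb_mem(), common to every init path */
    mem->force_bounce = !!(flags & SWIOTLB_FORCE);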

Fixes: c6af2aa9ff ("swiotlb: make the swiotlb_init interface more useful")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
2022-06-02 07:17:59 +02:00
Rob Clark
e19f8fa6ce dma-debug: make things less spammy under memory pressure
Limit the error message to avoid flooding the console.  If you have a lot
of threads hitting this at once, they could have already gotten past the
dma_debug_disabled() check before they get to the point of allocation
failure, resulting in quite a lot of this error message spamming the
log.  Use pr_err_once() to limit that.
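
A hedged illustration of the pattern; the message string is recalled from
memory rather than quoted:

    /* before: every thread that loses the race prints its own copy */
    pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");

    /* after: log it once no matter how many threads hit the failure */
    pr_err_once("cacheline tracking ENOMEM, dma-debug disabled\n");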

Signed-off-by: Rob Clark <robdclark@chromium.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-02 07:17:59 +02:00