Commit graph

3942 commits

Author SHA1 Message Date
Hou Tao
5771e306b6 bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place
Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place,
and add a new percpu argument for the helper to support in-place update
for both per-cpu htab and htab of maps.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:53 -07:00
Hou Tao
ba2b31b0f3 bpf: Factor out htab_elem_value helper()
All hash maps store map key and map value together. The relative offset
of the map value compared to the map key is round_up(key_size, 8).
Therefore, factor out a common helper htab_elem_value() to calculate the
address of the map value instead of duplicating the logic.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:53 -07:00
NeilBrown
fa6fe07d15
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
The lookup_one_len family of functions is (now) only used internally by
a filesystem on itself either
- in a context where permission checking is irrelevant such as by a
  virtual filesystem populating itself, or xfs accessing its ORPHANAGE
  or dquota accessing the quota file; or
- in a context where a permission check (MAY_EXEC on the parent) has just
  been performed such as a network filesystem finding in "silly-rename"
  file in the same directory.  This is also the context after the
  _parentat() functions where currently lookup_one_qstr_excl() is used.

So the permission check is pointless.

The name "one_len" is unhelpful in understanding the purpose of these
functions and should be changed.  Most of the callers pass the len as
"strlen()" so using a qstr and QSTR() can simplify the code.

This patch renames these functions (include lookup_positive_unlocked()
which is part of the family despite the name) to have a name based on
"lookup_noperm".  They are changed to receive a 'struct qstr' instead
of separate name and len.  In a few cases the use of QSTR() results in a
new call to strlen().

try_lookup_noperm() takes a pointer to a qstr instead of the whole
qstr.  This is consistent with d_hash_and_lookup() (which is nearly
identical) and useful for lookup_noperm_unlocked().

The new lookup_noperm_common() doesn't take a qstr yet.  That will be
tidied up in a subsequent patch.

Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 11:24:36 +02:00
Linus Torvalds
aa918db707 bpf_try_alloc_pages
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfkHCQACgkQ6rmadz2v
 bTrzWhAAnDcJsGgSJ9EbElpTfgBWE7aijXo/MsPxxRhORc0uR6MnhPx1iADP4KYj
 lTGEIBgRuDG3qaM4EXpPd32rUJJHv8hot7z9zfvUgSuFNLZEHWXJtz/i4ileOxin
 08zV+zA5WL2fqamAmMRFMI37DeSWy3xU0/qlbWgNnURjPjRri6CF4rVFUWq+QMY+
 XP8ITD/6nOLUR6Bq2M18aHnk2VJWkxVP9Oi+vz1VHbOjKaJC7ATa1+Q4qMqWyTb1
 8IAYWiZR1ZPc214ITaspVzLoLb/wxHxy3QMrdAWAL6sjp0B4J8YxIq1qsBuR1FN7
 TxTRQND/+LjqrAgs5AmFqz3ndKmahjGQWnQEh/rDYJtx+sLJk9hfsMIDF8Wmxuwl
 RftdV0g9bPljR5Qgc9i8DNtEjoAbNjoP8xLjt9HfQakVl8V9jPe0bxZ5tJDf+T0M
 n/VgEjaRzdXqFOLal6Z5wl/jkIn1l1kWQuCMI2z5Z0Ls+PlYX56xdZxfK2Rh3m+e
 3W89vqj9ytJ3rZKG8DRsxukuHwnJ+Gia3XI2h/5cc8kEM5ss1Ase8oIkmrwaLd9x
 +zVXNoDCCPRQgTStwItW+2YdFmE9uijhEZh9yPwT1/rtFuKd0oSebVIpjih/bGqH
 mMN9gYO4+ArSbqku9X2lP3VjMOf6M6SZGm+PzG25PAMGzjqGqwk=
 =AHTr
 -----END PGP SIGNATURE-----

Merge tag 'bpf_try_alloc_pages' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf try_alloc_pages() support from Alexei Starovoitov:
 "The pull includes work from Sebastian, Vlastimil and myself with a lot
  of help from Michal and Shakeel.

  This is a first step towards making kmalloc reentrant to get rid of
  slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc. These patches
  make page allocator safe from any context.

  Vlastimil kicked off this effort at LSFMM 2024:

    https://lwn.net/Articles/974138/

  and we continued at LSFMM 2025:

    https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/

  Why:

  SLAB wrappers bind memory to a particular subsystem making it
  unavailable to the rest of the kernel. Some BPF maps in production
  consume Gbytes of preallocated memory. Top 5 in Meta: 1.5G, 1.2G,
  1.1G, 300M, 200M. Once we have kmalloc that works in any context BPF
  map preallocation won't be necessary.

  How:

  Synchronous kmalloc/page alloc stack has multiple stages going from
  fast to slow: cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages ->
  rmqueue_pcplist -> __rmqueue, where rmqueue_pcplist was already
  relying on trylock.

  This set changes rmqueue_bulk/rmqueue_buddy to attempt a trylock and
  return ENOMEM if alloc_flags & ALLOC_TRYLOCK. It then wraps this
  functionality into try_alloc_pages() helper. We make sure that the
  logic is sane in PREEMPT_RT.

  End result: try_alloc_pages()/free_pages_nolock() are safe to call
  from any context.

  try_kmalloc() for any context with similar trylock approach will
  follow. It will use try_alloc_pages() when slab needs a new page.
  Though such try_kmalloc/page_alloc() is an opportunistic allocator,
  this design ensures that the probability of successful allocation of
  small objects (up to one page in size) is high.

  Even before we have try_kmalloc(), we already use try_alloc_pages() in
  BPF arena implementation and it's going to be used more extensively in
  BPF"

* tag 'bpf_try_alloc_pages' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  mm: Fix the flipped condition in gfpflags_allow_spinning()
  bpf: Use try_alloc_pages() to allocate pages for bpf needs.
  mm, bpf: Use memcg in try_alloc_pages().
  memcg: Use trylock to access memcg stock_lock.
  mm, bpf: Introduce free_pages_nolock()
  mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  locking/local_lock: Introduce localtry_lock_t
2025-03-30 13:45:28 -07:00
Linus Torvalds
494e7fe591 bpf_res_spin_lock
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfcq3kACgkQ6rmadz2v
 bToxkw/8DHIqjVnzU2O9hbRM1anYo6yM8e34IxCt0ajHTSEVJ93+C161QDWo/6Dk
 +RNlaeGekaBUk+QOLb4u+rzZ2eR/pWSm37xuDRAiBCQ+3MgR60gGRaSljpS3IUem
 0FvS6C1HObBCEUXMU2rNv/5cJB5/qrQYa9FEEjRvBTLqgQkdS7yaW/KKuZaNb+Ts
 KiEeWvPrPSZXStfRGy8Wr4eS2rYhxPAikUR+xde9CM+HtMWwKTCTSp8qXrqA92Dj
 Cz9ix01scznuf78QCRDZp09im3lZys8ZQprmPgMxyEscN+CDL7n68wAhmTJq0uo3
 3NqIv7zBQ8wMChj0f0HjwZ0Wrj7BJAveY2Q0RterxdzT4vMKdtNkThX46ISaCoX/
 XQAAhZHemK6MvBJk+LKkqqMgrD+3FAzvY7O+SCyUBAMs4FK1myRJQihdLXHGfiBU
 DMDZE1jsE8qBaeUbz4LIuCy8fx2LhtVwVNwbNIBUZHdyfjxIXnQT/8Cnrgklwy2i
 tnYekhAsHDQY+QDkrvJpc4E1vUtiXwSDI5ErcnWdSzctEOyVeUg7OuuGD4riCd1c
 emdJmtASM1z9Ajqa1dytDxVaF6wjKlbhQgnKamuex5JLGCK6makk8ZoB+DBfKYHD
 VoWummTu8ldf+Dp4ehBh7AbeF2vn4kLqcF1PLRsBO6ytJs4HIt8=
 =5O7h
 -----END PGP SIGNATURE-----

Merge tag 'bpf_res_spin_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf relisient spinlock support from Alexei Starovoitov:
 "This patch set introduces Resilient Queued Spin Lock (or rqspinlock
  with res_spin_lock() and res_spin_unlock() APIs).

  This is a qspinlock variant which recovers the kernel from a stalled
  state when the lock acquisition path cannot make forward progress.
  This can occur when a lock acquisition attempt enters a deadlock
  situation (e.g. AA, or ABBA), or more generally, when the owner of the
  lock (which we’re trying to acquire) isn’t making forward progress.
  Deadlock detection is the main mechanism used to provide instant
  recovery, with the timeout mechanism acting as a final line of
  defense. Detection is triggered immediately when beginning the waiting
  loop of a lock slow path.

  Additionally, BPF programs attached to different parts of the kernel
  can introduce new control flow into the kernel, which increases the
  likelihood of deadlocks in code not written to handle reentrancy.
  There have been multiple syzbot reports surfacing deadlocks in
  internal kernel code due to the diverse ways in which BPF programs can
  be attached to different parts of the kernel. By switching the BPF
  subsystem’s lock usage to rqspinlock, all of these issues are
  mitigated at runtime.

  This spin lock implementation allows BPF maps to become safer and
  remove mechanisms that have fallen short in assuring safety when
  nesting programs in arbitrary ways in the same context or across
  different contexts.

  We run benchmarks that stress locking scalability and perform
  comparison against the baseline (qspinlock). For the rqspinlock case,
  we replace the default qspinlock with it in the kernel, such that all
  spin locks in the kernel use the rqspinlock slow path. As such,
  benchmarks that stress kernel spin locks end up exercising rqspinlock.

  More details in the cover letter in commit 6ffb9017e9 ("Merge branch
  'resilient-queued-spin-lock'")"

* tag 'bpf_res_spin_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (24 commits)
  selftests/bpf: Add tests for rqspinlock
  bpf: Maintain FIFO property for rqspinlock unlock
  bpf: Implement verifier support for rqspinlock
  bpf: Introduce rqspinlock kfuncs
  bpf: Convert lpm_trie.c to rqspinlock
  bpf: Convert percpu_freelist.c to rqspinlock
  bpf: Convert hashtab.c to rqspinlock
  rqspinlock: Add locktorture support
  rqspinlock: Add entry to Makefile, MAINTAINERS
  rqspinlock: Add macros for rqspinlock usage
  rqspinlock: Add basic support for CONFIG_PARAVIRT
  rqspinlock: Add a test-and-set fallback
  rqspinlock: Add deadlock detection and recovery
  rqspinlock: Protect waiters in trylock fallback from stalls
  rqspinlock: Protect waiters in queue from stalls
  rqspinlock: Protect pending bit owners from stalls
  rqspinlock: Hardcode cond_acquire loops for arm64
  rqspinlock: Add support for timeouts
  rqspinlock: Drop PV and virtualization support
  rqspinlock: Add rqspinlock.h header
  ...
2025-03-30 13:06:27 -07:00
Linus Torvalds
fa593d0f96 bpf-next-6.15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfi6ZAACgkQ6rmadz2v
 bTpLOg/+J7xUddPMhlpFAUlifQEadE5hmw6v1tXpM3zyKHzUWJiv/qsx3j8/ckgD
 D+d4P8bqIbI9SSuIS4oZ0+D9pr/g7GYztnoYZmPiYJ7v2AijPuof5dsagFQE8E2y
 rhfbt9KHTMzzkdkTvaAZaITS/HWAoJ2YVRB6gfLex2ghcXYHcgmtKRZniQrbBiFZ
 MIXBN8Rg6HP+pUdIVllSXFcQCb3XIgjPONRAos4hr5tIm+3Ku7Jvkgk2H/9vUcoF
 bdXAcg8xygyH7eY+1l3e7nEPQlG0jUZEsL+tq+vpdoLRLqlIpAUYmwUvqcmq4dPS
 QGFjiUcpDbXlxsUFpzjXHIFto7fXCfND7HEICQPwAncdflIIfYaATSQUfkEexn0a
 wBCFlAChrEzAmg2vFl4EeEr0fdSe/3jswrgKx0m6ctKieMjgloBUeeH4fXOpfkhS
 9tvhuduVFuronlebM8ew4w9T/mBgbyxkE5KkvP4hNeB3ni3N0K6Mary5/u2HyN1e
 lqTlnZxRA4p6lrvxce/mDrR4VSwlKLcSeQVjxAL1afD5KRkuZJnUv7bUhS361vkG
 IjNrQX30EisDAz+X7tMn3ndBf9vVatwFT4+c3yaxlQRor1WofhDfT88HPiyB4QqQ
 Kdx2EHgbQxJp4vkzhp4/OXlTfkihsMEn8egzZuphdPEQ9Y+Jdwg=
 =aN/V
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:
 "For this merge window we're splitting BPF pull request into three for
  higher visibility: main changes, res_spin_lock, try_alloc_pages.

  These are the main BPF changes:

   - Add DFA-based live registers analysis to improve verification of
     programs with loops (Eduard Zingerman)

   - Introduce load_acquire and store_release BPF instructions and add
     x86, arm64 JIT support (Peilin Ye)

   - Fix loop detection logic in the verifier (Eduard Zingerman)

   - Drop unnecesary lock in bpf_map_inc_not_zero() (Eric Dumazet)

   - Add kfunc for populating cpumask bits (Emil Tsalapatis)

   - Convert various shell based tests to selftests/bpf/test_progs
     format (Bastien Curutchet)

   - Allow passing referenced kptrs into struct_ops callbacks (Amery
     Hung)

   - Add a flag to LSM bpf hook to facilitate bpf program signing
     (Blaise Boscaccy)

   - Track arena arguments in kfuncs (Ihor Solodrai)

   - Add copy_remote_vm_str() helper for reading strings from remote VM
     and bpf_copy_from_user_task_str() kfunc (Jordan Rome)

   - Add support for timed may_goto instruction (Kumar Kartikeya
     Dwivedi)

   - Allow bpf_get_netns_cookie() int cgroup_skb programs (Mahe Tardy)

   - Reduce bpf_cgrp_storage_busy false positives when accessing cgroup
     local storage (Martin KaFai Lau)

   - Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko)

   - Allow retrieving BTF data with BTF token (Mykyta Yatsenko)

   - Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix
     (Song Liu)

   - Reject attaching programs to noreturn functions (Yafang Shao)

   - Introduce pre-order traversal of cgroup bpf programs (Yonghong
     Song)"

* tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits)
  selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid
  bpf: Fix out-of-bounds read in check_atomic_load/store()
  libbpf: Add namespace for errstr making it libbpf_errstr
  bpf: Add struct_ops context information to struct bpf_prog_aux
  selftests/bpf: Sanitize pointer prior fclose()
  selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
  selftests/bpf: test_xdp_vlan: Rename BPF sections
  bpf: clarify a misleading verifier error message
  selftests/bpf: Add selftest for attaching fexit to __noreturn functions
  bpf: Reject attaching fexit/fmod_ret to __noreturn functions
  bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
  bpf: Make perf_event_read_output accessible in all program types.
  bpftool: Using the right format specifiers
  bpftool: Add -Wformat-signedness flag to detect format errors
  selftests/bpf: Test freplace from user namespace
  libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
  bpf: Return prog btf_id without capable check
  bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
  bpf, x86: Fix objtool warning for timed may_goto
  bpf: Check map->record at the beginning of check_and_free_fields()
  ...
2025-03-30 12:43:03 -07:00
Linus Torvalds
1a9239bb42 Networking changes for 6.15.
Core & protocols
 ----------------
 
  - Continue Netlink conversions to per-namespace RTNL lock
    (IPv4 routing, routing rules, routing next hops, ARP ioctls).
 
  - Continue extending the use of netdev instance locks. As a driver
    opt-in protect queue operations and (in due course) ethtool
    operations with the instance lock and not RTNL lock.
 
  - Support collecting TCP timestamps (data submitted, sent, acked)
    in BPF, allowing for transparent (to the application) and lower
    overhead tracking of TCP RPC performance.
 
  - Tweak existing networking Rx zero-copy infra to support zero-copy
    Rx via io_uring.
 
  - Optimize MPTCP performance in single subflow mode by 29%.
 
  - Enable GRO on packets which went thru XDP CPU redirect (were queued
    for processing on a different CPU). Improving TCP stream performance
    up to 2x.
 
  - Improve performance of contended connect() by 200% by searching
    for an available 4-tuple under RCU rather than a spin lock.
    Bring an additional 229% improvement by tweaking hash distribution.
 
  - Avoid unconditionally touching sk_tsflags on RX, improving
    performance under UDP flood by as much as 10%.
 
  - Avoid skb_clone() dance in ping_rcv() to improve performance under
    ping flood.
 
  - Avoid FIB lookup in netfilter if socket is available, 20% perf win.
 
  - Rework network device creation (in-kernel) API to more clearly
    identify network namespaces and their roles.
    There are up to 4 namespace roles but we used to have just 2 netns
    pointer arguments, interpreted differently based on context.
 
  - Use sysfs_break_active_protection() instead of trylock to avoid
    deadlocks between unregistering objects and sysfs access.
 
  - Add a new sysctl and sockopt for capping max retransmit timeout
    in TCP.
 
  - Support masking port and DSCP in routing rule matches.
 
  - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
 
  - Support specifying at what time packet should be sent on AF_XDP
    sockets.
 
  - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users.
 
  - Add Netlink YAML spec for WiFi (nl80211) and conntrack.
 
  - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
    which only need to be exported when IPv6 support is built as a module.
 
  - Age FDB entries based on Rx not Tx traffic in VxLAN, similar
    to normal bridging.
 
  - Allow users to specify source port range for GENEVE tunnels.
 
  - netconsole: allow attaching kernel release, CPU ID and task name
    to messages as metadata
 
 Driver API
 ----------
 
  - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
    the SW layers. Delegate the responsibilities to phylink where possible.
    Improve its handling in phylib.
 
  - Support symmetric OR-XOR RSS hashing algorithm.
 
  - Support tracking and preserving IRQ affinity by NAPI itself.
 
  - Support loopback mode speed selection for interface selftests.
 
 Device drivers
 --------------
 
  - Remove the IBM LCS driver for s390.
 
  - Remove the sb1000 cable modem driver.
 
  - Add support for SFP module access over SMBus.
 
  - Add MCTP transport driver for MCTP-over-USB.
 
  - Enable XDP metadata support in multiple drivers.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - add PCIe TLP Processing Hints (TPH) support for new AMD platforms
      - support dumping RoCE queue state for debug
      - opt into instance locking
    - Intel (100G, ice, idpf):
      - ice: rework MSI-X IRQ management and distribution
      - ice: support for E830 devices
      - iavf: add support for Rx timestamping
      - iavf: opt into instance locking
    - nVidia/Mellanox:
      - mlx4: use page pool memory allocator for Rx
      - mlx5: support for one PTP device per hardware clock
      - mlx5: support for 200Gbps per-lane link modes
      - mlx5: move IPSec policy check after decryption
    - AMD/Solarflare:
      - support FW flashing via devlink
    - Cisco (enic):
      - use page pool memory allocator for Rx
      - enable 32, 64 byte CQEs
      - get max rx/tx ring size from the device
    - Meta (fbnic):
      - support flow steering and RSS configuration
      - report queue stats
      - support TCP segmentation
      - support IRQ coalescing
      - support ring size configuration
    - Marvell/Cavium:
      - support AF_XDP
    - Wangxun:
      - support for PTP clock and timestamping
    - Huawei (hibmcge):
      - checksum offload
      - add more statistics
 
  - Ethernet virtual:
    - VirtIO net:
      - aggressively suppress Tx completions, improve perf by 96% with
        1 CPU and 55% with 2 CPUs
      - expose NAPI to IRQ mapping and persist NAPI settings
    - Google (gve):
      - support XDP in DQO RDA Queue Format
      - opt into instance locking
    - Microsoft vNIC:
      - support BIG TCP
 
  - Ethernet NICs consumer, and embedded:
    - Synopsys (stmmac):
      - cleanup Tx and Tx clock setting and other link-focused cleanups
      - enable SGMII and 2500BASEX mode switching for Intel platforms
      - support Sophgo SG2044
    - Broadcom switches (b53):
      - support for BCM53101
    - TI:
      - iep: add perout configuration support
      - icssg: support XDP
    - Cadence (macb):
      - implement BQL
    - Xilinx (axinet):
      - support dynamic IRQ moderation and changing coalescing at runtime
      - implement BQL
      - report standard stats
    - MediaTek:
      - support phylink managed EEE
    - Intel:
      - igc: don't restart the interface on every XDP program change
    - RealTek (r8169):
      - support reading registers of internal PHYs directly
      - increase max jumbo packet size on RTL8125/RTL8126
    - Airoha:
      - support for RISC-V NPU packet processing unit
      - enable scatter-gather and support MTU up to 9kB
    - Tehuti (tn40xx):
      - support cards with TN4010 MAC and an Aquantia AQR105 PHY
 
  - Ethernet PHYs:
    - support for TJA1102S, TJA1121
    - dp83tg720: add randomized polling intervals for link detection
    - dp83822: support changing the transmit amplitude voltage
    - support for LEDs on 88q2xxx
 
  - CAN:
    - canxl: support Remote Request Substitution bit access
    - flexcan: add S32G2/S32G3 SoC
 
  - WiFi:
    - remove cooked monitor support
    - strict mode for better AP testing
    - basic EPCS support
    - OMI RX bandwidth reduction support
    - batman-adv: add support for jumbo frames
 
  - WiFi drivers:
    - RealTek (rtw88):
      - support RTL8814AE and RTL8814AU
    - RealTek (rtw89):
      - switch using wiphy_lock and wiphy_work
      - add BB context to manipulate two PHY as preparation of MLO
      - improve BT-coexistence mechanism to play A2DP smoothly
    - Intel (iwlwifi):
      - add new iwlmld sub-driver for latest HW/FW combinations
    - MediaTek (mt76):
      - preparation for mt7996 Multi-Link Operation (MLO) support
    - Qualcomm/Atheros (ath12k):
      - continued work on MLO
    - Silabs (wfx):
      - Wake-on-WLAN support
 
  - Bluetooth:
    - add support for skb TX SND/COMPLETION timestamping
    - hci_core: enable buffer flow control for SCO/eSCO
    - coredump: log devcd dumps into the monitor
 
  - Bluetooth drivers:
    - intel: add support to configure TX power
    - nxp: handle bootloader error during cmd5 and cmd7
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S
 Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO
 OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA
 DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE
 IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3
 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM
 cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4
 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4
 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0
 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk
 eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv
 P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM=
 =XY0S
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Continue Netlink conversions to per-namespace RTNL lock
     (IPv4 routing, routing rules, routing next hops, ARP ioctls)

   - Continue extending the use of netdev instance locks. As a driver
     opt-in protect queue operations and (in due course) ethtool
     operations with the instance lock and not RTNL lock.

   - Support collecting TCP timestamps (data submitted, sent, acked) in
     BPF, allowing for transparent (to the application) and lower
     overhead tracking of TCP RPC performance.

   - Tweak existing networking Rx zero-copy infra to support zero-copy
     Rx via io_uring.

   - Optimize MPTCP performance in single subflow mode by 29%.

   - Enable GRO on packets which went thru XDP CPU redirect (were queued
     for processing on a different CPU). Improving TCP stream
     performance up to 2x.

   - Improve performance of contended connect() by 200% by searching for
     an available 4-tuple under RCU rather than a spin lock. Bring an
     additional 229% improvement by tweaking hash distribution.

   - Avoid unconditionally touching sk_tsflags on RX, improving
     performance under UDP flood by as much as 10%.

   - Avoid skb_clone() dance in ping_rcv() to improve performance under
     ping flood.

   - Avoid FIB lookup in netfilter if socket is available, 20% perf win.

   - Rework network device creation (in-kernel) API to more clearly
     identify network namespaces and their roles. There are up to 4
     namespace roles but we used to have just 2 netns pointer arguments,
     interpreted differently based on context.

   - Use sysfs_break_active_protection() instead of trylock to avoid
     deadlocks between unregistering objects and sysfs access.

   - Add a new sysctl and sockopt for capping max retransmit timeout in
     TCP.

   - Support masking port and DSCP in routing rule matches.

   - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.

   - Support specifying at what time packet should be sent on AF_XDP
     sockets.

   - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
     users.

   - Add Netlink YAML spec for WiFi (nl80211) and conntrack.

   - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
     which only need to be exported when IPv6 support is built as a
     module.

   - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
     normal bridging.

   - Allow users to specify source port range for GENEVE tunnels.

   - netconsole: allow attaching kernel release, CPU ID and task name to
     messages as metadata

  Driver API:

   - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
     the SW layers. Delegate the responsibilities to phylink where
     possible. Improve its handling in phylib.

   - Support symmetric OR-XOR RSS hashing algorithm.

   - Support tracking and preserving IRQ affinity by NAPI itself.

   - Support loopback mode speed selection for interface selftests.

  Device drivers:

   - Remove the IBM LCS driver for s390

   - Remove the sb1000 cable modem driver

   - Add support for SFP module access over SMBus

   - Add MCTP transport driver for MCTP-over-USB

   - Enable XDP metadata support in multiple drivers

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - add PCIe TLP Processing Hints (TPH) support for new AMD
           platforms
         - support dumping RoCE queue state for debug
         - opt into instance locking
      - Intel (100G, ice, idpf):
         - ice: rework MSI-X IRQ management and distribution
         - ice: support for E830 devices
         - iavf: add support for Rx timestamping
         - iavf: opt into instance locking
      - nVidia/Mellanox:
         - mlx4: use page pool memory allocator for Rx
         - mlx5: support for one PTP device per hardware clock
         - mlx5: support for 200Gbps per-lane link modes
         - mlx5: move IPSec policy check after decryption
      - AMD/Solarflare:
         - support FW flashing via devlink
      - Cisco (enic):
         - use page pool memory allocator for Rx
         - enable 32, 64 byte CQEs
         - get max rx/tx ring size from the device
      - Meta (fbnic):
         - support flow steering and RSS configuration
         - report queue stats
         - support TCP segmentation
         - support IRQ coalescing
         - support ring size configuration
      - Marvell/Cavium:
         - support AF_XDP
      - Wangxun:
         - support for PTP clock and timestamping
      - Huawei (hibmcge):
         - checksum offload
         - add more statistics

   - Ethernet virtual:
      - VirtIO net:
         - aggressively suppress Tx completions, improve perf by 96%
           with 1 CPU and 55% with 2 CPUs
         - expose NAPI to IRQ mapping and persist NAPI settings
      - Google (gve):
         - support XDP in DQO RDA Queue Format
         - opt into instance locking
      - Microsoft vNIC:
         - support BIG TCP

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - cleanup Tx and Tx clock setting and other link-focused
           cleanups
         - enable SGMII and 2500BASEX mode switching for Intel platforms
         - support Sophgo SG2044
      - Broadcom switches (b53):
         - support for BCM53101
      - TI:
         - iep: add perout configuration support
         - icssg: support XDP
      - Cadence (macb):
         - implement BQL
      - Xilinx (axinet):
         - support dynamic IRQ moderation and changing coalescing at
           runtime
         - implement BQL
         - report standard stats
      - MediaTek:
         - support phylink managed EEE
      - Intel:
         - igc: don't restart the interface on every XDP program change
      - RealTek (r8169):
         - support reading registers of internal PHYs directly
         - increase max jumbo packet size on RTL8125/RTL8126
      - Airoha:
         - support for RISC-V NPU packet processing unit
         - enable scatter-gather and support MTU up to 9kB
      - Tehuti (tn40xx):
         - support cards with TN4010 MAC and an Aquantia AQR105 PHY

   - Ethernet PHYs:
      - support for TJA1102S, TJA1121
      - dp83tg720: add randomized polling intervals for link detection
      - dp83822: support changing the transmit amplitude voltage
      - support for LEDs on 88q2xxx

   - CAN:
      - canxl: support Remote Request Substitution bit access
      - flexcan: add S32G2/S32G3 SoC

   - WiFi:
      - remove cooked monitor support
      - strict mode for better AP testing
      - basic EPCS support
      - OMI RX bandwidth reduction support
      - batman-adv: add support for jumbo frames

   - WiFi drivers:
      - RealTek (rtw88):
         - support RTL8814AE and RTL8814AU
      - RealTek (rtw89):
         - switch using wiphy_lock and wiphy_work
         - add BB context to manipulate two PHY as preparation of MLO
         - improve BT-coexistence mechanism to play A2DP smoothly
      - Intel (iwlwifi):
         - add new iwlmld sub-driver for latest HW/FW combinations
      - MediaTek (mt76):
         - preparation for mt7996 Multi-Link Operation (MLO) support
      - Qualcomm/Atheros (ath12k):
         - continued work on MLO
      - Silabs (wfx):
         - Wake-on-WLAN support

   - Bluetooth:
      - add support for skb TX SND/COMPLETION timestamping
      - hci_core: enable buffer flow control for SCO/eSCO
      - coredump: log devcd dumps into the monitor

   - Bluetooth drivers:
      - intel: add support to configure TX power
      - nxp: handle bootloader error during cmd5 and cmd7"

* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
  unix: fix up for "apparmor: add fine grained af_unix mediation"
  mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
  net: usb: asix: ax88772: Increase phy_name size
  net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
  net: libwx: fix Tx L4 checksum
  net: libwx: fix Tx descriptor content for some tunnel packets
  atm: Fix NULL pointer dereference
  net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
  net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
  net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
  net: phy: aquantia: add essential functions to aqr105 driver
  net: phy: aquantia: search for firmware-name in fwnode
  net: phy: aquantia: add probe function to aqr105 for firmware loading
  net: phy: Add swnode support to mdiobus_scan
  gve: add XDP DROP and PASS support for DQ
  gve: update XDP allocation path support RX buffer posting
  gve: merge packet buffer size fields
  gve: update GQ RX to use buf_size
  gve: introduce config-based allocation for XDP
  gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
  ...
2025-03-26 21:48:21 -07:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Linus Torvalds
e34c38057a [ Merge note: this pull request depends on you having merged
two locking commits in the locking tree,
 	      part of the locking-core-2025-03-22 pull request. ]
 
 x86 CPU features support:
   - Generate the <asm/cpufeaturemasks.h> header based on build config
     (H. Peter Anvin, Xin Li)
   - x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
   - Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
   - Enable modifying CPU bug flags with '{clear,set}puid='
     (Brendan Jackman)
   - Utilize CPU-type for CPU matching (Pawan Gupta)
   - Warn about unmet CPU feature dependencies (Sohil Mehta)
   - Prepare for new Intel Family numbers (Sohil Mehta)
 
 Percpu code:
   - Standardize & reorganize the x86 percpu layout and
     related cleanups (Brian Gerst)
   - Convert the stackprotector canary to a regular percpu
     variable (Brian Gerst)
   - Add a percpu subsection for cache hot data (Brian Gerst)
   - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
   - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
 
 MM:
   - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction
     (Rik van Riel)
   - Rework ROX cache to avoid writable copy (Mike Rapoport)
   - PAT: restore large ROX pages after fragmentation
     (Kirill A. Shutemov, Mike Rapoport)
   - Make memremap(MEMREMAP_WB) map memory as encrypted by default
     (Kirill A. Shutemov)
   - Robustify page table initialization (Kirill A. Shutemov)
   - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
   - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
     (Matthew Wilcox)
 
 KASLR:
   - x86/kaslr: Reduce KASLR entropy on most x86 systems,
     to support PCI BAR space beyond the 10TiB region
     (CONFIG_PCI_P2PDMA=y) (Balbir Singh)
 
 CPU bugs:
   - Implement FineIBT-BHI mitigation (Peter Zijlstra)
   - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
   - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta)
   - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta)
 
 System calls:
   - Break up entry/common.c (Brian Gerst)
   - Move sysctls into arch/x86 (Joel Granados)
 
 Intel LAM support updates: (Maciej Wieczor-Retman)
   - selftests/lam: Move cpu_has_la57() to use cpuinfo flag
   - selftests/lam: Skip test if LAM is disabled
   - selftests/lam: Test get_user() LAM pointer handling
 
 AMD SMN access updates:
   - Add SMN offsets to exclusive region access (Mario Limonciello)
   - Add support for debugfs access to SMN registers (Mario Limonciello)
   - Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
 
 Power management updates: (Patryk Wlazlyn)
   - Allow calling mwait_play_dead with an arbitrary hint
   - ACPI/processor_idle: Add FFH state handling
   - intel_idle: Provide the default enter_dead() handler
   - Eliminate mwait_play_dead_cpuid_hint()
 
 Bootup:
 
 Build system:
   - Raise the minimum GCC version to 8.1 (Brian Gerst)
   - Raise the minimum LLVM version to 15.0.0
     (Nathan Chancellor)
 
 Kconfig: (Arnd Bergmann)
   - Add cmpxchg8b support back to Geode CPUs
   - Drop 32-bit "bigsmp" machine support
   - Rework CONFIG_GENERIC_CPU compiler flags
   - Drop configuration options for early 64-bit CPUs
   - Remove CONFIG_HIGHMEM64G support
   - Drop CONFIG_SWIOTLB for PAE
   - Drop support for CONFIG_HIGHPTE
   - Document CONFIG_X86_INTEL_MID as 64-bit-only
   - Remove old STA2x11 support
   - Only allow CONFIG_EISA for 32-bit
 
 Headers:
   - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers
     (Thomas Huth)
 
 Assembly code & machine code patching:
   - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf)
   - x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
   - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
   - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
   - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
   - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
     (Uros Bizjak)
   - Use named operands in inline asm (Uros Bizjak)
   - Improve performance by using asm_inline() for atomic locking instructions
     (Uros Bizjak)
 
 Earlyprintk:
   - Harden early_serial (Peter Zijlstra)
 
 NMI handler:
   - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
     (Waiman Long)
 
 Miscellaneous fixes and cleanups:
 
   - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel,
     Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst,
     Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin,
     Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport,
     Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker,
     Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra,
     Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
     Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak,
     Vitaly Kuznetsov, Xin Li, liuye.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfenkQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g1FRAAi6OFTSn/5aeLMI0IMNBxJ6ddQiFc3imd
 7+C/vU5nul4CyDs8mKyj/+f/DDrbkG9lKz3VG631Yl237lXHjD8XWcVMeC/1z/q0
 3zInDIloE9/nBHRPkF6F7fARBLBZ0LFgaBsGrCo7mwpGybiQdqGcqcxllvTbtXaw
 OHta4q6ok+lBDNlfc0v6H4cRnzhmmlKu6Ng0j6UI3V7uFhi3vtxas32ltDQtzorq
 2+jbV6/+kbrrv+xPC+jlzOFhTEKRupNPQXmvyQteoQg6G3kqAKMDvBthGXd1rHuX
 Qa+BoDIifE/2NiVeRwNrhoqYH/pHCzUzDREW5IW8+ca+4XNKuzAC6EuC8CeCzyK1
 q8ZjZjooQW4zEeVFeJYllHONzJYfxfSH5CLsnbcuhq99yfGlrQhF1qL72/Omn1w/
 DfPJM8Zt5zyKvLqUg3Md+fkVCO2wyDNhB61QPzRgHF+yD+rvuDpoqvUWir+w7cSn
 fwEDVZGXlFx6dumtSrqRaTd1nvFt80s8yP2ll09DMvGQ8D/yruS7hndGAmmJVCSW
 NAfd8pSjq5v2+ux2UR92/Cc3VF3SjaUqHBOp/Nq9rESya18ZVa3cJpHhVYYtPIVf
 THW0h07RIkGVKs1uq+5ekLCr/8uAZg58UPIqmhTuW0ttymRHCNfohR45FQZzy+0M
 tJj1oc2TIZw=
 =Dcb3
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:
 "x86 CPU features support:
   - Generate the <asm/cpufeaturemasks.h> header based on build config
     (H. Peter Anvin, Xin Li)
   - x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
   - Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
   - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
     Jackman)
   - Utilize CPU-type for CPU matching (Pawan Gupta)
   - Warn about unmet CPU feature dependencies (Sohil Mehta)
   - Prepare for new Intel Family numbers (Sohil Mehta)

  Percpu code:
   - Standardize & reorganize the x86 percpu layout and related cleanups
     (Brian Gerst)
   - Convert the stackprotector canary to a regular percpu variable
     (Brian Gerst)
   - Add a percpu subsection for cache hot data (Brian Gerst)
   - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
   - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)

  MM:
   - Add support for broadcast TLB invalidation using AMD's INVLPGB
     instruction (Rik van Riel)
   - Rework ROX cache to avoid writable copy (Mike Rapoport)
   - PAT: restore large ROX pages after fragmentation (Kirill A.
     Shutemov, Mike Rapoport)
   - Make memremap(MEMREMAP_WB) map memory as encrypted by default
     (Kirill A. Shutemov)
   - Robustify page table initialization (Kirill A. Shutemov)
   - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
   - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
     (Matthew Wilcox)

  KASLR:
   - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
     BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
     Singh)

  CPU bugs:
   - Implement FineIBT-BHI mitigation (Peter Zijlstra)
   - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
   - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
     Gupta)
   - RFDS: Exclude P-only parts from the RFDS affected list (Pawan
     Gupta)

  System calls:
   - Break up entry/common.c (Brian Gerst)
   - Move sysctls into arch/x86 (Joel Granados)

  Intel LAM support updates: (Maciej Wieczor-Retman)
   - selftests/lam: Move cpu_has_la57() to use cpuinfo flag
   - selftests/lam: Skip test if LAM is disabled
   - selftests/lam: Test get_user() LAM pointer handling

  AMD SMN access updates:
   - Add SMN offsets to exclusive region access (Mario Limonciello)
   - Add support for debugfs access to SMN registers (Mario Limonciello)
   - Have HSMP use SMN through AMD_NODE (Yazen Ghannam)

  Power management updates: (Patryk Wlazlyn)
   - Allow calling mwait_play_dead with an arbitrary hint
   - ACPI/processor_idle: Add FFH state handling
   - intel_idle: Provide the default enter_dead() handler
   - Eliminate mwait_play_dead_cpuid_hint()

  Build system:
   - Raise the minimum GCC version to 8.1 (Brian Gerst)
   - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)

  Kconfig: (Arnd Bergmann)
   - Add cmpxchg8b support back to Geode CPUs
   - Drop 32-bit "bigsmp" machine support
   - Rework CONFIG_GENERIC_CPU compiler flags
   - Drop configuration options for early 64-bit CPUs
   - Remove CONFIG_HIGHMEM64G support
   - Drop CONFIG_SWIOTLB for PAE
   - Drop support for CONFIG_HIGHPTE
   - Document CONFIG_X86_INTEL_MID as 64-bit-only
   - Remove old STA2x11 support
   - Only allow CONFIG_EISA for 32-bit

  Headers:
   - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
     headers (Thomas Huth)

  Assembly code & machine code patching:
   - x86/alternatives: Simplify alternative_call() interface (Josh
     Poimboeuf)
   - x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
   - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
   - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
   - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
   - x86/kexec: Merge x86_32 and x86_64 code using macros from
     <asm/asm.h> (Uros Bizjak)
   - Use named operands in inline asm (Uros Bizjak)
   - Improve performance by using asm_inline() for atomic locking
     instructions (Uros Bizjak)

  Earlyprintk:
   - Harden early_serial (Peter Zijlstra)

  NMI handler:
   - Add an emergency handler in nmi_desc & use it in
     nmi_shootdown_cpus() (Waiman Long)

  Miscellaneous fixes and cleanups:
   - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
     Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
     Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
     Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
     Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
     Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
     Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
     Kuznetsov, Xin Li, liuye"

* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
  zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
  x86/asm: Make asm export of __ref_stack_chk_guard unconditional
  x86/mm: Only do broadcast flush from reclaim if pages were unmapped
  perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
  perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
  x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
  x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
  x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
  x86/asm: Use asm_inline() instead of asm() in clwb()
  x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
  x86/hweight: Use asm_inline() instead of asm()
  x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
  x86/hweight: Use named operands in inline asm()
  x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
  x86/head/64: Avoid Clang < 17 stack protector in startup code
  x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
  x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
  x86/cpu/intel: Limit the non-architectural constant_tsc model checks
  x86/mm/pat: Replace Intel x86_model checks with VFM ones
  x86/cpu/intel: Fix fast string initialization for extended Families
  ...
2025-03-24 22:06:11 -07:00
Kohei Enju
c03bb2fa32 bpf: Fix out-of-bounds read in check_atomic_load/store()
syzbot reported the following splat [0].

In check_atomic_load/store(), register validity is not checked before
atomic_ptr_type_ok(). This causes the out-of-bounds read in is_ctx_reg()
called from atomic_ptr_type_ok() when the register number is MAX_BPF_REG
or greater.

Call check_load_mem()/check_store_reg() before atomic_ptr_type_ok()
to avoid the OOB read.

However, some tests introduced by commit ff3afe5da9 ("selftests/bpf: Add
selftests for load-acquire and store-release instructions") assume
calling atomic_ptr_type_ok() before checking register validity.
Therefore the swapping of order unintentionally changes verifier messages
of these tests.

For example in the test load_acquire_from_pkt_pointer(), expected message
is 'BPF_ATOMIC loads from R2 pkt is not allowed' although actual messages
are different.

  validate_msgs:FAIL:754 expect_msg
  VERIFIER LOG:
  =============
  Global function load_acquire_from_pkt_pointer() doesn't return scalar. Only those are supported.
  0: R1=ctx() R10=fp0
  ; asm volatile ( @ verifier_load_acquire.c:140
  0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx() R2_w=pkt(r=0)
  1: (d3) r0 = load_acquire((u8 *)(r2 +0))
  invalid access to packet, off=0 size=1, R2(id=0,off=0,r=0)
  R2 offset is outside of the packet
  processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  =============
  EXPECTED   SUBSTR: 'BPF_ATOMIC loads from R2 pkt is not allowed'
  #505/19  verifier_load_acquire/load-acquire from pkt pointer:FAIL

This is because instructions in the test don't pass check_load_mem() and
therefore don't enter the atomic_ptr_type_ok() path.
In this case, we have to modify instructions so that they pass the
check_load_mem() and trigger atomic_ptr_type_ok().
Similarly for store-release tests, we need to modify instructions so that
they pass check_store_reg().

Like load_acquire_from_pkt_pointer(), modify instructions in:
  load_acquire_from_sock_pointer()
  store_release_to_ctx_pointer()
  store_release_to_pkt_pointer()

Also in store_release_to_sock_pointer(), check_store_reg() returns error
early and atomic_ptr_type_ok() is not triggered, since write to sock
pointer is not possible in general.
We might be able to remove the test, but for now let's leave it and just
change the expected message.

[0]
 BUG: KASAN: slab-out-of-bounds in is_ctx_reg kernel/bpf/verifier.c:6185 [inline]
 BUG: KASAN: slab-out-of-bounds in atomic_ptr_type_ok+0x3d7/0x550 kernel/bpf/verifier.c:6223
 Read of size 4 at addr ffff888141b0d690 by task syz-executor143/5842

 CPU: 1 UID: 0 PID: 5842 Comm: syz-executor143 Not tainted 6.14.0-rc3-syzkaller-gf28214603dc6 #0
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
  print_address_description mm/kasan/report.c:408 [inline]
  print_report+0x16e/0x5b0 mm/kasan/report.c:521
  kasan_report+0x143/0x180 mm/kasan/report.c:634
  is_ctx_reg kernel/bpf/verifier.c:6185 [inline]
  atomic_ptr_type_ok+0x3d7/0x550 kernel/bpf/verifier.c:6223
  check_atomic_store kernel/bpf/verifier.c:7804 [inline]
  check_atomic kernel/bpf/verifier.c:7841 [inline]
  do_check+0x89dd/0xedd0 kernel/bpf/verifier.c:19334
  do_check_common+0x1678/0x2080 kernel/bpf/verifier.c:22600
  do_check_main kernel/bpf/verifier.c:22691 [inline]
  bpf_check+0x165c8/0x1cca0 kernel/bpf/verifier.c:23821

Reported-by: syzbot+a5964227adc0f904549c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a5964227adc0f904549c
Tested-by: syzbot+a5964227adc0f904549c@syzkaller.appspotmail.com
Fixes: e24bbad29a8d ("bpf: Introduce load-acquire and store-release instructions")
Fixes: ff3afe5da9 ("selftests/bpf: Add selftests for load-acquire and store-release instructions")
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250322045340.18010-5-enjuk@amazon.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-22 06:15:57 -07:00
Juntong Deng
51d65049cd bpf: Add struct_ops context information to struct bpf_prog_aux
This patch adds struct_ops context information to struct bpf_prog_aux.

This context information will be used in the kfunc filter.

Currently the added context information includes struct_ops member
offset and a pointer to struct bpf_struct_ops.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20250319215358.2287371-2-ameryhung@gmail.com
2025-03-20 16:54:41 -07:00
Kumar Kartikeya Dwivedi
ea21771c07 bpf: Maintain FIFO property for rqspinlock unlock
Since out-of-order unlocks are unsupported for rqspinlock, and irqsave
variants enforce strict FIFO ordering anyway, make the same change for
normal non-irqsave variants, such that FIFO ordering is enforced.

Two new verifier state fields (active_lock_id, active_lock_ptr) are used
to denote the top of the stack, and prev_id and prev_ptr are ascertained
whenever popping the topmost entry through an unlock.

Take special care to make these fields part of the state comparison in
refsafe.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-25-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:06 -07:00
Kumar Kartikeya Dwivedi
0de2046137 bpf: Implement verifier support for rqspinlock
Introduce verifier-side support for rqspinlock kfuncs. The first step is
allowing bpf_res_spin_lock type to be defined in map values and
allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK
field to recognize and validate.

Any object cannot have both bpf_spin_lock and bpf_res_spin_lock, only
one of them (and at most one of them per-object, like before) must be
present. The bpf_res_spin_lock can also be used to protect objects that
require lock protection for their kfuncs, like BPF rbtree and linked
list.

The verifier plumbing to simulate success and failure cases when calling
the kfuncs is done by pushing a new verifier state to the verifier state
stack which will verify the failure case upon calling the kfunc. The
path where success is indicated creates all lock reference state and IRQ
state (if necessary for irqsave variants). In the case of failure, the
state clears the registers r0-r5, sets the return value, and skips kfunc
processing, proceeding to the next instruction.

When marking the return value for success case, the value is marked as
0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program,
whenever user checks the return value as 'if (ret)' or 'if (ret < 0)'
the verifier never traverses such branches for success cases, and would
be aware that the lock is not held in such cases.

We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs
are invoked. We introduce a kfunc_class state to avoid mixing lock
irqrestore kfuncs with IRQ state created by bpf_local_irq_save.

With all this infrastructure, these kfuncs become usable in programs
while satisfying all safety properties required by the kernel.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-24-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:06 -07:00
Kumar Kartikeya Dwivedi
97eb35f3ad bpf: Introduce rqspinlock kfuncs
Introduce four new kfuncs, bpf_res_spin_lock, and bpf_res_spin_unlock,
and their irqsave/irqrestore variants, which wrap the rqspinlock APIs.
bpf_res_spin_lock returns a conditional result, depending on whether the
lock was acquired (NULL is returned when lock acquisition succeeds,
non-NULL upon failure). The memory pointed to by the returned pointer
upon failure can be dereferenced after the NULL check to obtain the
error code.

Instead of using the old bpf_spin_lock type, introduce a new type with
the same layout, and the same alignment, but a different name to avoid
type confusion.

Preemption is disabled upon successful lock acquisition, however IRQs
are not. Special kfuncs can be introduced later to allow disabling IRQs
when taking a spin lock. Resilient locks are safe against AA deadlocks,
hence not disabling IRQs currently does not allow violation of kernel
safety.

__irq_flag annotation is used to accept IRQ flags for the IRQ-variants,
with the same semantics as existing bpf_local_irq_{save, restore}.

These kfuncs will require additional verifier-side support in subsequent
commits, to allow programs to hold multiple locks at the same time.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-23-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:06 -07:00
Kumar Kartikeya Dwivedi
47979314c0 bpf: Convert lpm_trie.c to rqspinlock
Convert all LPM trie usage of raw_spinlock to rqspinlock.

Note that rcu_dereference_protected in trie_delete_elem is switched over
to plain rcu_dereference, the RCU read lock should be held from BPF
program side or eBPF syscall path, and the trie->lock is just acquired
before the dereference. It is not clear the reason the protected variant
was used from the commit history, but the above reasoning makes sense so
switch over.

Closes: https://lore.kernel.org/lkml/000000000000adb08b061413919e@google.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-22-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
f2ac0e5d1c bpf: Convert percpu_freelist.c to rqspinlock
Convert the percpu_freelist.c code to use rqspinlock, and remove the
extralist fallback and trylock-based acquisitions to avoid deadlocks.

Key thing to note is the retained while (true) loop to search through
other CPUs when failing to push a node due to locking errors. This
retains the behavior of the old code, where it would keep trying until
it would be able to successfully push the node back into the freelist of
a CPU.

Technically, we should start iteration for this loop from
raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus,
we skip execution in the loop body instead.

Closes: https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com
Closes: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-21-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
4fa8d68aa5 bpf: Convert hashtab.c to rqspinlock
Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed
per-cpu counter crud from the code base which is no longer necessary.

Closes: https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@google.com
Closes: https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@google.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-20-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
e2082e32fd rqspinlock: Add entry to Makefile, MAINTAINERS
Ensure that the rqspinlock code is only built when the BPF subsystem is
compiled in. Depending on queued spinlock support, we may or may not end
up building the queued spinlock slowpath, and instead fallback to the
test-and-set implementation. Also add entries to MAINTAINERS file.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-18-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
ecbd804752 rqspinlock: Add basic support for CONFIG_PARAVIRT
We ripped out PV and virtualization related bits from rqspinlock in an
earlier commit, however, a fair lock performs poorly within a virtual
machine when the lock holder is preempted. As such, retain the
virt_spin_lock fallback to test and set lock, but with timeout and
deadlock detection. We can do this by simply depending on the
resilient_tas_spin_lock implementation from the previous patch.

We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that
requires more involved algorithmic changes and introduces more
complexity. It can be done when the need arises in the future.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
c9102a68c0 rqspinlock: Add a test-and-set fallback
Include a test-and-set fallback when queued spinlock support is not
available. Introduce a rqspinlock type to act as a fallback when
qspinlock support is absent.

Include ifdef guards to ensure the slow path in this file is only
compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add
further logic to ensure fallback to the test-and-set implementation
when queued spinlock support is unavailable on an architecture.

Unlike other waiting loops in rqspinlock code, the one for test-and-set
has no theoretical upper bound under contention, therefore we need a
longer timeout than usual. Bump it up to a second in this case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
31158ad02d rqspinlock: Add deadlock detection and recovery
While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/4 sec can be excessively long for some use
cases.  Additionally, custom timeouts may exacerbate recovery time.

Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.

A 'held_locks' table is maintained per-CPU where the entry at the bottom
denotes a lock being waited for or already taken. Entries coming before
it denote locks that are already held. The current CPU's table can thus
be looked at to detect AA deadlocks. The tables from other CPUs can be
looked at to discover ABBA situations. Finally, when a matching entry
for the lock being taken on the current CPU is found on some other CPU,
a deadlock situation is detected. This function can take a long time,
therefore the lock word is constantly polled in each loop iteration to
ensure we can preempt detection and proceed with lock acquisition, using
the is_lock_released check.

We set 'spin' member of rqspinlock_timeout struct to 0 to trigger
deadlock checks immediately to perform faster recovery.

Note: Extending lock word size by 4 bytes to record owner CPU can allow
faster detection for ABBA. It is typically the owner which participates
in a ABBA situation. However, to keep compatibility with existing lock
words in the kernel (struct qspinlock), and given deadlocks are a rare
event triggered by bugs, we choose to favor compatibility over faster
detection.

The release_held_lock_entry function requires an smp_wmb, while the
release store on unlock will provide the necessary ordering for us. Add
comments to document the subtleties of why this is correct. It is
possible for stores to be reordered still, but in the context of the
deadlock detection algorithm, a release barrier is sufficient and
needn't be stronger for unlock's case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
3bb159366a rqspinlock: Protect waiters in trylock fallback from stalls
When we run out of maximum rqnodes, the original queued spin lock slow
path falls back to a try lock. In such a case, we are again susceptible
to stalls in case the lock owner fails to make progress. We use the
timeout as a fallback to break out of this loop and return to the
caller. This is a fallback for an extreme edge case, when on the same
CPU we run out of all 4 qnodes. When could this happen? We are in slow
path in task context, we get interrupted by an IRQ, which while in the
slow path gets interrupted by an NMI, whcih in the slow path gets
another nested NMI, which enters the slow path. All of the interruptions
happen after node->count++.

We use RES_DEF_TIMEOUT as our spinning duration, but in the case of this
fallback, no fairness is guaranteed, so the duration may be too small
for contended cases, as the waiting time is not bounded. Since this is
an extreme corner case, let's just prefer timing out instead of
attempting to spin for longer.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
164c246571 rqspinlock: Protect waiters in queue from stalls
Implement the wait queue cleanup algorithm for rqspinlock. There are
three forms of waiters in the original queued spin lock algorithm. The
first is the waiter which acquires the pending bit and spins on the lock
word without forming a wait queue. The second is the head waiter that is
the first waiter heading the wait queue. The third form is of all the
non-head waiters queued behind the head, waiting to be signalled through
their MCS node to overtake the responsibility of the head.

In this commit, we are concerned with the second and third kind. First,
we augment the waiting loop of the head of the wait queue with a
timeout. When this timeout happens, all waiters part of the wait queue
will abort their lock acquisition attempts. This happens in three steps.
First, the head breaks out of its loop waiting for pending and locked
bits to turn to 0, and non-head waiters break out of their MCS node spin
(more on that later). Next, every waiter (head or non-head) attempts to
check whether they are also the tail waiter, in such a case they attempt
to zero out the tail word and allow a new queue to be built up for this
lock. If they succeed, they have no one to signal next in the queue to
stop spinning. Otherwise, they signal the MCS node of the next waiter to
break out of its spin and try resetting the tail word back to 0. This
goes on until the tail waiter is found. In case of races, the new tail
will be responsible for performing the same task, as the old tail will
then fail to reset the tail word and wait for its next pointer to be
updated before it signals the new tail to do the same.

We terminate the whole wait queue because of two main reasons. Firstly,
we eschew per-waiter timeouts with one applied at the head of the wait
queue.  This allows everyone to break out faster once we've seen the
owner / pending waiter not responding for the timeout duration from the
head.  Secondly, it avoids complicated synchronization, because when not
leaving in FIFO order, prev's next pointer needs to be fixed up etc.

Lastly, all of these waiters release the rqnode and return to the
caller. This patch underscores the point that rqspinlock's timeout does
not apply to each waiter individually, and cannot be relied upon as an
upper bound. It is possible for the rqspinlock waiters to return early
from a failed lock acquisition attempt as soon as stalls are detected.

The head waiter cannot directly WRITE_ONCE the tail to zero, as it may
race with a concurrent xchg and a non-head waiter linking its MCS node
to the head's MCS node through 'prev->next' assignment.

One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum
duration for the waiting loop (for the wait queue head), since we may
have both the owner and pending bit waiter ahead of us, and in the worst
case, need to span their maximum permitted critical section lengths.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
337ffea51a rqspinlock: Protect pending bit owners from stalls
The pending bit is used to avoid queueing in case the lock is
uncontended, and has demonstrated benefits for the 2 contender scenario,
esp. on x86. In case the pending bit is acquired and we wait for the
locked bit to disappear, we may get stuck due to the lock owner not
making progress. Hence, this waiting loop must be protected with a
timeout check.

To perform a graceful recovery once we decide to abort our lock
acquisition attempt in this case, we must unset the pending bit since we
own it. All waiters undoing their changes and exiting gracefully allows
the lock word to be restored to the unlocked state once all participants
(owner, waiters) have been recovered, and the lock remains usable.
Hence, set the pending bit back to zero before returning to the caller.

Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout
event statistics.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:05 -07:00
Kumar Kartikeya Dwivedi
ebababcd03 rqspinlock: Hardcode cond_acquire loops for arm64
Currently, for rqspinlock usage, the implementation of
smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
susceptible to stalls on arm64, because they do not guarantee that the
conditional expression will be repeatedly invoked if the address being
loaded from is not written to by other CPUs. When support for
event-streams is absent (which unblocks stuck WFE-based loops every
~100us), we may end up being stuck forever.

This causes a problem for us, as we need to repeatedly invoke the
RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
expires.

Let us import the smp_cond_load_acquire_timewait implementation Ankur is
proposing in [0], and then fallback to it once it is merged.

While we rely on the implementation to amortize the cost of sampling
check_timeout for us, it will not happen when event stream support is
unavailable. This is not the common case, and it would be difficult to
fit our logic in the time_expr_ns >= time_limit_ns comparison, hence
just let it be.

  [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com

Cc: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:04 -07:00
Kumar Kartikeya Dwivedi
14c48ee814 rqspinlock: Add support for timeouts
Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect
when the timeout has expired for the slow path to return an error. It
depends on being passed two variables initialized to 0: ts, ret. The
'ts' parameter is of type rqspinlock_timeout.

This macro resolves to the (ret) expression so that it can be used in
statements like smp_cond_load_acquire to break the waiting loop
condition.

The 'spin' member is used to amortize the cost of checking time by
dispatching to the implementation every 64k iterations. The
'timeout_end' member is used to keep track of the timestamp that denotes
the end of the waiting period. The 'ret' parameter denotes the status of
the timeout, and can be checked in the slow path to detect timeouts
after waiting loops.

The 'duration' member is used to store the timeout duration for each
waiting loop. The default timeout value defined in the header
(RES_DEF_TIMEOUT) is 0.25 seconds.

This macro will be used as a condition for waiting loops in the slow
path.  Since each waiting loop applies a fresh timeout using the same
rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the
values can be easily reinitialized to the default state.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:04 -07:00
Kumar Kartikeya Dwivedi
a926d09922 rqspinlock: Drop PV and virtualization support
Changes to rqspinlock in subsequent commits will be algorithmic
modifications, which won't remain in agreement with the implementations
of paravirt spin lock and virt_spin_lock support. These future changes
include measures for terminating waiting loops in slow path after a
certain point. While using a fair lock like qspinlock directly inside
virtual machines leads to suboptimal performance under certain
conditions, we cannot use the existing virtualization support before we
make it resilient as well.  Therefore, drop it for now.

Note that we need to drop qspinlock_stat.h, as it's only relevant in
case of CONFIG_PARAVIRT_SPINLOCKS=y, but we need to keep lock_events.h
in the includes, which was indirectly pulled in before.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:04 -07:00
Kumar Kartikeya Dwivedi
30ff133277 rqspinlock: Add rqspinlock.h header
This header contains the public declarations usable in the rest of the
kernel for rqspinlock.

Let's also type alias qspinlock to rqspinlock_t to ensure consistent use
of the new lock type. We want to remove dependence on the qspinlock type
in later patches as we need to provide a test-and-set fallback, hence
begin abstracting away from now onwards.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:04 -07:00
Kumar Kartikeya Dwivedi
a8fcf2a39b locking: Copy out qspinlock.c to kernel/bpf/rqspinlock.c
In preparation for introducing a new lock implementation, Resilient
Queued Spin Lock, or rqspinlock, we first begin our modifications by
using the existing qspinlock.c code as the base. Simply copy the code to
a new file and rename functions and variables from 'queued' to
'resilient_queued'.

Since we place the file in kernel/bpf, include needs to be relative.

This helps each subsequent commit in clearly showing how and where the
code is being changed. The only change after a literal copy in this
commit is renaming the functions where necessary, and rename qnodes to
rqnodes. Let's also use EXPORT_SYMBOL_GPL for rqspinlock slowpath.

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19 08:03:04 -07:00
Andrea Terzolo
a2598045ea bpf: clarify a misleading verifier error message
The current verifier error message states that tail_calls are not
allowed in non-JITed programs with BPF-to-BPF calls. While this is
accurate, it is not the only scenario where this restriction applies.
Some architectures do not support this feature combination even when
programs are JITed. This update improves the error message to better
reflect these limitations.

Suggested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrea Terzolo <andreaterzolo3@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20250318083551.8192-1-andreaterzolo3@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18 19:11:23 -07:00
Yafang Shao
cfe816d469 bpf: Reject attaching fexit/fmod_ret to __noreturn functions
If we attach fexit/fmod_ret to __noreturn functions, it will cause an
issue that the bpf trampoline image will be left over even if the bpf
link has been destroyed. Take attaching do_exit() with fexit for example.
The fexit works as follows,

  bpf_trampoline
  + __bpf_tramp_enter
    + percpu_ref_get(&tr->pcref);

  + call do_exit()

  + __bpf_tramp_exit
    + percpu_ref_put(&tr->pcref);

Since do_exit() never returns, the refcnt of the trampoline image is
never decremented, preventing it from being freed. That can be verified
with as follows,

  $ bpftool link show                                   <<<< nothing output
  $ grep "bpf_trampoline_[0-9]" /proc/kallsyms
  ffffffffc04cb000 t bpf_trampoline_6442526459    [bpf] <<<< leftover

In this patch, all functions annotated with __noreturn are rejected, except
for the following cases:
- Functions that result in a system reboot, such as panic,
  machine_real_restart and rust_begin_unwind
- Functions that are never executed by tasks, such as rest_init and
  cpu_startup_entry
- Functions implemented in assembly, such as rewind_stack_and_make_dead and
  xen_cpu_bringup_again, lack an associated BTF ID.

With this change, attaching fexit probes to functions like do_exit() will
be rejected.

$ ./fexit
libbpf: prog 'fexit': BPF program load failed: -EINVAL
libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG --
Attaching fexit/fmod_ret to __noreturn functions is rejected.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20250318114447.75484-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18 19:07:18 -07:00
Martin KaFai Lau
f4edc66e48 bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
The current cgrp storage has a percpu counter, bpf_cgrp_storage_busy,
to detect potential deadlock at a spin_lock that the local storage
acquires during new storage creation.

There are false positives. It turns out to be too noisy in
production. For example, a bpf prog may be doing a
bpf_cgrp_storage_get on map_a. An IRQ comes in and triggers
another bpf_cgrp_storage_get on a different map_b. It will then
trigger the false positive deadlock check in the percpu counter.
On top of that, both are doing lookup only and no need to create
new storage, so practically it does not need to acquire
the spin_lock.

The bpf_task_storage_get already has a strategy to minimize this
false positive by only failing if the bpf_task_storage_get needs
to create a new storage and the percpu counter is busy. Creating
a new storage is the only time it must acquire the spin_lock.

This patch borrows the same idea. Unlike task storage that
has a separate variant for tracing (_recur) and non-tracing, this
patch stays with one bpf_cgrp_storage_get helper to keep it simple
for now in light of the upcoming res_spin_lock.

The variable could potentially use a better name noTbusy instead
of nobusy. This patch follows the same naming in
bpf_task_storage_get for now.

I have tested it by temporarily adding noinline to
the cgroup_storage_lookup(), traced it by fentry, and the fentry
program succeeded in calling bpf_cgrp_storage_get().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250318182759.3676094-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18 19:05:46 -07:00
Emil Tsalapatis
ae0a457f5d bpf: Make perf_event_read_output accessible in all program types.
The perf_event_read_event_output helper is currently only available to
tracing protrams, but is useful for other BPF programs like sched_ext
schedulers. When the helper is available, provide its bpf_func_proto
directly from the bpf base_proto.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18 10:21:59 -07:00
Jiayuan Chen
3775be3417 bpftool: Using the right format specifiers
Fixed some formatting specifiers errors, such as using %d for int and %u
for unsigned int, as well as other byte-length types.

Perform type cast using the type derived from the data type itself, for
example, if it's originally an int, it will be cast to unsigned int if
forced to unsigned.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250311112809.81901-3-jiayuan.chen@linux.dev
2025-03-17 13:50:56 -07:00
Mykyta Yatsenko
07651ccda9 bpf: Return prog btf_id without capable check
Return prog's btf_id from bpf_prog_get_info_by_fd regardless of capable
check. This patch enables scenario, when freplace program, running
from user namespace, requires to query target prog's btf.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-3-mykyta.yatsenko5@gmail.com
2025-03-17 13:45:12 -07:00
Mykyta Yatsenko
0de445d18e bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not
allow running it from user namespace. This creates a problem when
freplace program running from user namespace needs to query target
program BTF.
This patch relaxes capable check from CAP_SYS_ADMIN to CAP_BPF and adds
support for BPF token that can be passed in attributes to syscall.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250317174039.161275-2-mykyta.yatsenko5@gmail.com
2025-03-17 13:45:11 -07:00
Hou Tao
bb2243f432 bpf: Check map->record at the beginning of check_and_free_fields()
When there are no special fields in the map value, there is no need to
invoke bpf_obj_free_fields(). Therefore, checking the validity of
map->record in advance.

After the change, the benchmark result of the per-cpu update case in
map_perf_test increased by 40% under a 16-CPU VM.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250315150930.1511727-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 12:06:50 -07:00
Arnd Bergmann
38c6104e0b bpf: preload: Add MODULE_DESCRIPTION
Modpost complains when extra warnings are enabled:

WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/bpf/preload/bpf_preload.o

Add a description from the Kconfig help text.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250310134920.4123633-1-arnd@kernel.org

----
Not sure if that description actually fits what the module does. If not,
please add a different description instead.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:58 -07:00
Blaise Boscaccy
082f1db02c security: Propagate caller information in bpf hooks
Certain bpf syscall subcommands are available for usage from both
userspace and the kernel. LSM modules or eBPF gatekeeper programs may
need to take a different course of action depending on whether or not
a BPF syscall originated from the kernel or userspace.

Additionally, some of the bpf_attr struct fields contain pointers to
arbitrary memory. Currently the functionality to determine whether or
not a pointer refers to kernel memory or userspace memory is exposed
to the bpf verifier, but that information is missing from various LSM
hooks.

Here we augment the LSM hooks to provide this data, by simply passing
a boolean flag indicating whether or not the call originated in the
kernel, in any hook that contains a bpf_attr struct that corresponds
to a subcommand that may be called from the kernel.

Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20250310221737.821889-2-bboscaccy@linux.microsoft.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:58 -07:00
Emil Tsalapatis
014eb5c2d6 bpf: fix missing kdoc string fields in cpumask.c
Some bpf_cpumask-related kfuncs have kdoc strings that are missing
return values. Add a the missing descriptions for the return values.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250309230427.26603-4-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:57 -07:00
Emil Tsalapatis
950ad93df2 bpf: add kfunc for populating cpumask bits
Add a helper kfunc that sets the bitmap of a bpf_cpumask from BPF memory.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250309230427.26603-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:57 -07:00
Eduard Zingerman
871ef8d50e bpf: correct use/def for may_goto instruction
may_goto instruction does not use any registers,
but in compute_insn_live_regs() it was treated as a regular
conditional jump of kind BPF_K with r0 as source register.
Thus unnecessarily marking r0 as used.

Fixes: 14c8552db6 ("bpf: simple DFA-based live registers analysis")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250305085436.2731464-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:30 -07:00
Eduard Zingerman
0fb3cf6110 bpf: use register liveness information for func_states_equal
Liveness analysis DFA computes a set of registers live before each
instruction. Leverage this information to skip comparison of dead
registers in func_states_equal(). This helps with convergance of
iterator processing loops, as bpf_reg_state->live marks can't be used
when loops are processed.

This has certain performance impact for selftests, here is a veristat
listing using `-f "insns_pct>5" -f "!insns<200"`

selftests:

File                  Program                        States (A)  States (B)  States  (DIFF)
--------------------  -----------------------------  ----------  ----------  --------------
arena_htab.bpf.o      arena_htab_llvm                        37          35     -2 (-5.41%)
arena_htab_asm.bpf.o  arena_htab_asm                         37          33    -4 (-10.81%)
arena_list.bpf.o      arena_list_add                         37          22   -15 (-40.54%)
dynptr_success.bpf.o  test_dynptr_copy                       22          16    -6 (-27.27%)
dynptr_success.bpf.o  test_dynptr_copy_xdp                   68          58   -10 (-14.71%)
iters.bpf.o           checkpoint_states_deletion            918          40  -878 (-95.64%)
iters.bpf.o           clean_live_states                     136          66   -70 (-51.47%)
iters.bpf.o           iter_nested_deeply_iters               43          37    -6 (-13.95%)
iters.bpf.o           iter_nested_iters                      72          62   -10 (-13.89%)
iters.bpf.o           iter_pass_iter_ptr_to_subprog          30          26    -4 (-13.33%)
iters.bpf.o           iter_subprog_iters                     68          59    -9 (-13.24%)
iters.bpf.o           loop_state_deps2                       35          32     -3 (-8.57%)
iters_css.bpf.o       iter_css_for_each                      32          29     -3 (-9.38%)
pyperf600_iter.bpf.o  on_event                              286         192   -94 (-32.87%)

Total progs: 3578
Old success: 2061
New success: 2061
States diff min:  -95.64%
States diff max:    0.00%
-100 .. -90  %: 1
 -55 .. -45  %: 3
 -45 .. -35  %: 2
 -35 .. -25  %: 5
 -20 .. -10  %: 12
 -10 .. 0    %: 6

sched_ext:

File               Program                 States (A)  States (B)  States   (DIFF)
-----------------  ----------------------  ----------  ----------  ---------------
bpf.bpf.o          lavd_dispatch                 8950        7065  -1885 (-21.06%)
bpf.bpf.o          lavd_init                      516         480     -36 (-6.98%)
bpf.bpf.o          layered_dispatch               662         501   -161 (-24.32%)
bpf.bpf.o          layered_dump                   298         237    -61 (-20.47%)
bpf.bpf.o          layered_init                   523         423   -100 (-19.12%)
bpf.bpf.o          layered_init_task               24          22      -2 (-8.33%)
bpf.bpf.o          layered_runnable               151         125    -26 (-17.22%)
bpf.bpf.o          p2dq_dispatch                   66          53    -13 (-19.70%)
bpf.bpf.o          p2dq_init                      170         142    -28 (-16.47%)
bpf.bpf.o          refresh_layer_cpumasks         120          78    -42 (-35.00%)
bpf.bpf.o          rustland_init                   37          34      -3 (-8.11%)
bpf.bpf.o          rustland_init                   37          34      -3 (-8.11%)
bpf.bpf.o          rusty_select_cpu               125         108    -17 (-13.60%)
scx_central.bpf.o  central_dispatch                59          43    -16 (-27.12%)
scx_central.bpf.o  central_init                    39          28    -11 (-28.21%)
scx_nest.bpf.o     nest_init                       58          51     -7 (-12.07%)
scx_pair.bpf.o     pair_dispatch                  142         111    -31 (-21.83%)
scx_qmap.bpf.o     qmap_dispatch                  174         141    -33 (-18.97%)
scx_qmap.bpf.o     qmap_init                      768         654   -114 (-14.84%)

Total progs: 216
Old success: 186
New success: 186
States diff min:  -35.00%
States diff max:    0.00%
 -35 .. -25  %: 3
 -25 .. -20  %: 6
 -20 .. -15  %: 6
 -15 .. -5   %: 7
  -5 .. 0    %: 6

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:29 -07:00
Eduard Zingerman
14c8552db6 bpf: simple DFA-based live registers analysis
Compute may-live registers before each instruction in the program.
The register is live before the instruction I if it is read by I or
some instruction S following I during program execution and is not
overwritten between I and S.

This information would be used in the next patch as a hint in
func_states_equal().

Use a simple algorithm described in [1] to compute this information:
- define the following:
  - I.use : a set of all registers read by instruction I;
  - I.def : a set of all registers written by instruction I;
  - I.in  : a set of all registers that may be alive before I execution;
  - I.out : a set of all registers that may be alive after I execution;
  - I.successors : a set of instructions S that might immediately
                   follow I for some program execution;
- associate separate empty sets 'I.in' and 'I.out' with each instruction;
- visit each instruction in a postorder and update corresponding
  'I.in' and 'I.out' sets as follows:

      I.out = U [S.in for S in I.successors]
      I.in  = (I.out / I.def) U I.use

  (where U stands for set union, / stands for set difference)
- repeat the computation while I.{in,out} changes for any instruction.

On implementation side keep things as simple, as possible:
- check_cfg() already marks instructions EXPLORED in post-order,
  modify it to save the index of each EXPLORED instruction in a vector;
- represent I.{in,out,use,def} as bitmasks;
- don't split the program into basic blocks and don't maintain the
  work queue, instead:
  - do fixed-point computation by visiting each instruction;
  - maintain a simple 'changed' flag if I.{in,out} for any instruction
    change;
  Measurements show that even such simplistic implementation does not
  add measurable verification time overhead (for selftests, at-least).

Note on check_cfg() ex_insn_beg/ex_done change:
To avoid out of bounds access to env->cfg.insn_postorder array,
it should be guaranteed that instruction transitions to EXPLORED state
only once. Previously this was not the fact for incorrect programs
with direct calls to exception callbacks.

The 'align' selftest needs adjustment to skip computed insn/live
registers printout. Otherwise it matches lines from the live registers
printout.

[1] https://en.wikipedia.org/wiki/Live-variable_analysis

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:29 -07:00
Eduard Zingerman
22f8397495 bpf: get_call_summary() utility function
Refactor mark_fastcall_pattern_for_call() to extract a utility
function get_call_summary(). For a helper or kfunc call this function
fills the following information: {num_params, is_void, fastcall}.

This function would be used in the next patch in order to get number
of parameters of a helper or kfunc call.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:29 -07:00
Eduard Zingerman
80ca3f1d77 bpf: jmp_offset() and verbose_insn() utility functions
Extract two utility functions:
- One BPF jump instruction uses .imm field to encode jump offset,
  while the rest use .off. Encapsulate this detail as jmp_offset()
  function.
- Avoid duplicating instruction printing callback definitions by
  defining a verbose_insn() function, which disassembles an
  instruction into the verifier log while hiding this detail.

These functions will be used in the next patch.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:29 -07:00
Peilin Ye
880442305a bpf: Introduce load-acquire and store-release instructions
Introduce BPF instructions with load-acquire and store-release
semantics, as discussed in [1].  Define 2 new flags:

  #define BPF_LOAD_ACQ    0x100
  #define BPF_STORE_REL   0x110

A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x100).

Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with
the 'imm' field set to BPF_STORE_REL (0x110).

Unlike existing atomic read-modify-write operations that only support
BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and
store-releases also support BPF_B (8-bit) and BPF_H (16-bit).  As an
exception, however, 64-bit load-acquires/store-releases are not
supported on 32-bit architectures (to fix a build error reported by the
kernel test robot).

An 8- or 16-bit load-acquire zero-extends the value before writing it to
a 32-bit register, just like ARM64 instruction LDARH and friends.

Similar to existing atomic read-modify-write operations, misaligned
load-acquires/store-releases are not allowed (even if
BPF_F_ANY_ALIGNMENT is set).

As an example, consider the following 64-bit load-acquire BPF
instruction (assuming little-endian):

  db 10 00 00 00 01 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

Similarly, a 16-bit BPF store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have
bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new
instructions, until the corresponding JIT compiler supports them in
arena.

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:28 -07:00
Kumar Kartikeya Dwivedi
e723608bf4 bpf: Add verifier support for timed may_goto
Implement support in the verifier for replacing may_goto implementation
from a counter-based approach to one which samples time on the local CPU
to have a bigger loop bound.

We implement it by maintaining 16-bytes per-stack frame, and using 8
bytes for maintaining the count for amortizing time sampling, and 8
bytes for the starting timestamp. To minimize overhead, we need to avoid
spilling and filling of registers around this sequence, so we push this
cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
the count to store into the stack through BPF_REG_AX. All caller-saved
registers (r0-r5) are guaranteed to remain untouched.

The loop can be broken by returning count as 0, otherwise we dispatch
into the function when the count drops to 0, and the runtime chooses to
refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
and aborting the loop on next iteration.

Since the check for 0 is done right after loading the count from the
stack, all subsequent cond_break sequences should immediately break as
well, of the same loop or subsequent loops in the program.

We pass in the stack_depth of the count (and thus the timestamp, by
adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
by adding the offset to r10/fp. This adjustment will be arch specific,
and the next patch will introduce support for x86.

Note that depending on loop complexity, time spent in the loop can be
more than the current limit (250 ms), but imposing an upper bound on
program runtime is an orthogonal problem which will be addressed when
program cancellations are supported.

The current time afforded by cond_break may not be enough for cases
where BPF programs want to implement locking algorithms inline, and use
cond_break as a promise to the verifier that they will eventually
terminate.

Below are some benchmarking numbers on the time taken per-iteration for
an empty loop that counts the number of iterations until cond_break
fires. For comparison, we compare it against bpf_for/bpf_repeat which is
another way to achieve the same number of spins (BPF_MAX_LOOPS).  The
hardware used for benchmarking was a Sapphire Rapids Intel server with
performance governor enabled, mitigations were enabled.

+-----------------------------+--------------+--------------+------------------+
| Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
+-----------------------------|--------------+--------------+------------------+
| may_goto                    | 8388608      |  3           |   0.36           |
| timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
| bpf_for                     | 8388608      |  10          |   1.19           |
+-----------------------------+--------------+--------------+------------------+

This gives a good approximation at low overhead while staying close to
the current implementation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250304003239.2390751-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:28 -07:00
Peilin Ye
a752ba4332 bpf: Factor out check_load_mem() and check_store_reg()
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic
in do_check() into helper functions to be used later.  While we are
here, make that comment about "reserved fields" more specific.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:26 -07:00
Peilin Ye
2626ffe9f3 bpf: Factor out check_atomic_rmw()
Currently, check_atomic() only handles atomic read-modify-write (RMW)
instructions.  Since we are planning to introduce other types of atomic
instructions (i.e., atomic load/store), extract the existing RMW
handling logic into its own function named check_atomic_rmw().

Remove the @insn_idx parameter as it is not really necessary.  Use
'env->insn_idx' instead, as in other places in verifier.c.

Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/6323ac8e73a10a1c8ee547c77ed68cf8eb6b90e1.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:26 -07:00
Peilin Ye
66faaea94e bpf: Factor out atomic_ptr_type_ok()
Factor out atomic_ptr_type_ok() as a helper function to be used later.

Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/e5ef8b3116f3fffce78117a14060ddce05eba52a.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:26 -07:00
Eric Dumazet
f8ac5a4e1a bpf: no longer acquire map_idr_lock in bpf_map_inc_not_zero()
bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero()
and is holding rcu_read_lock().

map_idr_lock does not add any protection, just remove the cost
for passive TCP flows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kui-Feng Lee <kuifeng@meta.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250301191315.1532629-1-edumazet@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:26 -07:00
Kumar Kartikeya Dwivedi
e2d8f560d1 bpf: Summarize sleepable global subprogs
The verifier currently does not permit global subprog calls when a lock
is held, preemption is disabled, or when IRQs are disabled. This is
because we don't know whether the global subprog calls sleepable
functions or not.

In case of locks, there's an additional reason: functions called by the
global subprog may hold additional locks etc. The verifier won't know
while verifying the global subprog whether it was called in context
where a spin lock is already held by the program.

Perform summarization of the sleepable nature of a global subprog just
like changes_pkt_data and then allow calls to global subprogs for
non-sleepable ones from atomic context.

While making this change, I noticed that RCU read sections had no
protection against sleepable global subprog calls, include it in the
checks and fix this while we're at it.

Care needs to be taken to not allow global subprog calls when regular
bpf_spin_lock is held. When resilient spin locks is held, we want to
potentially have this check relaxed, but not for now.

Also make sure extensions freplacing global functions cannot do so
in case the target is non-sleepable, but the extension is. The other
combination is ok.

Tests are included in the next patch to handle all special conditions.

Fixes: 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250301151846.1552362-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:25 -07:00
Yonghong Song
4b82b181a2 bpf: Allow pre-ordering for bpf cgroup progs
Currently for bpf progs in a cgroup hierarchy, the effective prog array
is computed from bottom cgroup to upper cgroups (post-ordering). For
example, the following cgroup hierarchy
    root cgroup: p1, p2
        subcgroup: p3, p4
have BPF_F_ALLOW_MULTI for both cgroup levels.
The effective cgroup array ordering looks like
    p3 p4 p1 p2
and at run time, progs will execute based on that order.

But in some cases, it is desirable to have root prog executes earlier than
children progs (pre-ordering). For example,
  - prog p1 intends to collect original pkt dest addresses.
  - prog p3 will modify original pkt dest addresses to a proxy address for
    security reason.
The end result is that prog p1 gets proxy address which is not what it
wants. Putting p1 to every child cgroup is not desirable either as it
will duplicate itself in many child cgroups. And this is exactly a use case
we are encountering in Meta.

To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag
is specified at attachment time, the prog has higher priority and the
ordering with that flag will be from top to bottom (pre-ordering).
For example, in the above example,
    root cgroup: p1, p2
        subcgroup: p3, p4
Let us say p2 and p4 are marked with BPF_F_PREORDER. The final
effective array ordering will be
    p2 p4 p3 p1

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250224230116.283071-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:25 -07:00
Mykyta Yatsenko
daec295a70 bpf/helpers: Introduce bpf_dynptr_copy kfunc
Introducing bpf_dynptr_copy kfunc allowing copying data from one dynptr to
another. This functionality is useful in scenarios such as capturing XDP
data to a ring buffer.
The implementation consists of 4 branches:
  * A fast branch for contiguous buffer capacity in both source and
destination dynptrs
  * 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy
data to/from non-contiguous buffer

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250226183201.332713-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:16 -07:00
Mykyta Yatsenko
09206af69c bpf/helpers: Refactor bpf_dynptr_read and bpf_dynptr_write
Refactor bpf_dynptr_read and bpf_dynptr_write helpers: extract code
into the static functions namely __bpf_dynptr_read and
__bpf_dynptr_write, this allows calling these without compiler warnings.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250226183201.332713-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:47:51 -07:00
Jakub Kicinski
8ef890df40 net: move misc netdev_lock flavors to a separate header
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).

The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08 09:06:50 -08:00
Eric Dumazet
0a5c8b2c8c bpf: fix a possible NULL deref in bpf_map_offload_map_alloc()
Call bpf_dev_offload_check() before netdev_lock_ops().

This is needed if attr->map_ifindex is not valid.

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000197: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000cb8-0x0000000000000cbf]
 RIP: 0010:netdev_need_ops_lock include/linux/netdevice.h:2792 [inline]
 RIP: 0010:netdev_lock_ops include/linux/netdevice.h:2803 [inline]
 RIP: 0010:bpf_map_offload_map_alloc+0x19a/0x910 kernel/bpf/offload.c:533
Call Trace:
 <TASK>
  map_create+0x946/0x11c0 kernel/bpf/syscall.c:1455
  __sys_bpf+0x6d3/0x820 kernel/bpf/syscall.c:5777
  __do_sys_bpf kernel/bpf/syscall.c:5902 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:5900 [inline]
  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5900
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83

Fixes: 97246d6d21 ("net: hold netdev instance lock during ndo_bpf")
Reported-by: syzbot+0c7bfd8cf3aecec92708@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/67caa2b1.050a0220.15b4b9.0077.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307074303.1497911-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07 19:09:39 -08:00
Stanislav Fomichev
97246d6d21 net: hold netdev instance lock during ndo_bpf
Cover the paths that come via bpf system call and XSK bind.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-10-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 12:59:44 -08:00
Brian Gerst
01c7bc5198 x86/smp: Move cpu number to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-5-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Ingo Molnar
cfdaa618de Merge branch 'x86/cpu' into x86/asm, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04 11:19:21 +01:00
Brian Gerst
18cdd90aba x86/bpf: Fix BPF percpu accesses
Due to this recent commit in the x86 tree:

  9d7de2aa8b ("Use relative percpu offsets")

percpu addresses went from positive offsets from the GSBASE to negative
kernel virtual addresses.  The BPF verifier has an optimization for
x86-64 that loads the address of cpu_number into a register, but was only
doing a 32-bit load which truncates negative addresses.

Change it to a 64-bit load so that the address is properly sign-extended.

Fixes: 9d7de2aa8b ("Use relative percpu offsets")
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250227195302.1667654-1-brgerst@gmail.com
2025-02-27 21:10:03 +01:00
NeilBrown
88d5baf690
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Jakub Kicinski
357660d759 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5).

Conflicts:

drivers/net/ethernet/cadence/macb_main.c
  fa52f15c74 ("net: cadence: macb: Synchronize stats calculations")
  75696dd0fd ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_sriov.c
  79990cf5e7 ("ice: Fix deinitializing VF in error path")
  a203163274 ("ice: simplify VF MSI-X managing")

net/ipv4/tcp.c
  18912c5206 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
  297d389e9e ("net: prefix devmem specific helpers")

net/mptcp/subflow.c
  8668860b0a ("mptcp: reset when MPTCP opts are dropped after join")
  c3349a22c2 ("mptcp: consolidate subflow cleanup")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 10:20:58 -08:00
Alexei Starovoitov
c9eb8102e2 bpf: Use try_alloc_pages() to allocate pages for bpf needs.
Use try_alloc_pages() and free_pages_nolock() for BPF needs
when context doesn't allow using normal alloc_pages.
This is a prerequisite for further work.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250222024427.30294-7-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-27 09:39:44 -08:00
Alexander Lobakin
ed16b8a4d1 bpf: cpumap: switch to napi_skb_cache_get_bulk()
Now that cpumap uses GRO, which drops unused skb heads to the NAPI
cache, use napi_skb_cache_get_bulk() to try to reuse cached entries
and lower MM layer pressure. Always disable the BH before checking and
running the cpumap-pinned XDP prog and don't re-enable it in between
that and allocating an skb bulk, as we can access the NAPI caches only
from the BH context.
The better GRO aggregates packets, the less new skbs will be allocated.
If an aggregated skb contains 16 frags, this means 15 skbs were returned
to the cache, so next 15 skbs will be built without allocating anything.

The same trafficgen UDP GRO test now shows:

                GRO off   GRO on
threaded GRO    2.3       4         Mpps
thr bulk GRO    2.4       4.7       Mpps

diff            +4        +17       %

Comparing to the baseline cpumap:

baseline        2.7       N/A       Mpps
thr bulk GRO    2.4       4.7       Mpps
diff            -11       +74       %

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 14:03:39 +01:00
Alexander Lobakin
57efe762cd bpf: cpumap: reuse skb array instead of a linked list to chain skbs
cpumap still uses linked lists to store a list of skbs to pass to the
stack. Now that we don't use listified Rx in favor of
napi_gro_receive(), linked list is now an unneeded overhead.
Inside the polling loop, we already have an array of skbs. Let's reuse
it for skbs passed to cpumap (generic XDP) and keep there in case of
XDP_PASS when a program is installed to the map itself. Don't list
regular xdp_frames after converting them to skbs as well; store them
in the mentioned array (but *before* generic skbs as the latters have
lower priority) and call gro_receive_skb() for each array element after
they're done.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 14:03:14 +01:00
Alexander Lobakin
4f8ab26a03 bpf: cpumap: switch to GRO from netif_receive_skb_list()
cpumap has its own BH context based on kthread. It has a sane batch
size of 8 frames per one cycle.
GRO can be used here on its own. Adjust cpumap calls to the upper stack
to use GRO API instead of netif_receive_skb_list() which processes skbs
by batches, but doesn't involve GRO layer at all.
In plenty of tests, GRO performs better than listed receiving even
given that it has to calculate full frame checksums on the CPU.
As GRO passes the skbs to the upper stack in the batches of
@gro_normal_batch, i.e. 8 by default, and skb->dev points to the
device where the frame comes from, it is enough to disable GRO
netdev feature on it to completely restore the original behaviour:
untouched frames will be being bulked and passed to the upper stack
by 8, as it was with netif_receive_skb_list().

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 14:03:14 +01:00
Amery Hung
4e4136c644 selftests/bpf: Test gen_pro/epilogue that generate kfuncs
Test gen_prologue and gen_epilogue that generate kfuncs that have not
been seen in the main program.

The main bpf program and return value checks are identical to
pro_epilogue.c introduced in commit 47e69431b5 ("selftests/bpf: Test
gen_prologue and gen_epilogue"). However, now when bpf_testmod_st_ops
detects a program name with prefix "test_kfunc_", it generates slightly
different prologue and epilogue: They still add 1000 to args->a in
prologue, add 10000 to args->a and set r0 to 2 * args->a in epilogue,
but involve kfuncs.

At high level, the alternative version of prologue and epilogue look
like this:

  cgrp = bpf_cgroup_from_id(0);
  if (cgrp)
          bpf_cgroup_release(cgrp);
  else
          /* Perform what original bpf_testmod_st_ops prologue or
           * epilogue does
           */

Since 0 is never a valid cgroup id, the original prologue or epilogue
logic will be performed. As a result, the __retval check should expect
the exact same return value.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250225233545.285481-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25 19:04:43 -08:00
Amery Hung
d519594ee2 bpf: Search and add kfuncs in struct_ops prologue and epilogue
Currently, add_kfunc_call() is only invoked once before the main
verification loop. Therefore, the verifier could not find the
bpf_kfunc_btf_tab of a new kfunc call which is not seen in user defined
struct_ops operators but introduced in gen_prologue or gen_epilogue
during do_misc_fixup(). Fix this by searching kfuncs in the patching
instruction buffer and add them to prog->aux->kfunc_tab.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250225233545.285481-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25 19:04:43 -08:00
Eduard Zingerman
f3c2d243a3 bpf: abort verification if env->cur_state->loop_entry != NULL
In addition to warning abort verification with -EFAULT.
If env->cur_state->loop_entry != NULL something is irrecoverably
buggy.

Fixes: bbbc02b744 ("bpf: copy_verifier_state() should copy 'loop_entry' field")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250225003838.135319-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25 19:02:46 -08:00
Yonghong Song
11ba7ce076 bpf: Fix kmemleak warning for percpu hashmap
Vlad Poenaru reported the following kmemleak issue:

  unreferenced object 0x606fd7c44ac8 (size 32):
    backtrace (crc 0):
      pcpu_alloc_noprof+0x730/0xeb0
      bpf_map_alloc_percpu+0x69/0xc0
      prealloc_init+0x9d/0x1b0
      htab_map_alloc+0x363/0x510
      map_create+0x215/0x3a0
      __sys_bpf+0x16b/0x3e0
      __x64_sys_bpf+0x18/0x20
      do_syscall_64+0x7b/0x150
      entry_SYSCALL_64_after_hwframe+0x4b/0x53

Further investigation shows the reason is due to not 8-byte aligned
store of percpu pointer in htab_elem_set_ptr():
  *(void __percpu **)(l->key + key_size) = pptr;

Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size
is 4, that means pptr is stored in a location which is 4 byte aligned but
not 8 byte aligned. In mm/kmemleak.c, scan_block() scans the memory based
on 8 byte stride, so it won't detect above pptr, hence reporting the memory
leak.

In htab_map_alloc(), we already have

        htab->elem_size = sizeof(struct htab_elem) +
                          round_up(htab->map.key_size, 8);
        if (percpu)
                htab->elem_size += sizeof(void *);
        else
                htab->elem_size += round_up(htab->map.value_size, 8);

So storing pptr with 8-byte alignment won't cause any problem and can fix
kmemleak too.

The issue can be reproduced with bpf selftest as well:
  1. Enable CONFIG_DEBUG_KMEMLEAK config
  2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c.
     The purpose is to keep map available so kmemleak can be detected.
  3. run './test_progs -t for_each/hash_map &' and a kmemleak should be reported.

Reported-by: Vlad Poenaru <thevlad@meta.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250224175514.2207227-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-24 12:11:00 -08:00
Amery Hung
201b62ccc8 bpf: Refactor check_ctx_access()
Reduce the variable passing madness surrounding check_ctx_access().
Currently, check_mem_access() passes many pointers to local variables to
check_ctx_access(). They are used to initialize "struct
bpf_insn_access_aux info" in check_ctx_access() and then passed to
is_valid_access(). Then, check_ctx_access() takes the data our from
info and write them back the pointers to pass them back. This can be
simpilified by moving info up to check_mem_access().

No functional change.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20250221175644.1822383-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-23 12:18:20 -08:00
Amery Hung
38f1e66abd bpf: Do not allow tail call in strcut_ops program with __ref argument
Reject struct_ops programs with refcounted kptr arguments (arguments
tagged with __ref suffix) that tail call. Once a refcounted kptr is
passed to a struct_ops program from the kernel, it can be freed or
xchged into maps. As there is no guarantee a callee can get the same
valid refcounted kptr in the ctx, we cannot allow such usage.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250220221532.1079331-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20 18:44:35 -08:00
Alexei Starovoitov
bd4319b6c2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf bpf-6.14-rc4
Cross-merge bpf fixes after downstream PR (bpf-6.14-rc4).

Minor conflict:
  kernel/bpf/btf.c
Adjacent changes:
  kernel/bpf/arena.c
  kernel/bpf/btf.c
  kernel/bpf/syscall.c
  kernel/bpf/verifier.c
  mm/memory.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20 18:13:57 -08:00
Linus Torvalds
319fc77f8f BPF fixes:
- Fix a soft-lockup in BPF arena_map_free on 64k page size
   kernels (Alan Maguire)
 
 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)
 
 - Fix a NULL-pointer dereference in trace_kfree_skb by adding
   kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima)
 
 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)
 
 - Fix a syzbot-reported deadlock when holding BPF map's
   freeze_mutex (Andrii Nakryiko)
 
 - Fix a use-after-free issue in bpf_test_init when
   eth_skb_pkt_type is accessing skb data not containing an
   Ethernet header (Shigeru Yoshida)
 
 - Fix skipping non-existing keys in generic_map_lookup_batch
   (Yan Zhai)
 
 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2)
   in user space (Jiayuan Chen)
 
 - Two fixes for BPF map lookup nullness elision (Daniel Xu)
 
 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ7evlRUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIPwHgD/dTvM00Ck4Q73fPivyT7tcqxeXJlD
 D6ggzWl/SG9LAbwA/2/cSgAM9Jm1g7ddvn/S9QaDYOs5GmFl6urq6krs+tYD
 =FCs9
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull BPF fixes from Daniel Borkmann:

 - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels
   (Alan Maguire)

 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)

 - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb
   to the raw_tp_null_args set (Kuniyuki Iwashima)

 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)

 - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex
   (Andrii Nakryiko)

 - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is
   accessing skb data not containing an Ethernet header (Shigeru
   Yoshida)

 - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai)

 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2) in user
   space (Jiayuan Chen)

 - Two fixes for BPF map lookup nullness elision (Daniel Xu)

 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests: bpf: test batch lookup on array of maps with holes
  bpf: skip non exist keys in generic_map_lookup_batch
  bpf: Handle allocation failure in acquire_lock_state
  bpf: verifier: Disambiguate get_constant_map_key() errors
  bpf: selftests: Test constant key extraction on irrelevant maps
  bpf: verifier: Do not extract constant map keys for irrelevant maps
  bpf: Fix softlockup in arena_map_free on 64k page kernel
  net: Add rx_skb of kfree_skb to raw_tp_null_args[].
  bpf: Fix deadlock when freeing cgroup storage
  selftests/bpf: Add strparser test for bpf
  selftests/bpf: Fix invalid flag of recv()
  bpf: Disable non stream socket for strparser
  bpf: Fix wrong copied_seq calculation
  strparser: Add read_sock callback
  bpf: avoid holding freeze_mutex during mmap operation
  bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
  selftests/bpf: Adjust data size to have ETH_HLEN
  bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
  bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
2025-02-20 15:37:17 -08:00
Jason Xing
5942246426 bpf: Support selective sampling for bpf timestamping
Add the bpf_sock_ops_enable_tx_tstamp kfunc to allow BPF programs to
selectively enable TX timestamping on a skb during tcp_sendmsg().

For example, BPF program will limit tracking X numbers of packets
and then will stop there instead of tracing all the sendmsgs of
matched flow all along. It would be helpful for users who cannot
afford to calculate latencies from every sendmsg call probably
due to the performance or storage space consideration.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-12-kerneljasonxing@gmail.com
2025-02-20 14:30:02 -08:00
Jordan Rome
f0f8a5b58f bpf: Add bpf_copy_from_user_task_str() kfunc
This new kfunc will be able to copy a zero-terminated C strings from
another task's address space. This is similar to `bpf_copy_from_user_str()`
but reads memory of specified task.

Signed-off-by: Jordan Rome <linux@jordanrome.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250213152125.1837400-2-linux@jordanrome.com
2025-02-19 17:00:06 -08:00
Eduard Zingerman
574078b001 bpf: fix env->peak_states computation
Compute env->peak_states as a maximum value of sum of
env->explored_states and env->free_list size.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
408fcf946b bpf: free verifier states when they are no longer referenced
When fixes from patches 1 and 3 are applied, Patrick Somaru reported
an increase in memory consumption for sched_ext iterator-based
programs hitting 1M instructions limit. For example, 2Gb VMs ran out
of memory while verifying a program. Similar behaviour could be
reproduced on current bpf-next master.

Here is an example of such program:

    /* verification completes if given 16G or RAM,
     * final env->free_list size is 369,960 entries.
     */
    SEC("raw_tp")
    __flag(BPF_F_TEST_STATE_FREQ)
    __success
    int free_list_bomb(const void *ctx)
    {
        volatile char buf[48] = {};
        unsigned i, j;

        j = 0;
        bpf_for(i, 0, 10) {
            /* this forks verifier state:
             * - verification of current path continues and
             *   creates a checkpoint after 'if';
             * - verification of forked path hits the
             *   checkpoint and marks it as loop_entry.
             */
            if (bpf_get_prandom_u32())
                asm volatile ("");
            /* this marks 'j' as precise, thus any checkpoint
             * created on current iteration would not be matched
             * on the next iteration.
             */
            buf[j++] = 42;
            j %= ARRAY_SIZE(buf);
        }
        asm volatile (""::"r"(buf));
        return 0;
    }

Memory consumption increased due to more states being marked as loop
entries and eventually added to env->free_list.

This commit introduces logic to free states from env->free_list during
verification. A state in env->free_list can be freed if:
- it has no child states;
- it is not used as a loop_entry.

This commit:
- updates bpf_verifier_state->used_as_loop_entry to be a counter
  that tracks how many states use this one as a loop entry;
- adds a function maybe_free_verifier_state(), which:
  - frees a state if its ->branches and ->used_as_loop_entry counters
    are both zero;
  - if the state is freed, state->loop_entry->used_as_loop_entry is
    decremented, and an attempt is made to free state->loop_entry.

In the example above, this approach reduces the maximum number of
states in the free list from 369,960 to 16,223.

However, this approach has its limitations. If the buf size in the
example above is modified to 64, state caching overflows: the state
for j=0 is evicted from the cache before it can be used to stop
traversal. As a result, states in the free list accumulate because
their branch counters do not reach zero.

The effect of this patch on the selftests looks as follows:

File                              Program                               Max free list (A)  Max free list (B)  Max free list (DIFF)
--------------------------------  ------------------------------------  -----------------  -----------------  --------------------
arena_list.bpf.o                  arena_list_add                                       17                  3         -14 (-82.35%)
bpf_iter_task_stack.bpf.o         dump_task_stack                                      39                  9         -30 (-76.92%)
iters.bpf.o                       checkpoint_states_deletion                          265                 89        -176 (-66.42%)
iters.bpf.o                       clean_live_states                                    19                  0        -19 (-100.00%)
profiler2.bpf.o                   tracepoint__syscalls__sys_enter_kill                102                  1        -101 (-99.02%)
profiler3.bpf.o                   tracepoint__syscalls__sys_enter_kill                144                  0       -144 (-100.00%)
pyperf600_iter.bpf.o              on_event                                             15                  0        -15 (-100.00%)
pyperf600_nounroll.bpf.o          on_event                                           1170               1158          -12 (-1.03%)
setget_sockopt.bpf.o              skops_sockopt                                        18                  0        -18 (-100.00%)
strobemeta_nounroll1.bpf.o        on_event                                            147                 83         -64 (-43.54%)
strobemeta_nounroll2.bpf.o        on_event                                            312                209        -103 (-33.01%)
strobemeta_subprogs.bpf.o         on_event                                            124                 86         -38 (-30.65%)
test_cls_redirect_subprogs.bpf.o  cls_redirect                                         15                  0        -15 (-100.00%)
timer.bpf.o                       test1                                                30                 15         -15 (-50.00%)

Measured using "do-not-submit" patches from here:
https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup

Reported-by: Patrick Somaru <patsomaru@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-10-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
5564ee3abb bpf: use list_head to track explored states and free list
The next patch in the set needs the ability to remove individual
states from env->free_list while only holding a pointer to the state.
Which requires env->free_list to be a doubly linked list.
This patch converts env->free_list and struct bpf_verifier_state_list
to use struct list_head for this purpose. The change to
env->explored_states is collateral.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
590eee4268 bpf: do not update state->loop_entry in get_loop_entry()
The patch 9 is simpler if less places modify loop_entry field.
The loop deleted by this patch does not affect correctness, but is a
performance optimization. However, measurements on selftests and
sched_ext programs show that this optimization is unnecessary:
- at most 2 steps are done in get_loop_entry();
- most of the time 0 or 1 steps are done in get_loop_entry().

Measured using "do-not-submit" patches from here:
https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
bb7abf3049 bpf: make state->dfs_depth < state->loop_entry->dfs_depth an invariant
For a generic loop detection algorithm a graph node can be a loop
header for itself. However, state loop entries are computed for use in
is_state_visited(), where get_loop_entry(state)->branches is checked.
is_state_visited() also checks state->branches, thus the case when
state == state->loop_entry is not interesting for is_state_visited().

This change does not affect correctness, but simplifies
get_loop_entry() a bit and also simplifies change to
update_loop_entry() in patch 9.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
c1ce66357f bpf: detect infinite loop in get_loop_entry()
Tejun Heo reported an infinite loop in get_loop_entry(),
when verifying a sched_ext program layered_dispatch in [1].
After some investigation I'm sure that root cause is fixed by patches
1,3 in this patch-set.

To err on the safe side, this commit modifies get_loop_entry() to
detect infinite loops and abort verification in such cases.
The number of steps get_loop_entry(S) can make while moving along the
bpf_verifier_state->loop_entry chain is bounded by the DFS depth of
state S. This fact is exploited to implement the check.

To avoid dealing with the potential error code returned from
get_loop_entry() in update_loop_entry(), remove the get_loop_entry()
calls there:
- This change does not affect correctness. Loop entries would still be
  updated during the backward DFS move in update_branch_counts().
- This change does not affect performance. Measurements show that
  get_loop_entry() performs at most 1 step on selftests and at most 2
  steps on sched_ext programs (1 step in 17 cases, 2 steps in 3
  cases, measured using "do-not-submit" patches from [2]).

[1] https://github.com/sched-ext/scx/
    commit f0b27038ea10 ("XXX - kernel stall")
[2] https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
9e63fdb0cb bpf: don't do clean_live_states when state->loop_entry->branches > 0
verifier.c:is_state_visited() uses RANGE_WITHIN states comparison rules
for cached states that have loop_entry with non-zero branches count
(meaning that loop_entry's verification is not yet done).

The RANGE_WITHIN rules in regsafe()/stacksafe() require register and
stack objects types to be identical in current and old states.

verifier.c:clean_live_states() replaces registers and stack spills
with NOT_INIT/STACK_INVALID marks, if these registers/stack spills are
not read in any child state. This means that clean_live_states() works
against loop convergence logic under some conditions. See selftest in
the next patch for a specific example.

Mitigate this by prohibiting clean_verifier_state() when
state->loop_entry->branches > 0.

This undoes negative verification performance impact of the
copy_verifier_state() fix from the previous patch.
Below is comparison between master and current patch.

selftests:

File                                Program                       Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States  (DIFF)
----------------------------------  ----------------------------  ---------  ---------  ---------------  ----------  ----------  --------------
arena_htab.bpf.o                    arena_htab_llvm                     717        423   -294 (-41.00%)          57          37   -20 (-35.09%)
arena_htab_asm.bpf.o                arena_htab_asm                      597        445   -152 (-25.46%)          47          37   -10 (-21.28%)
arena_list.bpf.o                    arena_list_add                     1493       1822   +329 (+22.04%)          30          37    +7 (+23.33%)
arena_list.bpf.o                    arena_list_del                      309        261    -48 (-15.53%)          23          15    -8 (-34.78%)
iters.bpf.o                         checkpoint_states_deletion        18125      22154  +4029 (+22.23%)         818         918  +100 (+12.22%)
iters.bpf.o                         iter_nested_deeply_iters            593        367   -226 (-38.11%)          67          43   -24 (-35.82%)
iters.bpf.o                         iter_nested_iters                   813        772     -41 (-5.04%)          79          72     -7 (-8.86%)
iters.bpf.o                         iter_subprog_check_stacksafe        155        135    -20 (-12.90%)          15          14     -1 (-6.67%)
iters.bpf.o                         iter_subprog_iters                 1094        808   -286 (-26.14%)          88          68   -20 (-22.73%)
iters.bpf.o                         loop_state_deps2                    479        356   -123 (-25.68%)          46          35   -11 (-23.91%)
iters.bpf.o                         triple_continue                      35         31     -4 (-11.43%)           3           3     +0 (+0.00%)
kmem_cache_iter.bpf.o               open_coded_iter                      63         59      -4 (-6.35%)           7           6    -1 (-14.29%)
mptcp_subflow.bpf.o                 _getsockopt_subflow                 501        446    -55 (-10.98%)          25          23     -2 (-8.00%)
pyperf600_iter.bpf.o                on_event                          12339       6379  -5960 (-48.30%)         441         286  -155 (-35.15%)
verifier_bits_iter.bpf.o            max_words                            92         84      -8 (-8.70%)           8           7    -1 (-12.50%)
verifier_iterating_callbacks.bpf.o  cond_break2                         113        192    +79 (+69.91%)          12          21    +9 (+75.00%)

sched_ext:

File               Program                 Insns (A)  Insns (B)  Insns      (DIFF)  States (A)  States (B)  States    (DIFF)
-----------------  ----------------------  ---------  ---------  -----------------  ----------  ----------  ----------------
bpf.bpf.o          layered_dispatch            11485       9039    -2446 (-21.30%)         848         662    -186 (-21.93%)
bpf.bpf.o          layered_dump                 7422       5022    -2400 (-32.34%)         681         298    -383 (-56.24%)
bpf.bpf.o          layered_enqueue             16854      13753    -3101 (-18.40%)        1611        1308    -303 (-18.81%)
bpf.bpf.o          layered_init              1000001       5549  -994452 (-99.45%)       84672         523  -84149 (-99.38%)
bpf.bpf.o          layered_runnable             3149       1899    -1250 (-39.70%)         288         151    -137 (-47.57%)
bpf.bpf.o          p2dq_init                    2343       1936     -407 (-17.37%)         201         170     -31 (-15.42%)
bpf.bpf.o          refresh_layer_cpumasks      16487       1285   -15202 (-92.21%)        1770         120   -1650 (-93.22%)
bpf.bpf.o          rusty_select_cpu             1937       1386     -551 (-28.45%)         177         125     -52 (-29.38%)
scx_central.bpf.o  central_dispatch              636        600       -36 (-5.66%)          63          59       -4 (-6.35%)
scx_central.bpf.o  central_init                  913        632     -281 (-30.78%)          48          39      -9 (-18.75%)
scx_nest.bpf.o     nest_init                     636        601       -35 (-5.50%)          60          58       -2 (-3.33%)
scx_pair.bpf.o     pair_dispatch             1000001       1914  -998087 (-99.81%)       58169         142  -58027 (-99.76%)
scx_qmap.bpf.o     qmap_dispatch                2393       2187      -206 (-8.61%)         196         174     -22 (-11.22%)
scx_qmap.bpf.o     qmap_init                   16367      22777    +6410 (+39.16%)         603         768    +165 (+27.36%)

'layered_init' and 'pair_dispatch' hit 1M on master, but are verified
ok with this patch.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:59 -08:00
Eduard Zingerman
bbbc02b744 bpf: copy_verifier_state() should copy 'loop_entry' field
The bpf_verifier_state.loop_entry state should be copied by
copy_verifier_state(). Otherwise, .loop_entry values from unrelated
states would poison env->cur_state.

Additionally, env->stack should not contain any states with
.loop_entry != NULL. The states in env->stack are yet to be verified,
while .loop_entry is set for states that reached an equivalent state.
This means that env->cur_state->loop_entry should always be NULL after
pop_stack().

See the selftest in the next commit for an example of the program that
is not safe yet is accepted by verifier w/o this fix.

This change has some verification performance impact for selftests:

File                                Program                       Insns (A)  Insns (B)  Insns   (DIFF)  States (A)  States (B)  States (DIFF)
----------------------------------  ----------------------------  ---------  ---------  --------------  ----------  ----------  -------------
arena_htab.bpf.o                    arena_htab_llvm                     717        426  -291 (-40.59%)          57          37  -20 (-35.09%)
arena_htab_asm.bpf.o                arena_htab_asm                      597        445  -152 (-25.46%)          47          37  -10 (-21.28%)
arena_list.bpf.o                    arena_list_del                      309        279    -30 (-9.71%)          23          14   -9 (-39.13%)
iters.bpf.o                         iter_subprog_check_stacksafe        155        141    -14 (-9.03%)          15          14    -1 (-6.67%)
iters.bpf.o                         iter_subprog_iters                 1094       1003    -91 (-8.32%)          88          83    -5 (-5.68%)
iters.bpf.o                         loop_state_deps2                    479        725  +246 (+51.36%)          46          63  +17 (+36.96%)
kmem_cache_iter.bpf.o               open_coded_iter                      63         59     -4 (-6.35%)           7           6   -1 (-14.29%)
verifier_bits_iter.bpf.o            max_words                            92         84     -8 (-8.70%)           8           7   -1 (-12.50%)
verifier_iterating_callbacks.bpf.o  cond_break2                         113        107     -6 (-5.31%)          12          12    +0 (+0.00%)

And significant negative impact for sched_ext:

File               Program                 Insns (A)  Insns (B)  Insns         (DIFF)  States (A)  States (B)  States      (DIFF)
-----------------  ----------------------  ---------  ---------  --------------------  ----------  ----------  ------------------
bpf.bpf.o          lavd_init                    7039      14723      +7684 (+109.16%)         490        1139     +649 (+132.45%)
bpf.bpf.o          layered_dispatch            11485      10548         -937 (-8.16%)         848         762       -86 (-10.14%)
bpf.bpf.o          layered_dump                 7422    1000001  +992579 (+13373.47%)         681       31178  +30497 (+4478.27%)
bpf.bpf.o          layered_enqueue             16854      71127     +54273 (+322.02%)        1611        6450    +4839 (+300.37%)
bpf.bpf.o          p2dq_dispatch                 665        791        +126 (+18.95%)          68          78       +10 (+14.71%)
bpf.bpf.o          p2dq_init                    2343       2980        +637 (+27.19%)         201         237       +36 (+17.91%)
bpf.bpf.o          refresh_layer_cpumasks      16487     674760   +658273 (+3992.68%)        1770       65370  +63600 (+3593.22%)
bpf.bpf.o          rusty_select_cpu             1937      40872    +38935 (+2010.07%)         177        3210   +3033 (+1713.56%)
scx_central.bpf.o  central_dispatch              636       2687      +2051 (+322.48%)          63         227     +164 (+260.32%)
scx_nest.bpf.o     nest_init                     636        815        +179 (+28.14%)          60          73       +13 (+21.67%)
scx_qmap.bpf.o     qmap_dispatch                2393       3580       +1187 (+49.60%)         196         253       +57 (+29.08%)
scx_qmap.bpf.o     qmap_dump                     233        318         +85 (+36.48%)          22          30        +8 (+36.36%)
scx_qmap.bpf.o     qmap_init                   16367      17436        +1069 (+6.53%)         603         669       +66 (+10.95%)

Note 'layered_dump' program, which now hits 1M instructions limit.
This impact would be mitigated in the next patch.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 19:22:58 -08:00
Yan Zhai
5644c6b50f bpf: skip non exist keys in generic_map_lookup_batch
The generic_map_lookup_batch currently returns EINTR if it fails with
ENOENT and retries several times on bpf_map_copy_value. The next batch
would start from the same location, presuming it's a transient issue.
This is incorrect if a map can actually have "holes", i.e.
"get_next_key" can return a key that does not point to a valid value. At
least the array of maps type may contain such holes legitly. Right now
these holes show up, generic batch lookup cannot proceed any more. It
will always fail with EINTR errors.

Rather, do not retry in generic_map_lookup_batch. If it finds a non
existing element, skip to the next key. This simple solution comes with
a price that transient errors may not be recovered, and the iteration
might cycle back to the first key under parallel deletion. For example,
Hou Tao <houtao@huaweicloud.com> pointed out a following scenario:

For LPM trie map:
(1) ->map_get_next_key(map, prev_key, key) returns a valid key

(2) bpf_map_copy_value() return -ENOMENT
It means the key must be deleted concurrently.

(3) goto next_key
It swaps the prev_key and key

(4) ->map_get_next_key(map, prev_key, key) again
prev_key points to a non-existing key, for LPM trie it will treat just
like prev_key=NULL case, the returned key will be duplicated.

With the retry logic, the iteration can continue to the key next to the
deleted one. But if we directly skip to the next key, the iteration loop
would restart from the first key for the lpm_trie type.

However, not all races may be recovered. For example, if current key is
deleted after instead of before bpf_map_copy_value, or if the prev_key
also gets deleted, then the loop will still restart from the first key
for lpm_tire anyway. For generic lookup it might be better to stay
simple, i.e. just skip to the next key. To guarantee that the output
keys are not duplicated, it is better to implement map type specific
batch operations, which can properly lock the trie and synchronize with
concurrent mutators.

Fixes: cb4d03ab49 ("bpf: Add generic support for lookup batch op")
Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/85618439eea75930630685c467ccefeac0942e2b.1739171594.git.yan@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 17:27:37 -08:00
Nam Cao
deacdc871b bpf: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Patch was created by using Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/e4be2486f02a8e0ef5aa42624f1708d23e88ad57.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:33 +01:00
Amery Hung
8d9f547f74 bpf: Allow struct_ops prog to return referenced kptr
Allow a struct_ops program to return a referenced kptr if the struct_ops
operator's return type is a struct pointer. To make sure the returned
pointer continues to be valid in the kernel, several constraints are
required:

1) The type of the pointer must matches the return type
2) The pointer originally comes from the kernel (not locally allocated)
3) The pointer is in its unmodified form

Implementation wise, a referenced kptr first needs to be allowed to _leak_
in check_reference_leak() if it is in the return register. Then, in
check_return_code(), constraints 1-3 are checked. During struct_ops
registration, a check is also added to warn about operators with
non-struct pointer return.

In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
pointer to be returned when there is no skb to be dequeued, we will allow
a scalar value with value equals to NULL to be returned.

In the future when there is a struct_ops user that always expects a valid
pointer to be returned from an operator, we may extend tagging to the
return value. We can tell the verifier to only allow NULL pointer return
if the return value is tagged with MAY_BE_NULL.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250217190640.1748177-5-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-17 18:47:27 -08:00
Amery Hung
a687df2008 bpf: Support getting referenced kptr from struct_ops argument
Allows struct_ops programs to acqurie referenced kptrs from arguments
by directly reading the argument.

The verifier will acquire a reference for struct_ops a argument tagged
with "__ref" in the stub function in the beginning of the main program.
The user will be able to access the referenced kptr directly by reading
the context as long as it has not been released by the program.

This new mechanism to acquire referenced kptr (compared to the existing
"kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
first argument. This mechanism provides a natural way for users to get a
referenced kptr in the .enqueue struct_ops programs and makes sure that a
qdisc will always enqueue or drop the skb.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250217190640.1748177-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-17 18:47:27 -08:00
Amery Hung
432051806f bpf: Make every prog keep a copy of ctx_arg_info
Currently, ctx_arg_info is read-only in the view of the verifier since
it is shared among programs of the same attach type. Make each program
have their own copy of ctx_arg_info so that we can use it to store
program specific information.

In the next patch where we support acquiring a referenced kptr through a
struct_ops argument tagged with "__ref", ctx_arg_info->ref_obj_id will
be used to store the unique reference object id of the argument. This
avoids creating a requirement in the verifier that "__ref" tagged
arguments must be the first set of references acquired [0].

[0] https://lore.kernel.org/bpf/20241220195619.2022866-2-amery.hung@gmail.com/

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250217190640.1748177-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-17 18:47:27 -08:00
Jiayuan Chen
6ebc5030e0 bpf: Fix array bounds error with may_goto
may_goto uses an additional 8 bytes on the stack, which causes the
interpreters[] array to go out of bounds when calculating index by
stack_size.

1. If a BPF program is rewritten, re-evaluate the stack size. For non-JIT
cases, reject loading directly.

2. For non-JIT cases, calculating interpreters[idx] may still cause
out-of-bounds array access, and just warn about it.

3. For jit_requested cases, the execution of bpf_func also needs to be
warned. So move the definition of function __bpf_prog_ret0_warn out of
the macro definition CONFIG_BPF_JIT_ALWAYS_ON.

Reported-by: syzbot+d2a2c639d03ac200a4f1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000f823606139faa5d@google.com/
Fixes: 011832b97b ("bpf: Introduce may_goto instruction")
Signed-off-by: Jiayuan Chen <mrpre@163.com>
Link: https://lore.kernel.org/r/20250214091823.46042-2-mrpre@163.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-14 19:55:15 -08:00
Song Liu
5646729279 bpf: fs/xattr: Add BPF kfuncs to set and remove xattrs
Add the following kfuncs to set and remove xattrs from BPF programs:

  bpf_set_dentry_xattr
  bpf_remove_dentry_xattr
  bpf_set_dentry_xattr_locked
  bpf_remove_dentry_xattr_locked

The _locked version of these kfuncs are called from hooks where
dentry->d_inode is already locked. Instead of requiring the user
to know which version of the kfuncs to use, the verifier will pick
the proper kfunc based on the calling hook.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20250130213549.3353349-5-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13 19:35:32 -08:00
Song Liu
7587d735b1 bpf: lsm: Add two more sleepable hooks
Add bpf_lsm_inode_removexattr and bpf_lsm_inode_post_removexattr to list
sleepable_lsm_hooks. These two hooks are always called from sleepable
context.

Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20250130213549.3353349-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13 19:35:31 -08:00
Jiri Olsa
c83e2d970b bpf: Add tracepoints with null-able arguments
Some of the tracepoints slipped when we did the first scan, adding them now.

Fixes: 838a10bd2e ("bpf: Augment raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250210175913.2893549-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13 17:01:36 -08:00
Ihor Solodrai
ea145d530a bpf: define KF_ARENA_* flags for bpf_arena kfuncs
bpf_arena_alloc_pages() and bpf_arena_free_pages() work with the
bpf_arena pointers [1], which is indicated by the __arena macro in the
kernel source code:

    #define __arena __attribute__((address_space(1)))

However currently this information is absent from the debug data in
the vmlinux binary. As a consequence, bpf_arena_* kfuncs declarations
in vmlinux.h (produced by bpftool) do not match prototypes expected by
the BPF programs attempting to use these functions.

Introduce a set of kfunc flags to mark relevant types as bpf_arena
pointers. The flags then can be detected by pahole when generating BTF
from vmlinux's DWARF, allowing it to emit corresponding BTF type tags
for the marked kfuncs.

With recently proposed BTF extension [2], these type tags will be
processed by bpftool when dumping vmlinux.h, and corresponding
compiler attributes will be added to the declarations.

[1] https://lwn.net/Articles/961594/
[2] https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solodrai@linux.dev/

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20250206003148.2308659-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 18:22:52 -08:00
Kumar Kartikeya Dwivedi
8784714d7f bpf: Handle allocation failure in acquire_lock_state
The acquire_lock_state function needs to handle possible NULL values
returned by acquire_reference_state, and return -ENOMEM.

Fixes: 769b0f1c82 ("bpf: Refactor {acquire,release}_reference_state")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250206105435.2159977-24-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 18:17:07 -08:00
Daniel Xu
7968c65815 bpf: verifier: Disambiguate get_constant_map_key() errors
Refactor get_constant_map_key() to disambiguate the constant key
value from potential error values. In the case that the key is
negative, it could be confused for an error.

It's not currently an issue, as the verifier seems to track s32 spills
as u32. So even if the program wrongly uses a negative value for an
arraymap key, the verifier just thinks it's an impossibly high value
which gets correctly discarded.

Refactor anyways to make things cleaner and prevent potential future
issues.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/dfe144259ae7cfc98aa63e1b388a14869a10632a.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 15:45:44 -08:00
Daniel Xu
884c3a18da bpf: verifier: Do not extract constant map keys for irrelevant maps
Previously, we were trying to extract constant map keys for all
bpf_map_lookup_elem(), regardless of map type. This is an issue if the
map has a u64 key and the value is very high, as it can be interpreted
as a negative signed value. This in turn is treated as an error value by
check_func_arg() which causes a valid program to be incorrectly
rejected.

Fix by only extracting constant map keys for relevant maps. This fix
works because nullness elision is only allowed for {PERCPU_}ARRAY maps,
and keys for these are within u32 range. See next commit for an example
via selftest.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/aa868b642b026ff87ba6105ea151bc8693b35932.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 15:45:43 -08:00
Alan Maguire
517e8a7835 bpf: Fix softlockup in arena_map_free on 64k page kernel
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y,
arena_htab tests cause a segmentation fault and soft lockup.
The same failure is not observed with 4k pages on aarch64.

It turns out arena_map_free() is calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start().  If this address is not page-aligned
the code ends up calling apply_to_pte_range() with that unaligned
address causing soft lockup.

Fix it by round up GUARD_SZ to PAGE_SIZE << 1 so that the
division by 2 in bpf_arena_get_kern_vm_start() returns
a page-aligned value.

Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20250205170059.427458-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-06 03:41:08 -08:00