Commit graph

588 commits

Author SHA1 Message Date
Linus Torvalds
eff5f16bfd for-6.15/io_uring-reg-vec-20250327
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmflYcAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmvJD/4tKQlr0yRhln/JzPiONS41mUAuNRI4MdqJ
 ykpQkMx3NcQANbNyOxI0PV5I7y1Jdlg/UP9gy11BrIaBk4Kqoluc6iAzgr5q9pBC
 8pXhPIe80R/q/LOKEz9n5gqOMPNyUtd7IaBayJPBJre/YZXQu+49IL2Uyy3hss8d
 neqAbWErd2FoUfTY14XB3ImLM6a76Z6CjE3pJYvVDM5uRBuH0IGqehJJuNpsViBf
 M9XAW/HZt8ISsVt1tJbCQVWx4b63L/omHI8u5K2M0isTPV+QPk1O2Vgkn7dBrDeT
 JvThWrM1uE++DYGcQ3DXHfb3gBIFEjTrNb2nddstyEU2ZaEXUkuOV2O0b7WPuphj
 zp0oFaLl/ivHT8NoJzzZzK24zt99Qz43GWUaFCQeR0R8oTix/M1q0unguER45Iv7
 Po/b3h6+RAi+87KOlM5WWo05ScswS8AwcSUsP5xMR5BjjD+GQYO5PmVVyo8w0rid
 8F9U9DpN2CTA5YVjI+ax1cxWMOfmAXPK5ONjzZpyJoWb0THgj97esEwc2un7SBi7
 TJJz7Gc9/xOqfRKaPDoH9t8+b6ruWHMqCYDw6exSAUKeDxQ+7z0zNMudHkuR5VrX
 x+Taaj95ONLVNZYz0mbFcvmJC0UBOqkE94omXk7TU2Cn7SBzAW//XDep6CPpX/sa
 LcmOK4UXdg==
 =vOm1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Final separate updates for io_uring.

  This started out as a series of cleanups and improvements for
  registered buffers, but as the last series of the io_uring changes
  for 6.15, it also collected a few fixes for the other branches on top:

   - Add support for vectored fixed/registered buffers.

     Previously, only single segments were supported for commands; now
     vectored variants are supported as well. This series includes
     networking and file read/write support.

   - Small series unifying return codes across multi and single shot.

   - Small series cleaning up registered buffer importing.

   - Adding support for vectored registered buffers for uring_cmd.

   - Fix for io-wq handling of command reissue.

   - Various little fixes and tweaks"

* tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux: (25 commits)
  io_uring/net: fix io_req_post_cqe abuse by send bundle
  io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc
  io_uring: move min_events sanitisation
  io_uring: rename "min" arg in io_iopoll_check()
  io_uring: open code __io_post_aux_cqe()
  io_uring: defer iowq cqe overflow via task_work
  io_uring: fix retry handling off iowq
  io_uring/net: only import send_zc buffer once
  io_uring/cmd: introduce io_uring_cmd_import_fixed_vec
  io_uring/cmd: add iovec cache for commands
  io_uring/cmd: don't expose entire cmd async data
  io_uring: rename the data cmd cache
  io_uring: rely on io_prep_reg_vec for iovec placement
  io_uring: introduce io_prep_reg_iovec()
  io_uring: unify STOP_MULTISHOT with IOU_OK
  io_uring: return -EAGAIN to continue multishot
  io_uring: cap cached iovec/bvec size
  io_uring/net: implement vectored reg bufs for zctx
  io_uring/net: convert to struct iou_vec
  io_uring/net: pull vec alloc out of msghdr import
  ...
2025-03-28 15:07:04 -07:00
Linus Torvalds
ca0b04ba0b for-6.15/io_uring-rx-zc-20250325
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfjTP8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm6oEACnpGL52FAKTVj14GDqFo6Pq0Jmnh07x8qj
 mpHFPwxfWAzRiuNyji2iS9ecS2cnlkixNyMWZipXRi4KJAUjJH6YDd7IofUI3Glf
 6v7b6srFSvsWJIJ8LdkJHLHAJuzYnJvFZ8apwgQczEDqgHq7BAunM1sVQ+mydjYk
 EXT4kN6DSBOPzwr9GAay52f8nXhbqdHfT+YTGHPHg+QToojL6gD7vvW57w/QqD/x
 91hJef1z01cSIsDZOxA0EUeD+9bBAHpoamr/e3IOOCVYCN6hy0dGa9g0QGbbpVyE
 AeU4FGZLV9J8OOfvHVraDt5Wn3IXxYaMu22dSn1S6tVinwjXhJR2LAA+t4fGHAkt
 i38LjOsIbopSQn/cNhzwC8UZcHLqnVsdDolHlHzsVFVfcpck2/4JFpUeP8QhWgrk
 f9tY12QUf/oEaWm0/sUCHZNFxpIGeFA5FFXf0Z92clnzBuiuWoesBNvxqY/2DeZn
 IDNXiv+Trxr6kFEjTpzPeuxbWrn4PJ7afQSAFcEmOCguk5riM+zJZNIKg0TxUHSS
 tt6sfxmTP1DhgDKad5kT3MLyzOcx47Kbjf4dj6KmRnD+3DGwwN2F7X7R1GJylPSp
 RLOzJ+Ouuy9UmBN6JMsT4BmR9+FJTVirADU926d/ZqCTtRV8Tnq/6HHmKmmr4CR0
 THJ0PJqQjg==
 =MOve
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux

Pull io_uring zero-copy receive support from Jens Axboe:
 "This adds support for zero-copy receive with io_uring, enabling fast
  bulk receive of data directly into application memory, rather than
  needing to copy the data out of kernel memory.

  While this version only supports host memory as that was the initial
  target, other memory types are planned as well, with notably GPU
  memory coming next.

  This work depends on some networking components which were queued up
  on the networking side, but have now landed in your tree.

  This is the work of Pavel Begunkov and David Wei. From the v14 posting:

    'We configure a page pool that a driver uses to fill a hw rx queue
     to hand out user pages instead of kernel pages. Any data that ends
     up hitting this hw rx queue will thus be dma'd into userspace
     memory directly, without needing to be bounced through kernel
     memory. 'Reading' data out of a socket instead becomes a
     _notification_ mechanism, where the kernel tells userspace where
     the data is. The overall approach is similar to the devmem TCP
     proposal.

     This relies on hw header/data split, flow steering and RSS to
     ensure packet headers remain in kernel memory and only desired
     flows hit a hw rx queue configured for zero copy. Configuring this
     is outside of the scope of this patchset.

     We share netdev core infra with devmem TCP. The main difference is
     that io_uring is used for the uAPI and the lifetime of all objects
     is bound to an io_uring instance. Data is 'read' using a new
     io_uring request type. When done, data is returned via a new shared
     refill queue. A zero copy page pool refills a hw rx queue from this
     refill queue directly. Of course, the lifetime of these data
     buffers is managed by io_uring rather than the networking stack,
     with different refcounting rules.

     This patchset is the first step adding basic zero copy support. We
     will extend this iteratively with new features e.g. dynamically
     allocated zero copy areas, THP support, dmabuf support, improved
     copy fallback, general optimisations and more'

  In a local setup, I was able to saturate a 200G link with a single CPU
  core, and at netdev conf 0x19 earlier this month, Jamal reported
  188Gbit of bandwidth using a single core (no HT, including soft-irq).

  Safe to say the efficiency is there, as bigger links would be needed
  to find the per-core limit, and it's considerably more efficient and
  faster than the existing devmem solution"

* tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux:
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
2025-03-28 13:45:52 -07:00
Pavel Begunkov
6889ae1b4d io_uring/net: fix io_req_post_cqe abuse by send bundle
[  114.987980][ T5313] WARNING: CPU: 6 PID: 5313 at io_uring/io_uring.c:872 io_req_post_cqe+0x12e/0x4f0
[  114.991597][ T5313] RIP: 0010:io_req_post_cqe+0x12e/0x4f0
[  115.001880][ T5313] Call Trace:
[  115.002222][ T5313]  <TASK>
[  115.007813][ T5313]  io_send+0x4fe/0x10f0
[  115.009317][ T5313]  io_issue_sqe+0x1a6/0x1740
[  115.012094][ T5313]  io_wq_submit_work+0x38b/0xed0
[  115.013223][ T5313]  io_worker_handle_work+0x62a/0x1600
[  115.013876][ T5313]  io_wq_worker+0x34f/0xdf0

As the comment states, io_req_post_cqe() should only be used by
multishot requests, i.e. REQ_F_APOLL_MULTISHOT, which bundled sends are
not. Add a flag signifying whether a request wants to post multiple
CQEs. Eventually REQ_F_APOLL_MULTISHOT should imply the new flag, but
that's left out for simplicity.

Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b611dbb54d1cd47a88681f5d38c84d0c02bc563.1743067183.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-27 05:48:32 -06:00
Linus Torvalds
91928e0d3c for-6.15/io_uring-20250322
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8DcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprzNEADPOOfi05F0f4KOMLYOiRlB+D9yGtnbcu8k
 foPATU++L5fFDUpjAGobi38AXqHtF/Bx0CjCkPCvy4GjygcszYDY81FQmRlSFdeH
 CnYV/bm5myyupc8xANcWGJYpxf/ZAw7ugOAgcR3Xxk/qXdD5hYJ1jG4477o8v4IJ
 jyzE4vQCnWBogIoqpWTQyY3FSP5fdjeQlYdFGIFCx68rFTO9zEpuHhFr2OMfJtSu
 y1dLkeqbu1BZ0fhgRog9k9ZWk0uRQJI7d7EUXf9iwXOfkUyzCaV+nHvNGRdhJ/XO
 zj/qePw4lFyCgc+kuIejjIaPf/g84paCzshz/PIxOZTsvo6X78zj72RPM0Makhga
 amgtEqegWHtKSya1zew57oDwqwaIe6b+56PtNSP7K8IrwcJLheszoPKJIwiiynIL
 ZqvWseqYQXs58S9qZzSZlyEpwLVFcvyNt7Z/q7b56vhvXHabxOccPkUHt+UscVoQ
 qpG0R9+zmIiDfzY6Vfu7JyH2Uspdq3WjGYiaoVUko1rJXY+HFPd3be9pIbckyaA8
 ZQHIxB+h8IlxhpQ8HJKQpZhtYiluxEu6wDdvtpwC0KiGp5g4Q5eg9SkVyvETIEN8
 yW2tHqMAJJtJjDAPwJOV+GIyWQMIexzLmMP7Fq8R5VIoAox+Xn3flgNNEnMZdl/r
 kTc/hRjHvQ==
 =VEQr
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge
  window, there will be others once the net tree has gone in. This
  contains:

   - Cleanup and unification of cancelation handling across various
     request types.

   - Improvement for bundles, supporting them both for incrementally
     consumed buffers, and for non-multishot requests.

   - Enable toggling of iowait usage while waiting on io_uring events.
     Unfortunately this is still tied to CPU frequency boosting on
     short waits, as the scheduler side has not been very receptive to
     splitting the (useless) iowait stat from the cpufreq-implied
     boost.

   - Add support for kbuf nodes, enabling zero-copy support for the ublk
     block driver.

   - Various cleanups for resource node handling.

   - Series greatly cleaning up the legacy provided (non-ring based)
     buffers. For years, we've been pushing the ring provided buffers as
     the way to go, and that is what people have been using. Reduce the
     complexity and code associated with legacy provided buffers.

   - Series cleaning up the compat handling.

   - Series improving and cleaning up the recvmsg/sendmsg iovec and msg
     handling.

   - Series of cleanups for io-wq.

   - Start adding a bunch of selftests. The liburing repository
     generally carries feature and regression tests for everything, but
     at least for ublk initially, we'll try and go the route of having
     it in selftests as well. We'll see how this goes, might decide to
     migrate more tests this way in the future.

   - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...
2025-03-26 17:56:00 -07:00
Linus Torvalds
054570267d lsm/stable-6.15 PR 20250323
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmfgWgMUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNW5RAAvCDq5gBtY0aTNlULe637EVLSh+t8
 PkSzHzu/NlzU6BfjtwSm2fuML8welTGxSwUPxUzMCI91gPdkGeFktefavT3xa+QI
 BHWROn7fEJ/KmRZvngPeIkgLr5xhF5nBJmc/Jw71qem20zRzNgJnpzMX16d10Phx
 dxd2xOO1qM3bv6Z9RcIssZRGaN+PHngpWWg+0B69XuaBUso87S6NDyKNn1XPmvoz
 as96k+Wk/xAZGVEeCbs/+H5rBx6DLg+FfTRa06Oh4BFsqedpkDPxLrTgCJGJkA0H
 dsK6O/993zvjx0Jn4ZPoJ9n35S82BmkCsz4bGq1xVl6FYUiMcm3/8yO41wllS+w4
 j+RlTU/RIdB7n8EKyMMl1hj1stTvt3Bi9F5Cbf7ZEv0snfR00K4KVpi17jnFjUHv
 kpOiEtXZb/NGQip7UAuUq0PisfqbiO4jJurYHRetDgv1WCy6+C8ufM5t6I+cnvmG
 VG+dlxcW+rDIn6bLRVuGi9TJRsQ6eox9ipa+qEKNNiOXgftELcgT7m74nAS5m0uv
 n5rDa221nPXecEB0X7d6YUFk711lly90dbelNeLrmv1w6jl8L1PpS1oBaW+UzGu9
 46eGBd6pzu9otvK9WVyDEdotDOCrgH0sd7pTetqDhLJZ7KrGwyyqO2gD/JroUKcC
 lnxBQwPnat86iI8=
 =oxfV
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings

   Changes include marking trivial Rust bindings as inlines and comment
   tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed()

   Similar to the io_uring_disabled sysctl, add an LSM hook to
   io_uring_allowed() to give LSMs a simple way to enforce security
   policy on the use of io_uring. This pull request includes SELinux
   support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook

   The perf_event_attr struct parameter was not used by any currently
   supported LSMs; remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code

   We've seen problems in the past where patches to the credentials code
   sent by non-maintainers would often languish on the lists for
   multiple months as there was no one explicitly tasked with the
   responsibility of reviewing and/or merging credentials related code.

   Considering that most of the code under security/ has a vested
   interest in ensuring that the credentials code is well maintained,
   I'm volunteering to look after the credentials code and Serge Hallyn
   has also volunteered to step up as an official reviewer. I posted the
   MAINTAINERS update as a RFC to LKML in hopes that someone else would
   jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion

   This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()
2025-03-25 15:44:19 -07:00
Pavel Begunkov
816619782b io_uring: move min_events sanitisation
iopoll and normal waiting already duplicate min_completion truncation,
so move them inside the corresponding routines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/254adb289cc04638f25d746a7499260fa89a179e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:38:01 -06:00
Pavel Begunkov
d73acd7af3 io_uring: rename "min" arg in io_iopoll_check()
Don't name arguments "min"; it shadows the namesake function.
min_events is also more consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52ce9d88d3bca5732a218b0da14924aa6968909.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:57 -06:00
Pavel Begunkov
4c76de42cb io_uring: open code __io_post_aux_cqe()
There is no reason to keep __io_post_aux_cqe() separately from
io_post_aux_cqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c4c1f68d694deea25a212fc09bbb11f330cd82e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:52 -06:00
Pavel Begunkov
3afcb3b2e3 io_uring: defer iowq cqe overflow via task_work
Don't handle CQE overflows in io_req_complete_post() and defer it to
flush_completions. It cuts some duplication, and I also want to limit
the number of places directly overflowing completions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9046410ac27e18f2baa6f7cdb363ec921cbc3b79.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:47 -06:00
Pavel Begunkov
3f0cb8de56 io_uring: fix retry handling off iowq
io_req_complete_post() doesn't handle reissue, and if called with a
REQ_F_REISSUE request it might post extra unexpected completions. Fix it
by pushing it into flush_completions via task work.

Fixes: d803d12394 ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/badb3d7e462881e7edbfcc2be6301090b07dbe53.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:43 -06:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
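A minimal sketch of the conversion pattern described above (the timer field and callback names are hypothetical; real call sites vary in their clock and mode arguments):

    /* Before: two-step init with a raw store to the callback pointer. */
    hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    t->timer.function = my_timer_cb;

    /* After: one call sets up the timer and its callback together. */
    hrtimer_setup(&t->timer, my_timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL);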
Pavel Begunkov
3a4689ac10 io_uring/cmd: add iovec cache for commands
Add iou_vec to commands and wire up caching for it, but don't expose it
to users just yet. We need the vec cleared on initial alloc, but since
we can't place it at the beginning at the moment, zero the entire
async_data. It's cached, so the cost affects only the initial
allocation, and it might not be a bad idea anyway since we're exposing
those bits to outside drivers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c0f2145b75791bc6106eb4e72add2cf6a2c72a7a.1742579999.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-21 12:52:15 -06:00
Jens Axboe
07754bfd9a io_uring: enable toggle of iowait usage when waiting on CQEs
By default, io_uring marks a waiting task as being in iowait, if it's
sleeping waiting on events and there are pending requests. This isn't
necessarily always useful, and may be confusing on non-storage setups
where iowait isn't expected. It can also cause extra power usage, by
preventing the CPU from entering lower sleep states.

This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then
io_uring will not account the sleeping task as being in iowait. If the
kernel supports this feature, then it will be marked by having the
IORING_FEAT_NO_IOWAIT feature flag set.

As the kernel currently does not support separating the iowait
accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT
controls both of these at the same time. In the future, if those do end
up being split, then it'd be possible to control them separately.
However, it seems more likely that the kernel will decouple iowait and
CPU frequency boosting anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-20 20:01:03 -06:00
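A sketch of how userspace might opt out of iowait accounting under these flags, assuming headers new enough to define them (the wrapper name and elided error handling are illustrative only):

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int wait_cqes_no_iowait(int ring_fd, unsigned ring_features,
                                   unsigned wait_nr)
    {
            unsigned flags = IORING_ENTER_GETEVENTS;

            /* Only set the flag if the kernel advertised support. */
            if (ring_features & IORING_FEAT_NO_IOWAIT)
                    flags |= IORING_ENTER_NO_IOWAIT;

            return syscall(__NR_io_uring_enter, ring_fd, 0, wait_nr,
                           flags, NULL, 0);
    }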
Pavel Begunkov
5f14404bfa io_uring/cmd: don't expose entire cmd async data
io_uring needs private bits in cmd's ->async_data, and they should never
be exposed to drivers as it'd certainly be abused. Leave struct
io_uring_cmd_data for the drivers but wrap it into a structure. It's a
prep patch and doesn't do anything useful yet.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-3-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 09:25:55 -06:00
Pavel Begunkov
575e7b0629 io_uring: rename the data cmd cache
Pick a more descriptive name for the cmd async data cache.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-2-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 09:25:55 -06:00
Pavel Begunkov
5027d02452 io_uring: unify STOP_MULTISHOT with IOU_OK
IOU_OK means that request ownership is now handed back to core
io_uring, which has to complete it using the result provided in
req->cqe. The same is true for multishot and IOU_STOP_MULTISHOT.

Rename it to IOU_COMPLETE to avoid confusion and use it for both modes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6a5b2edb0eb9558acb1c8f1db38ac45fee95491.1741453534.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 07:14:18 -06:00
Pavel Begunkov
7a9dcb05f5 io_uring: return -EAGAIN to continue multishot
Multishot errors can be mapped 1:1 to normal errors, but they are not
identical. That leads to a peculiar situation where all multishot
requests have to check in what context they're run and return different
codes.

Unify them, starting with the EAGAIN / IOU_ISSUE_SKIP_COMPLETE
(EIOCBQUEUED) pair, which means that core io_uring still owns the
request and it should be retried. In the multishot case it naturally
just continues to poll; otherwise it might poll, use iowq, or do any
other kind of allowed blocking. Introduce IOU_RETRY, aliased to
-EAGAIN, for that.

Apart from obvious upsides, multishot can now also check for misuse of
IOU_ISSUE_SKIP_COMPLETE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/da117b79ce72ecc3ab488c744e29fae9ba54e23b.1741453534.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 07:14:18 -06:00
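Taken together, the two patches above converge on a return-code contract that can be sketched as follows (the authoritative definitions live in io_uring's internal headers; values are as the commits describe them):

    enum {
            /* ownership handed back to core io_uring, which completes
             * the request from req->cqe (replaces IOU_OK and
             * IOU_STOP_MULTISHOT) */
            IOU_COMPLETE            = 0,

            /* core io_uring still owns the request and should retry:
             * multishot keeps polling; single-shot may poll, punt to
             * io-wq, or block as allowed */
            IOU_RETRY               = -EAGAIN,

            /* completion arrives via another path; don't post a CQE */
            IOU_ISSUE_SKIP_COMPLETE = -EIOCBQUEUED,
    };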
Jens Axboe
78b6f6e9bf Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-reg-vec
* for-6.15/io_uring-rx-zc: (80 commits)
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
  net: add helpers for setting a memory provider on an rx queue
  net: page_pool: add memory provider helpers
  net: prepare for non devmem TCP memory providers
  ...
2025-03-07 09:07:11 -07:00
Caleb Sander Mateos
0d83b8a9f1 io_uring: introduce io_cache_free() helper
Add a helper function io_cache_free() that returns an allocation to a
io_alloc_cache, falling back on kfree() if the io_alloc_cache is full.
This is the inverse of io_cache_alloc(), which takes an allocation from
an io_alloc_cache and falls back on kmalloc() if the cache is empty.

Convert 4 callers to use the helper.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05 07:38:55 -07:00
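The helper's shape, as described, can be sketched like this (assuming the existing cache-put primitive returns false when the cache is full):

    /* Return an object to the cache, freeing it if the cache is full;
     * the inverse of io_cache_alloc(). */
    static inline void io_cache_free(struct io_alloc_cache *cache, void *obj)
    {
            if (!io_alloc_cache_put(cache, obj))
                    kfree(obj);
    }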
Keith Busch
ed9f3112a8 io_uring: cache nodes and mapped buffers
Frequent alloc/free cycles on these are pretty costly. Use an io cache to
more efficiently reuse these buffers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-7-kbusch@meta.com
[axboe: fix imu leak]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:46 -07:00
Keith Busch
27cb27b6d5 io_uring: add support for kernel registered bvecs
Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.

User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:36 -07:00
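From userspace, registering such an empty table might look like the following sketch, using liburing's sparse-registration helper (the slot count is arbitrary):

    #include <liburing.h>

    /* Register an empty, sparse fixed-buffer table of 16 slots. A
     * kernel user (e.g. a block driver) can later install its own
     * bvecs into slots, which userspace then references by index for
     * zero-copy IO. */
    static int setup_kernel_bvec_table(struct io_uring *ring)
    {
            return io_uring_register_buffers_sparse(ring, 16);
    }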
Jens Axboe
c0d8c0362b Merge branch 'io_uring-6.14' into for-6.15/io_uring
Merge mainline fixes into 6.15 branch, as upcoming patches depend on
fixes that went into the 6.14 mainline branch.

* io_uring-6.14:
  io_uring/net: save msg_control for compat
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation
2025-02-27 07:18:01 -07:00
Pavel Begunkov
c457eed55d io_uring: make io_poll_issue() sturdier
io_poll_issue() forwards the call to io_issue_sqe() and thus inherits
some of the handling. That's not particularly failure-resistant: for
example, returning an innocent-looking IOU_OK from a multishot issue
will lead to severe bugs.

Reimplement io_poll_issue() without io_issue_sqe()'s request completion
logic. Remove extra checks as we know that req->file is already set,
linked timeouts are armed, and iopoll is not supported. Also cover it
with warnings for now.

The patch should be useful by itself, but it's also preparing the
codebase for other future clean ups.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3096d7b1026d9a52426a598bdfc8d9d324555545.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24 12:11:06 -07:00
Linus Torvalds
f679ebf6aa io_uring-6.14-20250221
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAme4rVYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpodqD/9hVmNuIJTpfLNdKNGgsgE1VRhtTtKgISOV
 U2c2NDVvx6brqyRL2JGfC3DhZyXvC98qEKXfsJf0lbDPh/pP3oQMLWmPJWsJKVUt
 DpE9RzI89/objCvIUNMLS9svixmBcVSBnF9x1ppGl6Ou4+NSkS72aswMNlP3qLr9
 +UnAxEFWCCn/8sl29GdtUDXSm8l/u6oftxeGhXTaczwEVUSvclyTqrsqoy05z9ib
 Ar5PC9h7crwCgrnMahAUkM603Vi3gZzkDiQre/GK28HN7nx0TbwB9inRsxrO3sv5
 01rwwA67oN7sEp2AM3Svh4MyBb3wyNRzcynZMgml2YM+0RRmstmeT2VBjilCRVCT
 s5pO+KnJ1Q64rYcaqPU5+hqfJ55d8qveBjB8Omu4BrFaf7HuzgLq1bbEBrmyoHKm
 TQvM/USnwycOs6FrdAWS5X+nx9pFIDowd6ZqLckVhWxjnIsi/WeyNoJFP9U887X7
 YdatkNoqY7EApykzUxVKzuSw1sgADE9HvnBSrfBcbyxA3Wpl52K+Y+/NyRjlJamF
 aGP8FpM1ZoOcwyWhi0By8PmB4flSnirYQ7Ysu0Wv12CoNylHv1n1pbbGNZ/LuXGN
 MrrE4jQLk5cVEfSk/+C0G5MreUkwav8K5cnEoqq36peq9LXW36jQdWkYieZMCbun
 lKODmRPAgA==
 =w+aR
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Series fixing an issue with multishot read on pollable files that may
   return -EIOCBQUEUED from ->read_iter(). Four small patches for that,
   the first one deliberately done in such a way that it'd be easy to
   backport

 - Remove some dead constant definitions

 - Use array_index_nospec() for opcode indexing

 - Work-around for worker creation retries in the presence of signals

* tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux:
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation
2025-02-21 09:17:56 -08:00
Caleb Sander Mateos
62aa9805d1 io_uring: use lockless_cq flag in io_req_complete_post()
io_uring_create() computes ctx->lockless_cq as:
ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)

So use it to simplify that expression in io_req_complete_post().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250212005119.3433005-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18 10:32:22 -07:00
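The simplification amounts to replacing a re-derived condition with the precomputed flag, roughly (surrounding locking code condensed):

    /* Before: re-deriving the property at completion time. */
    if (!(ctx->flags & IORING_SETUP_IOPOLL) && !ctx->task_complete)
            spin_lock(&ctx->completion_lock);

    /* After: by De Morgan, !lockless_cq is exactly the same condition. */
    if (!ctx->lockless_cq)
            spin_lock(&ctx->completion_lock);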
Nam Cao
3f8d93d137 io_uring: Use helper function hrtimer_update_function()
The field 'function' of struct hrtimer should not be changed directly, as
the write is lockless and a concurrent timer expiry might end up using the
wrong function pointer.

Switch to use hrtimer_update_function() which also performs runtime checks
that it is safe to modify the callback.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/9b33f490fb1d207d3918ef5e116dc3412ae35c1e.1738746927.git.namcao@linutronix.de
2025-02-18 17:41:35 +01:00
David Wei
6f377873cb io_uring/zcrx: add interface queue and refill queue
Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.

The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which will then be re-used by the netdev again.

The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset + len to the data.

For now, each io_uring instance only has a single ifq.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:03 -07:00
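In uapi terms the commit describes roughly the following shape; everything here except IORING_REGISTER_ZCRX_IFQ and IORING_OFF_RQ_RING is a hypothetical placeholder, not the final uapi:

    /* Carried in the upper 16 bytes of a big (32-byte) CQE; field
     * names are placeholders for the offset + len the commit mentions. */
    struct zcrx_cqe_sketch {
            __u64   off;    /* where the received data landed */
            __u64   len;    /* how much data is there */
    };

    /* One ifq per ring for now, registered via the new opcode:
     *   io_uring_register(fd, IORING_REGISTER_ZCRX_IFQ, &reg, 1);
     * The kernel-allocated refill queue is then mapped like SQ/CQ:
     *   mmap(NULL, rq_size, PROT_READ | PROT_WRITE, MAP_SHARED,
     *        fd, IORING_OFF_RQ_RING);
     */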
Caleb Sander Mateos
94a4274bb6 io_uring: pass struct io_tw_state by value
8e5b3b89ec ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.

So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:50 -07:00
Caleb Sander Mateos
bcf8a0293a io_uring: introduce type alias for io_tw_state
In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.

Also add a comment to struct io_tw_state to explain its purpose.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:50 -07:00
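The two patches above combine into roughly this sketch (simplified; the real typedef and handler signatures live in io_uring's internal headers):

    /* io_tw_state is empty: it carries no data and exists purely so
     * task-work handlers can't be invoked from the wrong context. */
    struct io_tw_state {
    };

    /* First patch: alias the pointer type in one place:
     *   typedef struct io_tw_state *io_tw_token_t;
     * Second patch: make the alias the 0-sized struct itself, passed
     * by value - no instructions needed to initialize or pass it: */
    typedef struct io_tw_state io_tw_token_t;

    static void example_tw_handler(struct io_kiocb *req, io_tw_token_t tw)
    {
            /* tw proves we're in task-work context; nothing to read */
    }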
Caleb Sander Mateos
60e6ce746b io_uring: pass ctx instead of req to io_init_req_drain()
io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.

Drop "req" from the function name since it operates on the ctx rather
than a specific req.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:46 -07:00
Caleb Sander Mateos
0e8934724f io_uring: use IO_REQ_LINK_FLAGS more
Replace the 2 instances of REQ_F_LINK | REQ_F_HARDLINK with
the more commonly used IO_REQ_LINK_FLAGS.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250211202002.3316324-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:46 -07:00
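The refactor in a nutshell (the mask definition is as named in the commit; the surrounding statement is illustrative):

    #define IO_REQ_LINK_FLAGS (REQ_F_LINK | REQ_F_HARDLINK)

    /* Before: open-coded mask. */
    if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
            req->flags &= ~(REQ_F_LINK | REQ_F_HARDLINK);

    /* After: the named mask the rest of the file already uses. */
    if (req->flags & IO_REQ_LINK_FLAGS)
            req->flags &= ~IO_REQ_LINK_FLAGS;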
Pavel Begunkov
54e00d9a61 io_uring/kbuf: introduce io_kbuf_drop_legacy()
io_kbuf_drop() is only used for legacy provided buffers, and so
__io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the
dead branch out of __io_put_kbuf_list(), rename it into
io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov
13ee854e7c io_uring/kbuf: remove legacy kbuf caching
Remove all struct io_buffer caches, which makes things a fair bit
simpler. Apart from killing a bunch of lines and juggling between lists,
__io_put_kbuf_list() doesn't need ->completion_lock locking now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov
dd4fbb11e7 io_uring/kbuf: move locking into io_kbuf_drop()
Move the burden of locking out of the caller into io_kbuf_drop(); that
will help with further refactoring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/530f0cf1f06963029399f819a9a58b1a34bebef3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov
9afe6847cf io_uring/kbuf: remove legacy kbuf kmem cache
Remove the kmem cache used by legacy provided buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov
92a3bac9a5 io_uring: sanitise ring params earlier
Do all struct io_uring_params validation early on before allocating the
context. That makes initialisation easier, especially by having fewer
places where we need to care about partial de-initialisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/363ba90b83ff78eefdc88b60e1b2c4a39d182247.1738344646.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov
7215469659 io_uring: check for iowq alloc_workqueue failure
alloc_workqueue() can fail even during init in io_uring_init(); check
the result and panic if anything goes wrong.

Fixes: 73eaa2b583 ("io_uring: use private workqueue for exit work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.1738343083.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
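The shape of the check, sketched (the workqueue name and flags are assumed from the exit-work usage the Fixes: tag points at):

    static int __init io_uring_init(void)
    {
            /* ... */
            iou_wq = alloc_workqueue("iou_exit", WQ_UNBOUND, 64);
            BUG_ON(!iou_wq);        /* failure this early is unrecoverable */
            /* ... */
            return 0;
    }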
Pavel Begunkov
40b991837f io_uring: deduplicate caches deallocation
Add a function that frees all ring caches, since we already have two
spots repeating the same thing and it's easy to change one of them and
miss the other.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b6b0125677c58bdff99eda91ab320137406e8562.1738342562.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov
1e988c3fe1 io_uring: prevent opcode speculation
sqe->opcode is used to index different tables; make sure we sanitise it
against speculation.

Cc: stable@vger.kernel.org
Fixes: d3656344fe ("io_uring: add lookup table for various opcode needs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/7eddbf31c8ca0a3947f8ed98271acc2b4349c016.1739568408.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-15 08:15:12 -07:00
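The standard pattern for this class of fix, sketched: validate the untrusted index, then clamp it under speculation before any table access.

    #include <linux/nospec.h>

    /* sqe->opcode comes from userspace and indexes several tables. */
    u8 opcode = READ_ONCE(sqe->opcode);

    if (unlikely(opcode >= IORING_OP_LAST))
            return -EINVAL;
    req->opcode = array_index_nospec(opcode, IORING_OP_LAST);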
Hamza Mahfooz
c6ad9fdbd4 io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
It is desirable to allow LSMs to configure accessibility to io_uring
because it is a coarse yet very simple way to restrict access to it. So,
add an LSM hook to io_uring_allowed() to guard access to io_uring.

Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
[PM: merge fuzz due to changes in preceding patches, subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-02-07 17:17:49 -05:00
Hamza Mahfooz
b8a468e0b0 io_uring: refactor io_uring_allowed()
Have io_uring_allowed() return an error code directly instead of
true/false. This is needed for follow-up work to guard io_uring_setup()
with LSM.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
[PM: goto-to-return conversion as discussed on-list]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-02-07 17:17:49 -05:00
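Combined, the two patches above reshape the gate roughly like this (a sketch: sysctl semantics abbreviated, the io_uring-group check elided):

    static int io_uring_allowed(void)
    {
            unsigned int disabled = READ_ONCE(sysctl_io_uring_disabled);

            if (disabled == 2)
                    return -EPERM;  /* io_uring fully disabled */
            if (disabled == 1 && !capable(CAP_SYS_ADMIN))
                    return -EPERM;  /* restricted (group check elided) */

            /* New: the LSM gets the final say (SELinux checks the
             * io_uring/allowed permission here). */
            return security_uring_allowed();
    }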
Linus Torvalds
c82da38b28 io_uring-6.14-20250131
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmec70wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp61D/4pFyr6hgqq22bkUHonGRqSpXnFXLfWmjWJ
 p/M9i8+3YS7Q5BUmBjmE0rncOrjqs+oFACXBXPTKqboPqgjGDLrrhZuOWn6OH6Pv
 nPxHS1eP813B/SY/qpSrPXz9b8tlgLZqY35dB9/2USB7k1Lbly204HoonHWnNvu7
 tk43YkSa8q5IWoJaUn2a8q8yi0isxCkt2UtlChkAaQEhXNoUIpr1lHnUx1VTHoB4
 +VfwMNvyXNMy3ENGvGjMEKLqKF2QyFJbwCsPYZDgvAxw8gCUHqCqMgCfTzWHAXgH
 VRvspost+6DKAbR0nIHpH421NZ1n4nnN1MUxxJizGSPpfxBR/R8i8Vtfswxzl6MN
 YNQlASGIbzlJhdweDKRwZH2LHgo+EkF2ULQG0b0Di7KFLwjfPtDN7KraPHRHnMJr
 yiKUY4Tf9PuEjgdIDAzqfU8Lgr5GKFE9pYA6NlB+3mkPt2JGbecWjeBV76a4DqjA
 RyaRKNwAQzlZkJxftq0OJLiFsBUTewZumRdxlrouV+RZZ5HlzZjINKBqEYlMzned
 zTdr4xzc96O5xV7OcLDuSk2aMU0RKcFyMmLMfOHET11Hu/PFmmiI+KaBPxheKZLb
 nWPQFtUuEJmYkSntsNZZ8rx6ef4CoUPnhmJrN1JR0zfhJeykxl/1eCmWZjwKc8s1
 7iXe48s4Dg==
 =hygF
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.14-20250131' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:

 - Series cleaning up the alloc cache changes from this merge window,
   and then another series on top making it better yet.

   This also solves an issue with KASAN_EXTRA_INFO, by making io_uring
   resilient to KASAN using parts of the freed struct for storage

 - Cleanups and simplifications to buffer cloning and io resource node
   management

 - Fix an issue introduced in this merge window where READ/WRITE_ONCE
   was used on an atomic_t, which made some archs complain

 - Fix for an errant connect retry when the socket has been shut down

 - Fix for multishot and provided buffers

* tag 'io_uring-6.14-20250131' of git://git.kernel.dk/linux:
  io_uring/net: don't retry connect operation on EPOLLERR
  io_uring/rw: simplify io_rw_recycle()
  io_uring: remove !KASAN guards from cache free
  io_uring/net: extract io_send_select_buffer()
  io_uring/net: clean io_msg_copy_hdr()
  io_uring/net: make io_net_vec_assign() return void
  io_uring: add alloc_cache.c
  io_uring: dont ifdef io_alloc_cache_kasan()
  io_uring: include all deps for alloc_cache.h
  io_uring: fix multishots with selected buffers
  io_uring/register: use atomic_read/write for sq_flags migration
  io_uring/alloc_cache: get rid of _nocache() helper
  io_uring: get rid of alloc cache init_once handling
  io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout
  io_uring/uring_cmd: use cached cmd_op in io_uring_cmd_sock()
  io_uring/msg_ring: don't leave potentially dangling ->tctx pointer
  io_uring/rsrc: Move lockdep assert from io_free_rsrc_node() to caller
  io_uring/rsrc: remove unused parameter ctx for io_rsrc_node_alloc()
  io_uring: clean up io_uring_register_get_file()
  io_uring/rsrc: Simplify buffer cloning by locking both rings
2025-01-31 11:29:23 -08:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Jens Axboe
fa3595523d io_uring: get rid of alloc cache init_once handling
init_once is called when an object doesn't come from the cache, and
hence needs initial clearing of certain members. While the whole
struct could get cleared by memset() in that case, a few of the cache
members are large enough that this may cause unnecessary overhead if
the caches used aren't large enough to satisfy the workload. For those
cases, some churn of kmalloc+kfree is to be expected.

Ensure that the 3 users that need clearing put the members they need
cleared at the start of the struct, and wrap the rest of the struct in
a struct group so the offset is known.

While at it, improve the interaction with KASAN such that when/if
KASAN writes to members inside the struct that should be retained over
caching, it won't trip over itself. For rw and net, the retaining of
the iovec over caching is disabled if KASAN is enabled. A helper will
free and clear those members in that case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23 11:32:28 -07:00
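The layout trick described above, sketched with hypothetical member names (struct_group() is the real kernel macro; the member split is illustrative):

    struct io_async_example {
            /* members that must be zeroed on a fresh, non-cached alloc
             * sit first, so one small memset covers exactly them: */
            struct iovec            *free_iov;
            int                     nr_iov;

            struct_group(retained,  /* survives cache recycling */
                    struct iovec    fast_iov;
                    /* ... larger state that need not be re-cleared ... */
            );
    };

    /* On a cache miss, clear only the leading members: */
    memset(obj, 0, offsetof(struct io_async_example, retained));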
Linus Torvalds
a312e1706c for-6.14/io_uring-20250119
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeNDEUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl5hD/4t7kWWNQDeQG9CiA3QStMJ5Yow2AgYtK8f
 sJBr5/6PGEsbTreX//Kh8DtPZPRGcjG9elCo58QxWaPZ2mg3fTOR3/QYLMlaGXU2
 hSht58lj32utpuzMjMo9bG3aesi03bLf+buaq7V1FaMlcTV8rXqK1s/HGtphDBRo
 8tNLEk3JDJDs3vlWbNp/5Hqh9+Ro6DU8df1zWWH4Vbu8RXaGIPyJyjKvvcbfuuCf
 k7Ay45XNAmTZg+rSNGv1H3Yn1LNzPMVFLWBfzRahPCzlKy2+mJMWz1PWu9naaUK+
 WTM+kgiBLF24k59G/9xuxC5bYtsTjTbr4GsEE5ZvFBnhKPzLzzaJj7iQHRj83vtv
 tqxNmAbA3wJoNk48Zr8+cYbfDX9Q9Pl32wIaS/LxRgF9MT4lem6pyKY7Skd12oK3
 rnQ8moGtnOBxp3QUU6BZ7IX3ipb+Bgw7FhZbtVYJdlqKeKyi1QO0MuITwGXpMwk/
 EWDDTsspIf+QaTu+fmO8byJavugKljW8t7hM1JpvlfOLl+rsh6/+AYz42fCvcaA0
 Tu4bpUk8SuwALvZfU2R6bLkorGG6MFuGI8g3eixOcGir3YAcHBMfdg6ItpZi5qVt
 ToM87BMaezOZZvSwX1JBaQ0AR5HBQYmHaiLWgPsORf3PjJ0kz+u21SK9D+yJkUtU
 rT6+HvoVXA==
 =ufpE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Not a lot in terms of features this time around, mostly just cleanups
  and code consolidation:

   - Support for PI meta data read/write via io_uring, with NVMe and
     SCSI covered

   - Cleanup the per-op structure caching, making it consistent across
     various command types

   - Consolidate the various user mapped features into a concept called
     regions, making the various users of that consistent

   - Various cleanups and fixes"

* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
  io_uring: reuse io_should_terminate_tw() for cmds
  io_uring: Factor out a function to parse restrictions
  io_uring/rsrc: require cloned buffers to share accounting contexts
  io_uring: simplify the SQPOLL thread check when cancelling requests
  io_uring: expose read/write attribute capability
  io_uring/rw: don't gate retry on completion context
  io_uring/rw: handle -EAGAIN retry at IO completion time
  io_uring/rw: use io_rw_recycle() from cleanup path
  io_uring/rsrc: simplify the bvec iter count calculation
  io_uring: ensure io_queue_deferred() is out-of-line
  io_uring/rw: always clear ->bytes_done on io_async_rw setup
  io_uring/rw: use NULL for rw->free_iovec assignment
  io_uring/rw: don't mask in f_iocb_flags
  io_uring/msg_ring: Drop custom destructor
  io_uring: Move old async data allocation helper to header
  io_uring/rw: Allocate async data through helper
  io_uring/net: Allocate msghdr async data through helper
  io_uring/uring_cmd: Allocate async data through generic helper
  io_uring/poll: Allocate apoll with generic alloc_cache helper
  ...
2025-01-20 20:27:33 -08:00
Bui Quang Minh
a13030fd19 io_uring: simplify the SQPOLL thread check when cancelling requests
In io_uring_try_cancel_requests, we check whether sq_data->thread ==
current to determine if the function is called by the SQPOLL thread to do
iopoll when IORING_SETUP_SQPOLL is set. This check can race with the SQPOLL
thread termination.

io_uring_try_cancel_requests is used in 2 places: io_uring_cancel_generic
and io_ring_exit_work. In io_uring_cancel_generic, we already know
whether the current task is the SQPOLL thread. And the SQPOLL thread
never reaches io_ring_exit_work.

So to avoid the racy check, this commit adds a boolean flag to
io_uring_try_cancel_requests to determine if the caller is SQPOLL thread.

Reported-by: syzbot+3c750be01dab672c513d@syzkaller.appspotmail.com
Reported-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250113160331.44057-1-minhquangbui99@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 15:29:44 -07:00
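The shape of the fix, sketched from the description: the callers pass what they already know instead of re-deriving it racily.

    /* Before: racy - sqd->thread may be cleared concurrently as the
     * SQPOLL thread terminates. */
    bool is_sqpoll = ctx->sq_data && ctx->sq_data->thread == current;

    /* After: both callers already know the answer, so take a flag. */
    static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
                                                    struct io_uring_task *tctx,
                                                    bool cancel_all,
                                                    bool is_sqpoll_thread);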
Linus Torvalds
52a5a22d8a io_uring-6.13-20250111
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeCmJkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppx5EACv65GSg4kQZTFtuSQ3Z1fq53Itg2vVS6Bo
 d9IcO99T23IezMpRzk/HrEXWE3Kdjkp/z8spKo0ZP//dYMTm4js/PWLxfH81zluc
 lfvnJhGjsZvhQBHKZggVE70W0lmWE6OBuC0jmuujVqtHmu3d7OzGkPK7CmSyKaxR
 2ekFKaa7QvLgmx0gEPpmEsfAWzlM5hhNAPbWcdAUTQvUtnMxpTowYY8bI/drPUC0
 bOvoYq7O/ZCdgobNGPiiOUrEQfDAuc7S3aQ+i5zn7gIu0BHe31XlwR8hbt6mRz/0
 SHk2eSecrv0H6rA4YPKFno7eQZOIWd43T+t9IjJUpykcXkMfyNOKO2HLR/FaQkxN
 kFNcCjFNJ6qLacTtbIZCzRs2Skhe5AF56jJ9FiVZbE3MKNuQBjcM2DpRlkuJLGvw
 71T5cldS0394+lIA+B2DjYVJ6IqMBHQ23brnL0HfMBuRuLaPweHj//wh5S6oCLg0
 X9Nq0tvgoYVo0M+jNS8NW4zWaoOdAw8eIlTVl8VNr1mSklpA0ZCgFXFsnCBZZb3N
 C7SgG1lrmI+IYTC30LKxDcwmCi3JhDQg5Yvz9trQzMDMJaePMms+achcHyY9WfL5
 0feUMe4RZAOEros0W7QshaAiz5TWFCoGi18muhzXDECQEQ9cV+Mh2BJ+JFiOP/ZT
 LxNpFaFwDg==
 =XUlm
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.13-20250111' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for multishot timeout updates only using the updated value for
   the first invocation, not subsequent ones

 - Silence a false positive lockdep warning

 - Fix the eventfd signaling and putting RCU logic

 - Fix fault injected SQPOLL setup not clearing the task pointer in the
   error path

 - Fix local task_work looking at the SQPOLL thread rather than just
   signaling the safe variant. Again one of those theoretical issues,
   which should be closed up none the less.

* tag 'io_uring-6.13-20250111' of git://git.kernel.dk/linux:
  io_uring: don't touch sqd->thread off tw add
  io_uring/sqpoll: zero sqd->thread on tctx errors
  io_uring/eventfd: ensure io_eventfd_signal() defers another RCU period
  io_uring: silence false positive warnings
  io_uring/timeout: fix multishot updates
2025-01-11 10:59:43 -08:00
Anuj Gupta
94d57442e5 io_uring: expose read/write attribute capability
After commit 9a213d3b80c0, we can pass additional attributes along with
read/write. However, userspace doesn't know that. Add a new feature flag
IORING_FEAT_RW_ATTR, to notify the userspace that the kernel has this
ability.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20241205062109.1788-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10 17:12:42 -07:00
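Userspace-side detection, sketched with the raw setup syscall (the attribute use described in the comment is illustrative):

    #include <linux/io_uring.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            struct io_uring_params p;

            memset(&p, 0, sizeof(p));
            int fd = syscall(__NR_io_uring_setup, 8, &p);
            if (fd < 0)
                    return 1;

            /* The kernel accepts per-request read/write attributes
             * (e.g. PI metadata) iff this feature bit is advertised. */
            if (p.features & IORING_FEAT_RW_ATTR) {
                    /* safe to attach attributes to read/write SQEs */
            }
            return 0;
    }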
Pavel Begunkov
bd2703b42d io_uring: don't touch sqd->thread off tw add
With IORING_SETUP_SQPOLL all requests are created by the SQPOLL task,
which means that req->task should always match sqd->thread. Since
accesses to sqd->thread should be separately protected, use req->task
in io_req_normal_work_add() instead.

Note, in the eyes of io_req_normal_work_add(), the SQPOLL task struct
is always pinned and alive, and sqd->thread can either be the task or
NULL. It's only problematic if the compiler decides to reload the value
after the null check, which is not so likely.

Cc: stable@vger.kernel.org
Cc: Bui Quang Minh <minhquangbui99@gmail.com>
Reported-by: lizetao <lizetao1@huawei.com>
Fixes: 78f9b61bd8 ("io_uring: wake SQPOLL task when task_work is added to an empty queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1cbbe72cf32c45a8fee96026463024cd8564a7d7.1736541357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10 14:00:25 -07:00
Linus Torvalds
7110f24f9e vfs-6.13-rc7.fixes.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4EhtAAKCRCRxhvAZXjc
 orToAQCIKKS7fk9j8CUSAdRG5mMy7Q++8OEVA+gyyMWuXnBPYwD/ehy+1xBVjCcI
 FBzLadaJSuygjZVCzhVXsE0oRf4A2wg=
 =waDA
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "afs:

   - Fix the maximum cell name length

   - Fix merge preference rule failure condition

  fuse:

   - Fix fuse_get_user_pages() so it doesn't risk misleading the caller
     into thinking pages have been allocated when they actually haven't

   - Fix direct-io folio offset and length calculation

  netfs:

   - Fix async direct-io handling

   - Fix read-retry for filesystems that don't provide a
     ->prepare_read() method

  vfs:

   - Prevent truncating 64-bit offsets to 32-bits in iomap

   - Fix memory barrier interactions when polling

   - Remove MNT_ONRB to fix concurrent modification of @mnt->mnt_flags
     leading to MNT_ONRB to not be raised and invalid access to a list
     member"

* tag 'vfs-6.13-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  poll: kill poll_does_not_wait()
  sock_poll_wait: kill the no longer necessary barrier after poll_wait()
  io_uring_poll: kill the no longer necessary barrier after poll_wait()
  poll_wait: kill the obsolete wait_address check
  poll_wait: add mb() to fix theoretical race between waitqueue_active() and .poll()
  afs: Fix merge preference rule failure condition
  netfs: Fix read-retry for fs with no ->prepare_read()
  netfs: Fix kernel async DIO
  fs: kill MNT_ONRB
  iomap: avoid truncating 64-bit offset to 32 bits
  afs: Fix the maximum cell name length
  fuse: Set *nbytesp=0 in fuse_get_user_pages on allocation failure
  fuse: fix direct io folio offset and length calculation
2025-01-10 09:11:11 -08:00