mirror of https://github.com/torvalds/linux.git (synced 2025-11-04 02:30:34 +02:00)

bpf, docs: document open-coded BPF iterators

Extract BPF open-coded iterators documentation spread out across a few
original commit messages ([0], [1]) into a dedicated doc section under
Documentation/bpf/bpf_iterators.rst. Also make explicit expectation that BPF
iterator program type should be accompanied by a corresponding open-coded
BPF iterator implementation, going forward.

  [0] https://lore.kernel.org/all/20230308184121.1165081-3-andrii@kernel.org/
  [1] https://lore.kernel.org/all/20230308184121.1165081-4-andrii@kernel.org/

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250509180350.2604946-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

parent c8ce7db0ca
commit 7220eabff8

1 changed file with 110 additions and 3 deletions

@@ -2,10 +2,117 @@
BPF Iterators
=============

--------
Overview
--------

----------
Motivation
----------

BPF supports two separate entities collectively known as "BPF iterators": BPF
iterator *program type* and *open-coded* BPF iterators. The former is
a stand-alone BPF program type which, when attached and activated by the user,
will be called once for each entity (task_struct, cgroup, etc.) that is being
iterated. The latter is a set of BPF-side APIs implementing iterator
functionality and available across multiple BPF program types. Open-coded
iterators provide similar functionality to BPF iterator programs, but give
more flexibility and control to all other BPF program types. BPF iterator
programs, on the other hand, can be used to implement anonymous or BPF
FS-mounted special files, whose contents are generated by the attached BPF
iterator program, backed by seq_file functionality. Both are useful depending
on specific needs.

When adding a new BPF iterator program, it is expected that similar
functionality will be added as an open-coded iterator for maximum flexibility.
It's also expected that iteration logic and code will be maximally shared and
reused between the two iterator API surfaces.

------------------------
Open-coded BPF Iterators
------------------------

Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
(constructor, next element fetch, destructor) and an iterator-specific type
describing on-the-stack iterator state, which is guaranteed by the BPF
verifier to not be tampered with outside of the corresponding
constructor/destructor/next APIs.

Each kind of open-coded BPF iterator has its own associated
struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
bpf_iter_<type> state needs to live on the BPF program stack, so make sure
it's small enough to fit on the BPF stack. For performance reasons it's best
to avoid dynamic memory allocation for iterator state and size the state
struct big enough to fit everything necessary. But if necessary, dynamic
memory allocation is a way to bypass BPF stack limitations. Note that the
state struct size is part of the iterator's user-visible API, so changing it
will break backwards compatibility; be deliberate about designing it.

All kfuncs (constructor, next, destructor) have to be named consistently as
bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents the
iterator type, and iterator state should be represented as a matching
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
a pointer to this `struct bpf_iter_<type>` as the very first argument.

Additionally:
  - Constructor, i.e., `bpf_iter_<type>_new()`, can have an arbitrary number
    of extra arguments. Return type is not enforced either.
  - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
    type and should have exactly one argument: `struct bpf_iter_<type> *`
    (const/volatile/restrict and typedefs are ignored).
  - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
    should have exactly one argument, similar to the next method.
  - `struct bpf_iter_<type>` size is enforced to be positive and
    a multiple of 8 bytes (to fit stack slots correctly).

Such strictness and consistency allow building generic helpers that abstract
away important, but boilerplate, details, so that open-coded iterators can be
used effectively and ergonomically (see libbpf's bpf_for_each() macro). These
rules are enforced by the kernel at the kfunc registration point.

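The naming and signature rules above can be sketched in plain user-space C.
This is an illustration only, not kernel or libbpf code: the `range` iterator
type, its fields, and the `for_each_iter()` macro are hypothetical names
invented for this sketch, and the real bpf_for_each() macro additionally
arranges for the destructor to be called automatically.

```c
/* User-space sketch, NOT kernel code. "range" is a hypothetical iterator. */
#include <stddef.h>

/* Iterator state: lives on the caller's stack. */
struct bpf_iter_range {
	int next;	/* next value to hand out */
	int end;	/* exclusive upper bound */
	int val;	/* storage for the element returned by next() */
	int pad;	/* keeps the size a multiple of 8 bytes */
};

/* Mirrors the size rule: positive and a multiple of 8 bytes. */
_Static_assert(sizeof(struct bpf_iter_range) > 0 &&
	       sizeof(struct bpf_iter_range) % 8 == 0,
	       "iterator state must be a positive multiple of 8 bytes");

/* Constructor: state pointer first, then arbitrary extra arguments. */
int bpf_iter_range_new(struct bpf_iter_range *it, int start, int end)
{
	it->next = start;
	it->end = end;
	it->val = 0;
	it->pad = 0;
	return 0;
}

/* Next: exactly one argument, returns a pointer (NULL when exhausted). */
int *bpf_iter_range_next(struct bpf_iter_range *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Destructor: exactly one argument, returns void. */
void bpf_iter_range_destroy(struct bpf_iter_range *it)
{
	(void)it;	/* nothing to release in this sketch */
}

/*
 * Hypothetical generic loop helper. It relies only on the
 * bpf_iter_<type>_{new,next} naming convention and does NOT call the
 * destructor; the caller must do that after the loop.
 */
#define for_each_iter(type, cur, it, ...)				\
	for (bpf_iter_##type##_new(&(it), __VA_ARGS__);			\
	     ((cur) = bpf_iter_##type##_next(&(it))) != NULL; )

/* Sums all values in [start, end) using the generic helper. */
int sum_range(int start, int end)
{
	struct bpf_iter_range it;
	int *v, sum = 0;

	for_each_iter(range, v, it, start, end)
		sum += *v;
	bpf_iter_range_destroy(&it);
	return sum;
}
```

Because every iterator follows the same naming scheme, the macro only needs
the `<type>` token to find the right constructor and next method.
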
Constructor/next/destructor implementation contract is as follows:
  - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
    the stack. If any of the input arguments are invalid, constructor should
    make sure to still initialize it such that subsequent next() calls will
    return NULL. I.e., on error, *return error and construct empty iterator*.
    Constructor kfunc is marked with KF_ITER_NEW flag.

  - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
    and produces an element. Next method should always return a pointer. The
    contract with the BPF verifier is that next method *guarantees* that it
    will eventually return NULL when elements are exhausted. Once NULL is
    returned, subsequent next calls *should keep returning NULL*. Next method
    is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
    NULL-returning kfunc, of course).

  - destructor, `bpf_iter_<type>_destroy()`, is always called once, even if
    constructor failed or next returned nothing. Destructor frees up any
    resources and marks stack space used by `struct bpf_iter_<type>` as usable
    for something else. Destructor is marked with KF_ITER_DESTROY flag.

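A minimal user-space sketch of this contract follows. It is not kernel code:
the `arr` iterator type, its fields, and the `count_elems()` helper are
hypothetical, and the kfunc flags (KF_ITER_NEW and friends) are not modeled.
It only shows the three behaviors the contract asks for: an empty iterator on
constructor error, NULL that stays NULL, and a destructor that is always
called once.

```c
/* User-space sketch, NOT kernel code. "arr" is a hypothetical iterator. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct bpf_iter_arr {
	const int *data;	/* elements being iterated */
	long cnt;		/* number of elements */
	long pos;		/* index of the next element */
};

/* On error: return an error AND leave behind a valid, empty iterator. */
int bpf_iter_arr_new(struct bpf_iter_arr *it, const int *data, long cnt)
{
	/* Always initialize the state first. */
	it->data = NULL;
	it->cnt = 0;
	it->pos = 0;

	if (!data || cnt < 0)
		return -EINVAL;

	it->data = data;
	it->cnt = cnt;
	return 0;
}

/* Returns the next element, or NULL once exhausted (and forever after). */
const int *bpf_iter_arr_next(struct bpf_iter_arr *it)
{
	if (it->pos >= it->cnt)
		return NULL;
	return &it->data[it->pos++];
}

/* Releases resources; here it only clears the state. */
void bpf_iter_arr_destroy(struct bpf_iter_arr *it)
{
	memset(it, 0, sizeof(*it));
}

/* Counts elements; safe even when the constructor fails. */
int count_elems(const int *data, long cnt)
{
	struct bpf_iter_arr it;
	int n = 0;

	bpf_iter_arr_new(&it, data, cnt);	/* error ignored on purpose */
	while (bpf_iter_arr_next(&it))
		n++;

	/* Extra calls after exhaustion keep returning NULL. */
	assert(bpf_iter_arr_next(&it) == NULL);

	bpf_iter_arr_destroy(&it);		/* always called exactly once */
	return n;
}
```

Note that `count_elems()` never checks the constructor's return value: the
"empty iterator on error" rule is what makes that pattern safe.
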
Any open-coded BPF iterator implementation has to implement at least these
three methods. It is enforced that for any given type of iterator only
applicable constructor/destructor/next are callable. I.e., verifier ensures
you can't pass number iterator state into, say, cgroup iterator's next method.

From a 10,000-foot BPF verification point of view, next methods are the points
of forking a verification state, which are conceptually similar to what
verifier is doing when validating conditional jumps. Verifier branches at each
`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
(iteration is done) and non-NULL (new element is returned). NULL is simulated
first and is supposed to reach exit without looping. After that non-NULL case
is validated and it either reaches exit (for trivial examples with no real
loop), or reaches another `call bpf_iter_<type>_next` instruction with the
state equivalent to already (partially) validated one. State equivalency at
that point means we technically are going to be looping forever without
"breaking out" of established "state envelope" (i.e., subsequent
iterations don't add any new knowledge or constraints to the verifier state,
so running 1, 2, 10, or a million of them doesn't matter). But taking into
account the contract stating that iterator next method *has to* return NULL
eventually, we can conclude that loop body is safe and will eventually
terminate. Given we validated logic outside of the loop (NULL case), and
concluded that loop body is safe (though potentially looping many times),
verifier can claim safety of the overall program logic.

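To make the two simulated outcomes concrete, here is the typical loop shape
written as plain user-space C, with comments marking which path corresponds
to which branch. This is a sketch under assumptions: the `cnt` iterator and
`sum_first()` are hypothetical stand-ins, and nothing here runs through the
actual BPF verifier.

```c
/* User-space sketch, NOT kernel code. "cnt" is a hypothetical iterator. */
#include <stddef.h>

struct bpf_iter_cnt {
	int left;	/* elements still to be produced */
	int val;	/* storage for the current element */
};

/* Produces 0, 1, ..., n - 1; a non-positive n yields an empty iterator. */
int bpf_iter_cnt_new(struct bpf_iter_cnt *it, int n)
{
	it->left = n > 0 ? n : 0;
	it->val = -1;
	return 0;
}

int *bpf_iter_cnt_next(struct bpf_iter_cnt *it)
{
	if (it->left == 0)
		return NULL;	/* guaranteed to be reached eventually */
	it->left--;
	it->val++;
	return &it->val;
}

void bpf_iter_cnt_destroy(struct bpf_iter_cnt *it)
{
	(void)it;
}

long sum_first(int n)
{
	struct bpf_iter_cnt it;
	long sum = 0;
	int *v;

	bpf_iter_cnt_new(&it, n);
	for (;;) {
		/* Fork point: both outcomes of this call are explored. */
		v = bpf_iter_cnt_next(&it);
		if (!v)
			break;	/* NULL outcome: must reach exit without looping */

		/*
		 * Non-NULL outcome: the body runs and control returns to the
		 * same next() call. Nothing learned here changes on later
		 * passes, so one pass stands in for all of them.
		 */
		sum += *v;
	}
	bpf_iter_cnt_destroy(&it);
	return sum;
}
```

Termination of the loop is not proven from the loop body itself; it follows
from the next method's promise to eventually return NULL.
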
------------------------
BPF Iterators Motivation
------------------------

There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps