=========================
BPF Graph Data Structures
=========================

This document describes implementation details of new-style "graph" data
structures (linked_list, rbtree), with particular focus on the verifier's
implementation of semantics specific to those data structures.

Although no specific verifier code is referred to in this document, the document
assumes that the reader has general knowledge of BPF verifier internals, BPF
maps, and BPF program writing.

Note that the intent of this document is to describe the current state of
these graph data structures. **No guarantees** of stability for either
semantics or APIs are made or implied here.

.. contents::
    :local:
    :depth: 2

Introduction
------------

The BPF map API has historically been the main way to expose data structures
of various types for use within BPF programs. Some data structures fit naturally
with the map API (HASH, ARRAY), others less so. Consequently, programs
interacting with the latter group of data structures can be hard to parse
for kernel programmers without previous BPF experience.

Luckily, some restrictions which necessitated the use of BPF map semantics are
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
BPF allocator, it is now possible to implement BPF data structures whose API
and semantics more closely match those exposed to the rest of the kernel.

Two such data structures - linked_list and rbtree - have many verification
details in common. Because both have "root"s ("head" for linked_list) and
"node"s, the verifier code and this document refer to common functionality
as "graph_api", "graph_root", "graph_node", etc.

Unless otherwise stated, examples and semantics below apply to both graph data
structures.

Unstable API
------------

Data structures implemented using the BPF map API have historically used BPF
helper functions - either standard map API helpers like ``bpf_map_update_elem``
or map-specific helpers. The new-style graph data structures instead use kfuncs
to define their manipulation helpers. Because there are no stability guarantees
for kfuncs, the API and semantics for these data structures can be evolved in
a way that breaks backwards compatibility if necessary.

Root and node types for the new data structures are opaquely defined in the
``uapi/linux/bpf.h`` header.

Locking
-------

The new-style data structures are intrusive and are defined similarly to their
vanilla kernel counterparts:

.. code-block:: c

        struct node_data {
          long key;
          long data;
          struct bpf_rb_node node;
        };

        struct bpf_spin_lock glock;
        struct bpf_rb_root groot __contains(node_data, node);

The "root" type for both linked_list and rbtree expects to be in a map_value
which also contains a ``bpf_spin_lock`` - in the above example both global
variables are placed in a single-value arraymap. The verifier considers this
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
the same map_value and will enforce that the correct lock is held when
verifying BPF programs that manipulate the tree. Since this lock checking
happens at verification time, there is no runtime penalty.

Non-owning references
---------------------

**Motivation**

Consider the following BPF code:

.. code-block:: c

        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n); /* PASSED */

        bpf_spin_unlock(&lock);

From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
program has ownership of the pointee's (object pointed to by ``n``) lifetime.
The BPF program must pass off ownership before exiting - either via
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
``bpf_rbtree_add``.

(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
"ownership is acquired" and "ownership is passed", respectively)

What should the verifier do with ``n`` after ownership is passed off? If the
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
the object is no longer valid. The underlying memory may have been reused for
some other allocation, unmapped, etc.

When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
but that would result in programs with useful, common coding patterns being
rejected, e.g.:

.. code-block:: c

        int x;
        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n); /* PASSED */
        x = n->data;
        n->data = 42;

        bpf_spin_unlock(&lock);

Both the read from and write to ``n->data`` would be rejected. The verifier
can do better, though, by taking advantage of two details:

  * Graph data structure APIs can only be used when the ``bpf_spin_lock``
    associated with the graph root is held

  * Both graph data structures have pointer stability

     * Because graph nodes are allocated with ``bpf_obj_new`` and
       adding / removing from the root involves fiddling with the
       ``bpf_{list,rb}_node`` field of the node struct, a graph node will
       remain at the same address after either operation.

Because the associated ``bpf_spin_lock`` must be held by any program adding
or removing, if we're in the critical section bounded by that lock, we know
that no other program can add or remove until the end of the critical section.
This combined with pointer stability means that, until the critical section
ends, we can safely access the graph node through ``n`` even after it was used
to pass ownership.

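The pointer stability argument can be illustrated outside of BPF. The following
is a minimal userspace sketch, not kernel or verifier code: ``toy_link`` and the
``toy_*`` helpers are hypothetical stand-ins for an intrusive node type such as
``bpf_list_node`` and its add / remove operations. It only shows that linking
and unlinking an intrusive node rewrites the embedded link field and leaves the
containing object where it was.

.. code-block:: c

        #include <assert.h>
        #include <stddef.h>

        /* Hypothetical userspace stand-in for an intrusive link field. */
        struct toy_link {
          struct toy_link *prev, *next;
        };

        struct node_data {
          long key;
          long data;
          struct toy_link node;
        };

        /* Recover the containing object from a pointer to its link field. */
        #define toy_container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

        /* An empty circular list has a head that points at itself. */
        static void toy_list_init(struct toy_link *head)
        {
          assert(head);
          head->prev = head->next = head;
        }

        /* Insert right after head; only link fields are written. */
        static void toy_list_add(struct toy_link *head, struct toy_link *link)
        {
          link->next = head->next;
          link->prev = head;
          head->next->prev = link;
          head->next = link;
        }

        /* Unlink; the containing object is neither moved nor freed. */
        static void toy_list_del(struct toy_link *link)
        {
          link->prev->next = link->next;
          link->next->prev = link->prev;
          link->prev = link->next = link;
        }

A pointer to a ``struct node_data`` taken before ``toy_list_add`` compares equal
to the one recovered from the list afterwards, and is still the address of the
object after ``toy_list_del``. This is the property the verifier relies on when
it lets a program keep using ``n`` inside the critical section.
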
The verifier considers such a reference a *non-owning reference*. The ref
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
Both terms currently only have meaning in the context of graph nodes and API.

**Details**

Let's enumerate the properties of both types of references.

*owning reference*

  * This reference controls the lifetime of the pointee

  * Ownership of pointee must be 'released' by passing it to some graph API
    kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee

    * If not released before program ends, verifier considers program invalid

  * Access to the pointee's memory will not page fault

*non-owning reference*

  * This reference does not own the pointee

     * It cannot be used to add the graph node to a graph root, nor ``free``'d via
       ``bpf_obj_drop``

  * No explicit control of lifetime, but can infer valid lifetime based on
    non-owning ref existence (see explanation below)

  * Access to the pointee's memory will not page fault

From the verifier's perspective, non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock, another program
can do arbitrary operations on the data structure, like removing a node and
``free``-ing it via ``bpf_obj_drop``. A non-owning ref to some chunk of memory
that was removed, ``free``'d, and reused via ``bpf_obj_new`` would point to an
entirely different thing. Or the memory could go away.

To prevent this logic violation, all non-owning references are invalidated by
the verifier after a critical section ends. This is necessary to ensure the
"will not page fault" property of non-owning references. So if the verifier
hasn't invalidated a non-owning ref, accessing it will not page fault.

Currently ``bpf_obj_drop`` is not allowed in the critical section, so
if there's a valid non-owning ref, we must be in a critical section, and can
conclude that the ref's memory hasn't been dropped-and-``free``'d or
dropped-and-reused.

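The rules above can be summarized as a small state machine. What follows is a
toy userspace model, assuming a single pointer ``n`` and a single lock; the
``toy_*`` names and ``enum ref_state`` are hypothetical and do not correspond to
verifier code. Each function returns ``false`` where the verifier would be
expected to reject the program.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        /* Hypothetical states tracked for the single pointer "n". */
        enum ref_state { REF_INVALID, REF_OWNING, REF_NON_OWNING };

        struct toy_verifier {
          bool lock_held;
          enum ref_state ref;
        };

        /* Models bpf_obj_new: the program acquires an owning reference. */
        static void toy_obj_new(struct toy_verifier *v)
        {
          v->ref = REF_OWNING;
        }

        /* Models bpf_spin_lock: opens the critical section. */
        static void toy_lock(struct toy_verifier *v)
        {
          assert(!v->lock_held);
          v->lock_held = true;
        }

        /* Models a graph add: needs the lock and an owning ref, which
         * becomes non-owning because the root now owns the node.
         */
        static bool toy_add(struct toy_verifier *v)
        {
          if (!v->lock_held || v->ref != REF_OWNING)
            return false;
          v->ref = REF_NON_OWNING;
          return true;
        }

        /* Models bpf_spin_unlock: non-owning refs are invalidated. */
        static void toy_unlock(struct toy_verifier *v)
        {
          assert(v->lock_held);
          v->lock_held = false;
          if (v->ref == REF_NON_OWNING)
            v->ref = REF_INVALID;
        }

        /* Models bpf_obj_drop: owning refs only, outside the critical section. */
        static bool toy_obj_drop(struct toy_verifier *v)
        {
          if (v->lock_held || v->ref != REF_OWNING)
            return false;
          v->ref = REF_INVALID;
          return true;
        }

        /* Owning and non-owning refs may both be dereferenced. */
        static bool toy_can_access(const struct toy_verifier *v)
        {
          return v->ref != REF_INVALID;
        }

In this model, ``toy_can_access`` holds between ``toy_add`` and ``toy_unlock``
and fails afterwards, which mirrors the accepted reads and writes of
``n->data`` inside the critical section and the rejected ones after it.
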
Any reference to a node that is in an rbtree *must* be non-owning, since
the tree has control of the pointee's lifetime. Similarly, any ref to a node
that isn't in an rbtree *must* be owning. This results in a nice property:
graph API add / remove implementations don't need to check if a node
has already been added (or already removed), as the ownership model
allows the verifier to prevent such a state from being valid by simply checking
types.

However, pointer aliasing poses an issue for the above "nice property".
Consider the following example:

.. code-block:: c

        struct node_data *n, *m, *o, *p;
        n = bpf_obj_new(typeof(*n));     /* 1 */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n);        /* 2 */
        m = bpf_rbtree_first(&tree);     /* 3 */

        o = bpf_rbtree_remove(&tree, n); /* 4 */
        p = bpf_rbtree_remove(&tree, m); /* 5 */

        bpf_spin_unlock(&lock);

        bpf_obj_drop(o);
        bpf_obj_drop(p); /* 6 */

Assume the tree is empty before this program runs. If we track verifier state
changes here using the numbers in the comments above:

  1) n is an owning reference

  2) n is a non-owning reference, it's been added to the tree

  3) n and m are non-owning references, they both point to the same node

  4) o is an owning reference, n and m non-owning, all point to the same node

  5) o and p are owning, n and m non-owning, all point to the same node

  6) a double-free has occurred, since o and p point to the same node and o was
     ``free``'d in the previous statement

States 4 and 5 violate our "nice property", as there are non-owning refs to
a node which is not in an rbtree. Statement 5 will try to remove a node which
has already been removed as a result of this violation. State 6 is a dangerous
double-free.

At a minimum we should prevent state 6 from being possible. If we can't also
prevent state 5 then we must abandon our "nice property" and check whether a
node has already been removed at runtime.

We prevent both by generalizing the "invalidate non-owning references" behavior
of ``bpf_spin_unlock`` and doing similar invalidation after
``bpf_rbtree_remove``. The logic here is that any graph API kfunc which:

  * takes an arbitrary node argument

  * removes it from the data structure

  * returns an owning reference to the removed node

may result in a state where some other non-owning reference points to the same
node. So ``remove``-type kfuncs must be considered a non-owning reference
invalidation point as well.

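To see how this invalidation defuses the aliasing example, here is a toy
userspace model, not verifier code. It assumes the lock is held throughout and
tracks only a state per reference; ``toy_verifier`` and the ``toy_*`` helpers
are hypothetical names. Since the model cannot tell which references alias, a
successful remove invalidates every non-owning reference.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>
        #include <stddef.h>

        enum ref_state { REF_INVALID, REF_OWNING, REF_NON_OWNING };

        #define TOY_MAX_REFS 8

        /* Hypothetical per-program state: one slot per pointer variable. */
        struct toy_verifier {
          enum ref_state refs[TOY_MAX_REFS];
          size_t nr_refs;
        };

        /* Start tracking a new reference and return its slot index. */
        static size_t toy_new_ref(struct toy_verifier *v, enum ref_state state)
        {
          assert(v->nr_refs < TOY_MAX_REFS);
          v->refs[v->nr_refs] = state;
          return v->nr_refs++;
        }

        /* Shared by unlock and remove: drop all non-owning references. */
        static void toy_invalidate_non_owning(struct toy_verifier *v)
        {
          for (size_t i = 0; i < v->nr_refs; i++)
            if (v->refs[i] == REF_NON_OWNING)
              v->refs[i] = REF_INVALID;
        }

        /* Models an add: an owning ref becomes non-owning. */
        static bool toy_add(struct toy_verifier *v, size_t ref)
        {
          if (v->refs[ref] != REF_OWNING)
            return false;
          v->refs[ref] = REF_NON_OWNING;
          return true;
        }

        /* Models a "first" lookup: yields a new non-owning ref. */
        static size_t toy_first(struct toy_verifier *v)
        {
          return toy_new_ref(v, REF_NON_OWNING);
        }

        /* Models a remove: needs a valid non-owning ref, invalidates all
         * non-owning refs, then hands back a new owning ref in *out.
         */
        static bool toy_remove(struct toy_verifier *v, size_t ref, size_t *out)
        {
          if (v->refs[ref] != REF_NON_OWNING)
            return false;
          toy_invalidate_non_owning(v);
          *out = toy_new_ref(v, REF_OWNING);
          return true;
        }

Stepping through the numbered statements with this model, statement 4 succeeds
and leaves both ``n`` and ``m`` invalid, so statement 5 is rejected and no
second owning reference ``p`` is ever created. State 6 therefore cannot be
reached.
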