.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
.. _switchdev:

===============================================
Ethernet switch device driver model (switchdev)
===============================================

Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>

Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>


The Ethernet switch device driver model (switchdev) is an in-kernel driver
model for switch devices which offload the forwarding (data) plane from the
kernel.

Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip.  Other setups
with SR-IOV or soft switches, such as OVS, are possible.

::


                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+----+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+----------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.


Include Files
-------------

::

    #include <linux/netdevice.h>
    #include <net/switchdev.h>


Configuration
-------------

Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev
model support is built for the driver.


Switch Ports
------------

On switchdev driver initialization, the driver will allocate and register a
struct net_device (using register_netdev()) for each enumerated physical switch
port, called the port netdev.  A port netdev is the software representation of
the physical port and provides a conduit for control traffic to/from the
controller (the kernel) and the network, as well as an anchor point for higher
level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers.  Using
standard netdev tools (iproute2, ethtool, etc), the port netdev can also
provide to the user access to the physical properties of the switch port such
as PHY link state and I/O statistics.

There is (currently) no higher-level kernel object for the switch beyond the
port netdevs.  All of the switchdev driver ops are netdev ops or switchdev ops.

A switch management port is outside the scope of the switchdev driver model.
Typically, the management port does not participate in the offloaded data plane
and is loaded with a different driver, such as a NIC driver, on the management
port device.

Switch ID
^^^^^^^^^

The switchdev driver must implement the net_device operation
ndo_get_port_parent_id for each port netdev, returning the same physical ID for
each port of a switch. The ID must be unique between switches on the same
system. The ID does not need to be unique between switches on different
systems.

The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.

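The "same switch" test described above boils down to comparing two parent IDs
byte for byte.  The following is a minimal userspace sketch of that comparison,
not kernel code: ``struct phys_id`` is a stand-in modelled on the kernel's
``struct netdev_phys_item_id``, while ``struct port``, ``port_set_parent_id()``
and ``ports_on_same_switch()`` are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the ID a port reports via ndo_get_port_parent_id. */
#define PHYS_ID_MAX_LEN 32

struct phys_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	unsigned char id_len;
};

/* Hypothetical port record holding only what the comparison needs. */
struct port {
	const char *name;
	struct phys_id parent_id;
};

/* Store the parent (switch) ID that this port would report. */
static void port_set_parent_id(struct port *p, const void *id, size_t len)
{
	assert(len <= PHYS_ID_MAX_LEN);
	memcpy(p->parent_id.id, id, len);
	p->parent_id.id_len = (unsigned char)len;
}

/* Two ports sit on the same switch only if length and bytes both match. */
static bool ports_on_same_switch(const struct port *a, const struct port *b)
{
	return a->parent_id.id_len == b->parent_id.id_len &&
	       memcmp(a->parent_id.id, b->parent_id.id,
		      a->parent_id.id_len) == 0;
}
```

In this model, two ports that were given the same three-byte ID compare as
belonging to one switch, while a port whose ID differs in any byte (or in
length) does not.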

Port Netdev Naming
^^^^^^^^^^^^^^^^^^

Udev rules should be used for port netdev naming, using some unique attribute
of the port as a key, for example the port MAC address or the port PHYS name.
Hard-coding of kernel netdev names within the driver is discouraged; let the
kernel pick the default netdev name, and let udev set the final name based on a
port attribute.

Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration.  For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name.  The udev rule would be::

    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
            ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"

Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID.  For example, sw1p1s0
would be sub-port 0 on port 1 on switch 1.

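To make the "swXpYsZ" convention concrete, the sketch below builds such a name
in two steps, mirroring the split of responsibilities in the udev rule above:
the driver side reports a PHYS name of the form "pYsZ", and the udev side
prefixes it with "swX".  This is a userspace illustration under those
assumptions; ``format_phys_port_name()`` and ``compose_netdev_name()`` are
hypothetical helpers, not kernel or udev APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the string a driver might report as its port PHYS name:
 * "p<Y>" for an unsplit port, "p<Y>s<Z>" for sub-port Z of a split port.
 * Pass sub_port < 0 for an unsplit port.  Returns the snprintf() result.
 */
static int format_phys_port_name(char *buf, size_t len,
				 unsigned int port, int sub_port)
{
	assert(buf && len > 0);
	if (sub_port < 0)
		return snprintf(buf, len, "p%u", port);
	return snprintf(buf, len, "p%us%d", port, sub_port);
}

/*
 * Mimic NAME="swX$attr{phys_port_name}" from the udev rule: prefix the
 * reported PHYS name with "sw<X>".  An empty PHYS name is rejected, as
 * the rule only matches when ATTR{phys_port_name} is non-empty.
 */
static int compose_netdev_name(char *buf, size_t len,
			       unsigned int sw, const char *phys_port_name)
{
	assert(buf && len > 0);
	assert(strlen(phys_port_name) > 0);
	return snprintf(buf, len, "sw%u%s", sw, phys_port_name);
}
```

For switch 1, port 1, sub-port 0 the two helpers produce "p1s0" and then
"sw1p1s0", matching the example in the text.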

Port Features
^^^^^^^^^^^^^

NETIF_F_NETNS_LOCAL

If the switchdev driver (and device) only supports offloading of the default
network namespace (netns), the driver should set this feature flag to prevent
the port netdev from being moved out of the default netns.  A netns-aware
driver/device would not set this flag and would be responsible for partitioning
hardware to preserve netns containment.  This means hardware cannot forward
traffic from a port in one namespace to another port in another namespace.

Port Topology
^^^^^^^^^^^^^

The port netdevs representing the physical switch ports can be organized into
higher-level switching constructs.  The default construct is a standalone
router port, used to offload L3 forwarding.  Two or more ports can be bonded
together to form a LAG.  Two or more ports (or LAGs) can be bridged to bridge
L2 networks.  VLANs can be applied to sub-divide L2 networks.  L2-over-L3
tunnels can be built on ports.  These constructs are built using standard Linux
tools such as the bridge driver, the bonding/team drivers, and netlink-based
tools such as iproute2.

The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications.  For example, a port moved into a
bond will see its upper master change.  If that bond is moved into a bridge,
the bond's upper master will change.  And so on.  The driver will track such
movements to know what position a port is in in the overall topology by
registering for netdevice events and acting on NETDEV_CHANGEUPPER.

L2 Forwarding Offload
---------------------

The idea is to offload the L2 data forwarding (switching) path from the kernel
to the switchdev device by mirroring bridge FDB entries down to the device.  An
FDB entry is the {port, MAC, VLAN} tuple forwarding destination.

To offload L2 bridging, the switchdev driver/device should support:

        - Static FDB entries installed on a bridge port
        - Notification of learned/forgotten source MAC/VLANs from the device
        - STP state changes on the port
        - VLAN flooding of multicast/broadcast and unknown unicast packets

Static FDB Entries
^^^^^^^^^^^^^^^^^^

A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
``ndo_fdb_dump`` operations is able to support the command below, which adds a
static bridge FDB entry::

        bridge fdb add dev DEV ADDRESS [vlan VID] [self] static

(the "static" keyword is non-optional: if not specified, the entry defaults to
being "local", which means that it should not be forwarded)

The "self" keyword (optional because it is implicit) has the role of
instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
will bypass the bridge and therefore leave the software database out of sync
with the hardware one.

To avoid this, the "master" keyword can be used::

        bridge fdb add dev DEV ADDRESS [vlan VID] master static

The above command instructs the kernel to search for a master interface of
``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
which the port driver can handle and use to program its hardware table. This
way, the software and the hardware database will both contain this static FDB
entry.

Note: for new switchdev drivers that offload the Linux bridge, implementing the
``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
discouraged: all static FDB entries should be added on a bridge port using the
"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
visualize the hardware tables, if the device does not have an interrupt for
notifying the operating system of newly learned/forgotten dynamic FDB
addresses. In that case, the hardware FDB might end up having entries that the
software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
them.

Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic.  To enable VLAN support, turn on VLAN filtering::

        echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering

Notification of Learned/Forgotten Source MAC/VLANs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the MAC/VLAN/port tuples.  The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call::

        err = call_switchdev_notifiers(val, dev, info, extack);

Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info.  On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED.  The iproute2 bridge
command will label these entries "offload"::

        $ bridge fdb
        52:54:00:12:35:01 dev sw1p1 master br0 permanent
        00:02:00:00:02:00 dev sw1p1 master br0 offload
        00:02:00:00:02:00 dev sw1p1 self
        52:54:00:12:35:02 dev sw1p2 master br0 permanent
        00:02:00:00:03:00 dev sw1p2 master br0 offload
        00:02:00:00:03:00 dev sw1p2 self
        33:33:00:00:00:01 dev eth0 self permanent
        01:00:5e:00:00:01 dev eth0 self permanent
        33:33:ff:00:00:00 dev eth0 self permanent
        01:80:c2:00:00:0e dev eth0 self permanent
        33:33:00:00:00:01 dev br0 self permanent
        01:00:5e:00:00:01 dev br0 self permanent
        33:33:ff:12:35:01 dev br0 self permanent

Learning on the port should be disabled on the bridge using the bridge command::

        bridge link set dev DEV learning off

Learning on the device port should be enabled, as well as learning_sync::

        bridge link set dev DEV learning on self
        bridge link set dev DEV learning_sync on self

The learning_sync attribute enables syncing of the learned/forgotten FDB entry
to the bridge's FDB.  It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.

To support learning, the driver implements switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_{PRE}_BRIDGE_FLAGS.

FDB Ageing
^^^^^^^^^^

The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
the responsibility of the port driver/device to age out these entries.  If the
port device supports ageing, when the FDB entry expires, it will notify the
driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If the
device does not support ageing, the driver can simulate ageing using a
garbage collection timer to monitor FDB entries.  Expired entries will be
notified to the bridge using SWITCHDEV_FDB_DEL.  See the rocker driver for an
example of a driver running an ageing timer.

To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
notification will reset the FDB entry's last-used time to now.  The driver
should rate limit refresh notifications, for example, no more than once a
second.  (The last-used time is visible using the bridge -s fdb option).

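The two driver-side decisions described above (when to send a refresh, and when
a garbage-collection pass should expire an entry) can be sketched as a small
userspace model.  It is an illustration only: ``struct hw_fdb_entry``,
``fdb_entry_touch()`` and ``fdb_entry_expired()`` are hypothetical names, time
is passed in as plain milliseconds rather than jiffies, and the comments mark
where a real driver would call into switchdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "No more than once a second", as suggested in the text. */
#define REFRESH_INTERVAL_MS 1000

/* Hypothetical driver-private bookkeeping for one learned entry. */
struct hw_fdb_entry {
	uint8_t  mac[6];
	uint16_t vid;
	uint64_t last_hit_ms;       /* last time traffic from mac was seen */
	uint64_t last_notified_ms;  /* last refresh sent to the bridge */
};

/*
 * Record that the entry was hit.  Returns true when the caller should
 * send a refresh (SWITCHDEV_FDB_ADD), false when it is rate limited.
 */
static bool fdb_entry_touch(struct hw_fdb_entry *e, uint64_t now_ms)
{
	e->last_hit_ms = now_ms;
	if (now_ms - e->last_notified_ms < REFRESH_INTERVAL_MS)
		return false;
	e->last_notified_ms = now_ms;
	return true;
}

/*
 * Garbage-collection check.  Returns true when the entry has been idle
 * for at least ageing_ms and the caller should remove it and notify the
 * bridge with SWITCHDEV_FDB_DEL.
 */
static bool fdb_entry_expired(const struct hw_fdb_entry *e,
			      uint64_t now_ms, uint64_t ageing_ms)
{
	assert(now_ms >= e->last_hit_ms);
	return now_ms - e->last_hit_ms >= ageing_ms;
}
```

Keeping the hit time and the notification time separate means every hit still
postpones expiry, even when the corresponding refresh to the bridge is
suppressed by the rate limit.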

STP State Change on Port
^^^^^^^^^^^^^^^^^^^^^^^^

Internally or with a third-party STP protocol implementation (e.g. mstpd), the
bridge driver maintains the STP state for ports, and will notify the switch
driver of STP state change on a port using the switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE.

State is one of BR_STATE_*.  The switch driver can use STP state updates to
update the ingress packet filter list for the port.  For example, if the port
is DISABLED, no packets should pass, but if the port moves to BLOCKED, then STP
BPDUs and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.

Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
so packet filters should be applied consistently across untagged and tagged
VLANs on the port.

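As an illustration of such a filter, the sketch below maps an STP state and a
destination MAC to an accept/drop decision.  Only the DISABLED and BLOCKING
cases come from the text above; treating LISTENING and LEARNING like BLOCKING,
and letting everything through in FORWARDING, is an assumption made here to
complete the example.  The enum values are intended to mirror ``BR_STATE_*``,
and ``stp_ingress_allowed()`` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values intended to mirror BR_STATE_* from the bridge UAPI header. */
enum stp_state {
	STP_DISABLED   = 0,
	STP_LISTENING  = 1,
	STP_LEARNING   = 2,
	STP_FORWARDING = 3,
	STP_BLOCKING   = 4,
};

/* True for the IEEE 01:80:c2:xx:xx:xx range named in the text. */
static bool is_link_local_mcast(const uint8_t dmac[6])
{
	return dmac[0] == 0x01 && dmac[1] == 0x80 && dmac[2] == 0xc2;
}

/*
 * Decide whether an ingress frame may pass the port's filter.  The same
 * decision is applied regardless of VLAN tag, since STP state covers all
 * VLANs on the port.
 */
static bool stp_ingress_allowed(enum stp_state state, const uint8_t dmac[6])
{
	assert(dmac);
	switch (state) {
	case STP_DISABLED:
		return false;                      /* nothing passes */
	case STP_FORWARDING:
		return true;                       /* assumption: no filter */
	case STP_BLOCKING:
	case STP_LISTENING:                        /* assumption */
	case STP_LEARNING:                         /* assumption */
		return is_link_local_mcast(dmac);  /* BPDUs and similar */
	}
	return false;
}
```

A real driver would translate this decision into hardware filter entries when
it handles the STP state attribute, rather than evaluating it per packet in
software.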

Flooding L2 domain
^^^^^^^^^^^^^^^^^^

For a given L2 VLAN domain, the switch device should flood multicast/broadcast
and unknown unicast packets to all ports in the domain, if allowed by each
port's current STP state.  The switch driver, knowing which ports are within
which VLAN L2 domain, can program the switch device for flooding.  The packet
may be sent to the port netdev for processing by the bridge driver.  The
bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.

To avoid duplicate packets, the switch driver should mark a packet as already
forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
the skb using the ingress bridge port's mark and prevent it from being forwarded
through any bridge port with the same mark.

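The duplicate-suppression rule above can be expressed as a single predicate.
The following userspace sketch models it under simplifying assumptions: each
bridge port carries an integer mark (``hw_mark``, with 0 meaning "not
offloaded"), and ``struct rx_packet``, ``bridge_rx()`` and
``bridge_may_egress()`` are hypothetical names that stand in for the skb and
the bridge's internal bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bridge port: ports of one switch share the same mark. */
struct bridge_port {
	const char *name;
	int hw_mark;               /* 0 = software-only port */
};

/* Hypothetical stand-in for the per-packet state kept with the skb. */
struct rx_packet {
	bool offload_fwd_mark;     /* set by the switch driver */
	int src_mark;              /* copied from the ingress bridge port */
};

/* Ingress: remember whether hardware forwarded it, and from which mark. */
static void bridge_rx(struct rx_packet *pkt, const struct bridge_port *in,
		      bool hw_forwarded)
{
	/* Only an offloaded port can deliver hardware-forwarded packets. */
	assert(!hw_forwarded || in->hw_mark != 0);
	pkt->offload_fwd_mark = hw_forwarded;
	pkt->src_mark = in->hw_mark;
}

/* Egress: skip ports the hardware has already flooded the packet to. */
static bool bridge_may_egress(const struct rx_packet *pkt,
			      const struct bridge_port *out)
{
	return !pkt->offload_fwd_mark || pkt->src_mark != out->hw_mark;
}
```

With sw1p1 and sw1p2 sharing a mark and a software port such as a veth carrying
none, a packet flooded in hardware from sw1p1 is still forwarded by the bridge
to the veth but not to sw1p2.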

It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding.  This is not ideal as the number
of ports in the L2 domain scales, because the device is much more efficient at
flooding packets than software.

If supported by the device, flood control can be offloaded to it, preventing
certain netdevs from flooding unicast traffic for which there is no FDB entry.

IGMP Snooping
^^^^^^^^^^^^^

In order to support IGMP snooping, the port netdevs should trap all IGMP join
and leave messages to the bridge driver.
The bridge multicast module will notify port netdevs of every multicast group
change, whether the group is statically configured or dynamically joined/left.
The hardware implementation should forward traffic for each registered
multicast group only to the configured ports.

L3 Routing Offload
------------------

Offloading L3 routing requires that the device be programmed with FIB entries
from the kernel, with the device doing the FIB lookup and forwarding.  The
device does a longest prefix match (LPM) on FIB entries matching the route
prefix and forwards the packet to the matching FIB entry's nexthop(s) egress
ports.

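For readers unfamiliar with LPM, the lookup the device performs can be sketched
in a few lines of plain C.  This is a linear-scan illustration of the matching
rule only; real devices use dedicated lookup structures, and ``struct route``,
``IPV4()``, ``prefix_mask()`` and ``lpm_lookup()`` are hypothetical names.
Addresses are held in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-byte-order IPv4 address from its four octets. */
#define IPV4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical FIB entry reduced to prefix, length and egress port. */
struct route {
	uint32_t dst;
	int dst_len;
	int egress_port;
};

/* Netmask for a prefix length of 0..32 (a shift by 32 is avoided). */
static uint32_t prefix_mask(int len)
{
	assert(len >= 0 && len <= 32);
	return len == 0 ? 0 : (uint32_t)(~(uint32_t)0 << (32 - len));
}

/* Return the matching entry with the longest prefix, or NULL if none. */
static const struct route *lpm_lookup(const struct route *tbl, size_t n,
				      uint32_t addr)
{
	const struct route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].dst_len);

		if ((addr & mask) != (tbl[i].dst & mask))
			continue;
		if (!best || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```

Given entries for 0.0.0.0/0, 11.0.0.0/24 and 11.0.0.4/30, a lookup of 11.0.0.5
matches all three and selects the /30, while 11.0.0.200 falls back to the /24.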

To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:

===================  ===================================================
FIB_EVENT_ENTRY_ADD  used for both adding a new FIB entry to the device,
                     or modifying an existing entry on the device.
FIB_EVENT_ENTRY_DEL  used for removing a FIB entry
FIB_EVENT_RULE_ADD,
FIB_EVENT_RULE_DEL   used to propagate FIB rule changes
===================  ===================================================

FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::

        struct fib_entry_notifier_info {
                struct fib_notifier_info info; /* must be first */
                u32 dst;
                int dst_len;
                struct fib_info *fi;
                u8 tos;
                u8 type;
                u32 tb_id;
                u32 nlflags;
        };

to add/modify/delete IPv4 dst/dst_len prefix on table tb_id.  The ``*fi``
structure holds details on the route and route's nexthops.  ``*dev`` is one
of the port netdevs mentioned in the route's next hop list.

Routes offloaded to the device are labeled with "offload" in the ip route
listing::

        $ ip route show
        default via 192.168.0.2 dev eth0
        11.0.0.0/30 dev sw1p1  proto kernel  scope link  src 11.0.0.2 offload
        11.0.0.4/30 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
        11.0.0.8/30 dev sw1p2  proto kernel  scope link  src 11.0.0.10 offload
        11.0.0.12/30 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
        12.0.0.2  proto zebra  metric 30 offload
                nexthop via 11.0.0.1  dev sw1p1 weight 1
                nexthop via 11.0.0.9  dev sw1p2 weight 1
        12.0.0.3 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
        12.0.0.4 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
        192.168.0.0/24 dev eth0  proto kernel  scope link  src 192.168.0.15

The "offload" flag is set in case at least one device offloads the FIB entry.

XXX: add/mod/del IPv6 FIB API

Nexthop Resolution
^^^^^^^^^^^^^^^^^^

The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
the switch device to forward the packet with the correct dst mac address, the
nexthop gateways must be resolved to the neighbor's mac address.  Neighbor mac
address discovery comes via the ARP (or ND) process and is available via the
arp_tbl neighbor table.  To resolve the route's nexthop gateways, the driver
should trigger the kernel's neighbor resolution process.  See the rocker
driver's rocker_port_ipv4_resolve() for an example.

The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE.  The device can be programmed with resolved nexthops
for the routes as arp_tbl is updated.  The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.

Device driver expected behavior
-------------------------------

Below is a set of defined behaviors that switchdev-enabled network devices must
adhere to.

Configuration-less state
^^^^^^^^^^^^^^^^^^^^^^^^

Upon driver bring up, the network devices must be fully operational, and the
backing driver must configure the network device such that it is possible to
send and receive traffic to this network device and it is properly separated
from other network devices/ports (e.g.: as is frequent with a switch ASIC). How
this is achieved is heavily hardware dependent, but a simple solution can be to
use per-port VLAN identifiers unless a better mechanism is available
(proprietary metadata for each network port for instance).

The network device must be capable of running a full IP protocol stack
including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
appropriate filters for VLAN, multicast, unicast etc. The underlying device
driver must effectively be configured in a similar fashion to what it would do
when IGMP snooping is enabled for IP multicast over these switchdev network
devices and unsolicited multicast must be filtered as early as possible in
the hardware.

When configuring VLANs on top of the network device, all VLANs must be working,
irrespective of the state of other network devices (e.g.: other ports being part
of a VLAN-aware bridge doing ingress VID checking). See below for details.

If the device implements, e.g., VLAN filtering, putting the interface in
promiscuous mode should allow the reception of all VLAN tags (including those
not present in the filter(s)).

Bridged switch ports
^^^^^^^^^^^^^^^^^^^^

When a switchdev-enabled network device is added as a bridge member, it should
not disrupt any functionality of non-bridged network devices and they
should continue to behave as normal network devices. The expected behavior for
each of the bridge configuration knobs is documented below.

Bridge VLAN filtering
^^^^^^^^^^^^^^^^^^^^^

The Linux bridge allows the configuration of a VLAN filtering mode (statically,
at device creation time, and dynamically, during run time) which must be
observed by the underlying switchdev network device/hardware:

- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
  data path will process all Ethernet frames as if they are VLAN-untagged.
  The bridge VLAN database can still be modified, but the modifications should
  have no effect while VLAN filtering is turned off. Frames ingressing the
  device with a VID that is not programmed into the bridge/switch's VLAN table
  must be forwarded and may be processed using a VLAN device (see below).

- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
  the device with a VID that is not programmed into the bridge/switch's VLAN
  table must be dropped (strict VID checking).

When there is a VLAN device (e.g. sw0p1.100) configured on top of a switchdev
network device which is a bridge port member, the behavior of the software
network stack must be preserved, or the configuration must be refused if that
is not possible.

- with VLAN filtering turned off, the bridge will process all ingress traffic
  for the port, except for the traffic tagged with a VLAN ID destined for a
  VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
  be added to a second bridge, which includes other switch ports or software
  interfaces. Some approaches to ensure that the forwarding domain for traffic
  belonging to the VLAN upper interfaces is managed properly:

    * If forwarding destinations can be managed per VLAN, the hardware could be
      configured to map all traffic, except the packets tagged with a VID
      belonging to a VLAN upper interface, to an internal VID corresponding to
      untagged packets. This internal VID spans all ports of the VLAN-unaware
      bridge. The VID corresponding to the VLAN upper interface spans the
      physical port of that VLAN interface, as well as the other ports that
      might be bridged with it.
    * Treat bridge ports with VLAN upper interfaces as standalone, and let
      forwarding be handled in the software data path.

- with VLAN filtering turned on, these VLAN devices can be created as long as
  the bridge does not have an existing VLAN entry with the same VID on any
  bridge port. These VLAN devices cannot be enslaved into the bridge since they
  duplicate functionality/use case with the bridge's VLAN data path processing.

Non-bridged network ports of the same switch fabric must not be disturbed in any
way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
filtering setting is global to the entire chip, then the standalone ports
should indicate to the network stack that VLAN filtering is required by setting
'rx-vlan-filter: on [fixed]' in the ethtool features.

Because VLAN filtering can be turned on/off at runtime, the switchdev driver
must be able to reconfigure the underlying hardware on the fly to honor the
toggling of that option and behave appropriately. If that is not possible, the
switchdev driver can also refuse to support dynamic toggling of the VLAN
filtering knob at runtime and require a destruction of the bridge device(s) and
creation of new bridge device(s) with a different VLAN filtering value to
ensure VLAN awareness is pushed down to the hardware.

Even when VLAN filtering in the bridge is turned off, the underlying switch
hardware and driver may still configure itself in a VLAN-aware mode provided
that the behavior described above is observed.

The VLAN protocol of the bridge plays a role in deciding whether a packet is
treated as tagged or not: a bridge using the 802.1ad protocol must treat both
VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as
untagged.

The 802.1p (VID 0) tagged packets must be treated in the same way by the device
as untagged packets, since the bridge device does not allow the manipulation of
VID 0 in its database.

When the bridge has VLAN filtering enabled and a PVID is not configured on the
ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
priority-tagged packets must be accepted and forwarded according to the
bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
disabled, the presence/lack of a PVID should not influence the packet
forwarding decision.

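The ingress rules in this section (strict VID checking, VID 0 treated as
untagged, and the role of the PVID) can be condensed into one classification
function.  The sketch below is a userspace model under simplifying
assumptions: it ignores VLAN uppers and the 802.1ad case, uses 0 as the result
for VLAN-unaware processing, and ``struct port_vlan_cfg``, ``VLAN_DROP`` and
``classify_ingress()`` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VID_NONE   0       /* "no PVID configured" in this model */
#define VID_MAX    4094
#define VLAN_DROP  (-1)

/* Hypothetical per-port view of the bridge VLAN configuration. */
struct port_vlan_cfg {
	bool vlan_filtering;       /* bridge vlan_filtering knob */
	uint16_t pvid;             /* VID_NONE if the port has no PVID */
	bool member[VID_MAX + 1];  /* VIDs programmed on this port */
};

/*
 * Classify an ingress frame.  Returns the VID the frame is processed in,
 * 0 when the bridge is VLAN-unaware, or VLAN_DROP when it must be dropped.
 */
static int classify_ingress(const struct port_vlan_cfg *cfg,
			    bool tagged, uint16_t vid)
{
	assert(vid <= VID_MAX);

	/* Filtering off: VLAN database and PVID have no effect. */
	if (!cfg->vlan_filtering)
		return 0;

	/* Untagged and priority-tagged (VID 0) frames follow the PVID. */
	if (!tagged || vid == 0)
		return cfg->pvid != VID_NONE ? cfg->pvid : VLAN_DROP;

	/* Tagged frames: strict VID checking. */
	return cfg->member[vid] ? vid : VLAN_DROP;
}
```

Flipping ``vlan_filtering`` at runtime changes the outcome for the same frame
without touching the VLAN database, which is why the driver has to be able to
reprogram the hardware when the knob is toggled.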

Bridge IGMP snooping
^^^^^^^^^^^^^^^^^^^^

The Linux bridge allows the configuration of IGMP snooping (statically, at
interface creation time, or dynamically, during runtime) which must be observed
by the underlying switchdev network device/hardware in the following way:

- when IGMP snooping is turned off, multicast traffic must be flooded to all
  ports within the same bridge that have mcast_flood=true. The CPU/management
  port should ideally not be flooded (unless the ingress interface has
  IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
  the network stack notifications. If the hardware is not capable of doing that
  then the CPU/management port must also be flooded and multicast filtering
  happens in software.

- when IGMP snooping is turned on, multicast traffic must selectively flow
  to the appropriate network ports (including CPU/management port). Flooding of
  unknown multicast should be only towards the ports connected to a multicast
  router (the local device may also act as a multicast router).

The switch must adhere to RFC 4541 and flood multicast traffic accordingly
since that is what the Linux bridge implementation does.

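The two modes above can be modelled as a function that computes the set of
egress ports for a multicast frame.  This is a simplified userspace sketch:
ports are bits in a 32-bit mask, the CPU/management port is left out, and
forwarding known groups to router ports in addition to group members is an
assumption based on a reading of RFC 4541 rather than something stated in this
document.  ``struct mcast_bridge`` and ``mcast_egress_mask()`` are hypothetical
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-bridge multicast state; bit i represents port i. */
struct mcast_bridge {
	bool snooping;           /* IGMP snooping knob */
	uint32_t members;        /* ports that belong to this bridge */
	uint32_t mcast_flood;    /* ports with mcast_flood=true */
	uint32_t router_ports;   /* ports towards a multicast router */
};

/*
 * Compute the egress port mask for a multicast frame received on in_port.
 * group_ports holds the ports that joined the destination group, or 0 if
 * the group is unknown (unregistered).
 */
static uint32_t mcast_egress_mask(const struct mcast_bridge *br,
				  unsigned int in_port, uint32_t group_ports)
{
	uint32_t out;

	assert(in_port < 32);

	if (!br->snooping)
		out = br->mcast_flood;                  /* flood */
	else if (group_ports)
		out = group_ports | br->router_ports;   /* assumption */
	else
		out = br->router_ports;                 /* unknown group */

	/* Stay inside the bridge and never reflect to the ingress port. */
	return out & br->members & ~(UINT32_C(1) << in_port);
}
```

Toggling ``snooping`` changes the result for the same frame, which illustrates
why the hardware needs to be reprogrammed when the knob changes at runtime.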

Because IGMP snooping can be turned on/off at runtime, the switchdev driver
must be able to reconfigure the underlying hardware on the fly to honor the
toggling of that option and behave appropriately.

A switchdev driver can also refuse to support dynamic toggling of the multicast
snooping knob at runtime and require the destruction of the bridge device(s)
and creation of new bridge device(s) with a different multicast snooping
value.