ACK: [SRU][J:linux][PATCH v3 1/1] net: bridge: switchdev: Skip MDB replays of deferred events on offload
Koichiro Den
koichiro.den at canonical.com
Thu Feb 20 01:58:30 UTC 2025
On Tue, Feb 18, 2025 at 12:23:00PM GMT, Andrei Gherzan wrote:
> From: Tobias Waldekranz <tobias at waldekranz.com>
>
> Before this change, generation of the list of MDB events to replay
> would race against the creation of new group memberships, either from
> the IGMP/MLD snooping logic or from user configuration.
>
> While new memberships are immediately visible to walkers of
> br->mdb_list, the notification of their existence to switchdev event
> subscribers is deferred until a later point in time. So if a replay
> list was generated during a time that overlapped with such a window,
> it would also contain a replay of the not-yet-delivered event.
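>
> Schematically, the window looks like this (an illustrative
> interleaving, not actual function names):
>
>     snooping / user config           replay list generation
>     ----------------------           ----------------------
>     add entry to br->mdb_list
>     enqueue deferred ADD event
>                                      walk br->mdb_list
>                                      queue replay of ADD  <- duplicate
>     deliver deferred ADD
>                                      deliver replayed ADD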
>
> The driver would thus receive two copies of what the bridge internally
> considered to be one single event. On destruction of the bridge, only
> a single membership deletion event was therefore sent. As a
> consequence of this, drivers which reference count memberships (at
> least DSA) would be left with orphan groups in their hardware
> database when the bridge was destroyed.
>
> This is only an issue when replaying additions. While deletion events
> may still be pending on the deferred queue, they will already have
> been removed from br->mdb_list, so no duplicates can be generated in
> that scenario.
>
> To a user this meant that old group memberships, from a bridge in
> which a port was previously attached, could be reanimated (in
> hardware) when the port joined a new bridge, without the new bridge's
> knowledge.
>
> For example, on an mv88e6xxx system, create a snooping bridge and
> immediately add a port to it:
>
> root at infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
> > ip link set dev x3 up master br0
>
> And then destroy the bridge:
>
> root at infix-06-0b-00:~$ ip link del dev br0
> root at infix-06-0b-00:~$ mvls atu
> ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
> DEV:0 Marvell 88E6393X
> 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
> 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
> ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
> root at infix-06-0b-00:~$
>
> The two IPv6 groups remain in the hardware database because the
> port (x3) is notified of the host's membership twice: once via the
> original event and once via a replay. Since only a single delete
> notification is sent, the count remains at 1 when the bridge is
> destroyed.
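>
> Spelled out as a worked trace (a stand-alone illustration of any
> driver that reference counts memberships, such as DSA):
>
>     #include <stdio.h>
>
>     int main(void)
>     {
>         int refcnt = 0;
>
>         refcnt++;   /* original, deferred add notification  */
>         refcnt++;   /* duplicate add from the replay list   */
>         refcnt--;   /* the single del on bridge destruction */
>
>         /* Prints 1: the hardware entry is never removed. */
>         printf("refcnt at teardown: %d\n", refcnt);
>         return 0;
>     }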
>
> Then add the same port (or another port belonging to the same hardware
> domain) to a new bridge, this time with snooping disabled:
>
> root at infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
> > ip link set dev x3 up master br1
>
> All multicast, including the two IPv6 groups from br0, should now be
> flooded, according to the policy of br1. But instead the old
> memberships are still active in the hardware database, causing the
> switch to only forward traffic to those groups towards the CPU (port
> 0).
>
> Eliminate the race in two steps:
>
> 1. Grab the write-side lock of the MDB while generating the replay
> list.
>
> This prevents new memberships from showing up while we are generating
> the replay list. But it leaves the scenario in which a deferred event
> was already generated, but not delivered, before we grabbed the
> lock. Therefore:
>
> 2. Make sure that no deferred version of a replay event is already
> enqueued to the switchdev deferred queue, before adding it to the
> replay list, when replaying additions.
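>
> A minimal user-space sketch of that check (hypothetical names, with a
> flat array standing in for the switchdev deferred queue; the real
> helper is switchdev_port_obj_act_is_deferred() below):
>
>     #include <stdbool.h>
>     #include <stdio.h>
>     #include <string.h>
>
>     enum obj_action { OBJ_ADD, OBJ_DEL };
>
>     struct mdb_obj {
>         unsigned char addr[6];
>         unsigned short vid;
>     };
>
>     struct deferred_item {
>         enum obj_action action;
>         struct mdb_obj obj;
>     };
>
>     /* Events generated but not yet delivered to drivers. */
>     static struct deferred_item deferred[16];
>     static int n_deferred;
>
>     static bool obj_eq(const struct mdb_obj *a,
>                        const struct mdb_obj *b)
>     {
>         return a->vid == b->vid && !memcmp(a->addr, b->addr, 6);
>     }
>
>     /* Is an equivalent action already pending on the queue? */
>     static bool act_is_deferred(enum obj_action action,
>                                 const struct mdb_obj *obj)
>     {
>         int i;
>
>         for (i = 0; i < n_deferred; i++)
>             if (deferred[i].action == action &&
>                 obj_eq(&deferred[i].obj, obj))
>                 return true;
>         return false;
>     }
>
>     static void queue_replay(enum obj_action action,
>                              const struct mdb_obj *obj)
>     {
>         /* Step 2: elide the replay if the original event is
>          * still undelivered, so the driver sees it only once.
>          * Only additions need this; deleted entries are already
>          * gone from the MDB list when deletions are replayed. */
>         if (action == OBJ_ADD && act_is_deferred(action, obj))
>             return;
>         printf("replay %s, vid %u\n",
>                action == OBJ_ADD ? "add" : "del", obj->vid);
>     }
>
>     int main(void)
>     {
>         struct mdb_obj grp = { { 0x33, 0x33, 0, 0, 0, 0x6a }, 1 };
>
>         /* Snooping created a membership: immediately visible in
>          * the MDB, but its notification is still deferred. */
>         deferred[n_deferred++] =
>             (struct deferred_item){ OBJ_ADD, grp };
>
>         queue_replay(OBJ_ADD, &grp);  /* elided: prints nothing */
>         return 0;
>     }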
>
> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
> Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com>
> Reviewed-by: Vladimir Oltean <olteanv at gmail.com>
> Signed-off-by: David S. Miller <davem at davemloft.net>
> (backported from commit dc489f86257cab5056e747344f17a164f63bff4b)
> [agherzan: Upstream, `br_mdb_queue_one` and `br_mdb_replay` were
> moved out of `br_mdb.c`, along with other functions, as part of a
> larger consolidation into `br_switchdev.c`. To avoid a considerable
> restructuring backport, the changes were applied to the original
> file, `br_mdb.c`. This patch also depends on the support for
> differentiating new VLANs from changed ones, but since that
> dependency only matters for comparing two `switchdev_obj_port_vlan`
> objects, the patch was adapted to ignore the `changed` struct
> member.]
> CVE-2024-26837
> Signed-off-by: Andrei Gherzan <andrei.gherzan at canonical.com>
> ---
> include/net/switchdev.h | 3 ++
> net/bridge/br_mdb.c | 78 ++++++++++++++++++++++++---------------
> net/switchdev/switchdev.c | 72 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 123 insertions(+), 30 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index 60d806b6a5ae1..161bdd1d96450 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -277,6 +277,9 @@ void switchdev_deferred_process(void);
> int switchdev_port_attr_set(struct net_device *dev,
> const struct switchdev_attr *attr,
> struct netlink_ext_ack *extack);
> +bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
> + enum switchdev_notifier_type nt,
> + const struct switchdev_obj *obj);
> int switchdev_port_obj_add(struct net_device *dev,
> const struct switchdev_obj *obj,
> struct netlink_ext_ack *extack);
> diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
> index 0281453f77663..f61302e9e88c1 100644
> --- a/net/bridge/br_mdb.c
> +++ b/net/bridge/br_mdb.c
> @@ -620,21 +620,40 @@ static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
> }
>
> static int br_mdb_queue_one(struct list_head *mdb_list,
> + struct net_device *dev,
> + unsigned long action,
> enum switchdev_obj_id id,
> const struct net_bridge_mdb_entry *mp,
> struct net_device *orig_dev)
> {
> - struct switchdev_obj_port_mdb *mdb;
> + struct switchdev_obj_port_mdb mdb = {
> + .obj = {
> + .id = id,
> + .orig_dev = orig_dev,
> + },
> + };
> + struct switchdev_obj_port_mdb *pmdb;
>
> - mdb = kzalloc(sizeof(*mdb), GFP_ATOMIC);
> - if (!mdb)
> - return -ENOMEM;
> + br_switchdev_mdb_populate(&mdb, mp);
> +
> + if (action == SWITCHDEV_PORT_OBJ_ADD &&
> + switchdev_port_obj_act_is_deferred(dev, action, &mdb.obj)) {
> + /* This event is already in the deferred queue of
> + * events, so this replay must be elided, lest the
> + * driver receives duplicate events for it. This can
> + * only happen when replaying additions, since
> + * modifications are always immediately visible in
> + * br->mdb_list, whereas actual event delivery may be
> + * delayed.
> + */
> + return 0;
> + }
>
> - mdb->obj.id = id;
> - mdb->obj.orig_dev = orig_dev;
> - br_switchdev_mdb_populate(mdb, mp);
> - list_add_tail(&mdb->obj.list, mdb_list);
> + pmdb = kmemdup(&mdb, sizeof(mdb), GFP_ATOMIC);
> + if (!pmdb)
> + return -ENOMEM;
>
> + list_add_tail(&pmdb->obj.list, mdb_list);
> return 0;
> }
>
> @@ -662,51 +681,50 @@ int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,
> if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
> return 0;
>
> - /* We cannot walk over br->mdb_list protected just by the rtnl_mutex,
> - * because the write-side protection is br->multicast_lock. But we
> - * need to emulate the [ blocking ] calling context of a regular
> - * switchdev event, so since both br->multicast_lock and RCU read side
> - * critical sections are atomic, we have no choice but to pick the RCU
> - * read side lock, queue up all our events, leave the critical section
> - * and notify switchdev from blocking context.
> + if (adding)
> + action = SWITCHDEV_PORT_OBJ_ADD;
> + else
> + action = SWITCHDEV_PORT_OBJ_DEL;
> +
> + /* br_mdb_queue_one() will take care to not queue a
> + * replay of an event that is already pending in the switchdev
> + * deferred queue. In order to safely determine that, there
> + * must be no new deferred MDB notifications enqueued for the
> + * duration of the MDB scan. Therefore, grab the write-side
> + * lock to avoid racing with any concurrent IGMP/MLD snooping.
> */
> - rcu_read_lock();
> + spin_lock_bh(&br->multicast_lock);
>
> - hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) {
> + hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
> struct net_bridge_port_group __rcu * const *pp;
> const struct net_bridge_port_group *p;
>
> if (mp->host_joined) {
> - err = br_mdb_queue_one(&mdb_list,
> + err = br_mdb_queue_one(&mdb_list, dev, action,
> SWITCHDEV_OBJ_ID_HOST_MDB,
> mp, br_dev);
> if (err) {
> - rcu_read_unlock();
> + spin_unlock_bh(&br->multicast_lock);
> goto out_free_mdb;
> }
> }
>
> - for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
> + for (pp = &mp->ports; (p = mlock_dereference(*pp, br)) != NULL;
> pp = &p->next) {
> if (p->key.port->dev != dev)
> continue;
>
> - err = br_mdb_queue_one(&mdb_list,
> - SWITCHDEV_OBJ_ID_PORT_MDB,
> - mp, dev);
> + err = br_mdb_queue_one(&mdb_list, dev, action,
> + SWITCHDEV_OBJ_ID_PORT_MDB,
> + mp, dev);
> if (err) {
> - rcu_read_unlock();
> + spin_unlock_bh(&br->multicast_lock);
> goto out_free_mdb;
> }
> }
> }
>
> - rcu_read_unlock();
> -
> - if (adding)
> - action = SWITCHDEV_PORT_OBJ_ADD;
> - else
> - action = SWITCHDEV_PORT_OBJ_DEL;
> + spin_unlock_bh(&br->multicast_lock);
>
> list_for_each_entry(obj, &mdb_list, list) {
> err = br_mdb_replay_one(nb, dev, SWITCHDEV_OBJ_PORT_MDB(obj),
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 0b2c18efc0791..d8a6742158ddd 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -19,6 +19,34 @@
> #include <linux/rtnetlink.h>
> #include <net/switchdev.h>
>
> +static bool switchdev_obj_eq(const struct switchdev_obj *a,
> + const struct switchdev_obj *b)
> +{
> + const struct switchdev_obj_port_vlan *va, *vb;
> + const struct switchdev_obj_port_mdb *ma, *mb;
> +
> + if (a->id != b->id || a->orig_dev != b->orig_dev)
> + return false;
> +
> + switch (a->id) {
> + case SWITCHDEV_OBJ_ID_PORT_VLAN:
> + va = SWITCHDEV_OBJ_PORT_VLAN(a);
> + vb = SWITCHDEV_OBJ_PORT_VLAN(b);
> + return va->flags == vb->flags &&
> + va->vid == vb->vid;
> + case SWITCHDEV_OBJ_ID_PORT_MDB:
> + case SWITCHDEV_OBJ_ID_HOST_MDB:
> + ma = SWITCHDEV_OBJ_PORT_MDB(a);
> + mb = SWITCHDEV_OBJ_PORT_MDB(b);
> + return ma->vid == mb->vid &&
> + ether_addr_equal(ma->addr, mb->addr);
> + default:
> + break;
> + }
> +
> + BUG();
> +}
> +
> static LIST_HEAD(deferred);
> static DEFINE_SPINLOCK(deferred_lock);
>
> @@ -306,6 +334,50 @@ int switchdev_port_obj_del(struct net_device *dev,
> }
> EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
>
> +/**
> + * switchdev_port_obj_act_is_deferred - Is object action pending?
> + *
> + * @dev: port device
> + * @nt: type of action; add or delete
> + * @obj: object to test
> + *
> + * Returns true if a deferred item is pending, which is
> + * equivalent to the action @nt on an object @obj.
> + *
> + * rtnl_lock must be held.
> + */
> +bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
> + enum switchdev_notifier_type nt,
> + const struct switchdev_obj *obj)
> +{
> + struct switchdev_deferred_item *dfitem;
> + bool found = false;
> +
> + ASSERT_RTNL();
> +
> + spin_lock_bh(&deferred_lock);
> +
> + list_for_each_entry(dfitem, &deferred, list) {
> + if (dfitem->dev != dev)
> + continue;
> +
> + if ((dfitem->func == switchdev_port_obj_add_deferred &&
> + nt == SWITCHDEV_PORT_OBJ_ADD) ||
> + (dfitem->func == switchdev_port_obj_del_deferred &&
> + nt == SWITCHDEV_PORT_OBJ_DEL)) {
> + if (switchdev_obj_eq((const void *)dfitem->data, obj)) {
> + found = true;
> + break;
> + }
> + }
> + }
> +
> + spin_unlock_bh(&deferred_lock);
> +
> + return found;
> +}
> +EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred);
> +
> static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain);
> static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain);
>
Acked-by: Koichiro Den <koichiro.den at canonical.com>