ACK: [SRU][J:linux][PATCH v3 1/1] net: bridge: switchdev: Skip MDB replays of deferred events on offload
Koichiro Den
koichiro.den at canonical.com
Thu Feb 20 01:58:30 UTC 2025
On Tue, Feb 18, 2025 at 12:23:00PM GMT, Andrei Gherzan wrote:
> From: Tobias Waldekranz <tobias at waldekranz.com>
>
> Before this change, generation of the list of MDB events to replay
> would race against the creation of new group memberships, either from
> the IGMP/MLD snooping logic or from user configuration.
>
> While new memberships are immediately visible to walkers of
> br->mdb_list, the notification of their existence to switchdev event
> subscribers is deferred until a later point in time. So if a replay
> list was generated during a time that overlapped with such a window,
> it would also contain a replay of the not-yet-delivered event.
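>
> Schematically, the window looks like this (an illustrative
> interleaving, not actual function names):
>
>     snooping / user config           replay list generation
>     ----------------------           ----------------------
>     add entry to br->mdb_list
>     enqueue deferred ADD event
>                                      walk br->mdb_list
>                                      queue replay of ADD  <- duplicate
>     deliver deferred ADD
>                                      deliver replayed ADD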
>
> The driver would thus receive two copies of what the bridge internally
> considered to be one single event. On destruction of the bridge, only
> a single membership deletion event was therefore sent. As a
> consequence of this, drivers which reference count memberships (at
> least DSA) would be left with orphan groups in their hardware
> database when the bridge was destroyed.
>
> This is only an issue when replaying additions. While deletion events
> may still be pending on the deferred queue, they will already have
> been removed from br->mdb_list, so no duplicates can be generated in
> that scenario.
>
> To a user this meant that old group memberships, from a bridge in
> which a port was previously attached, could be reanimated (in
> hardware) when the port joined a new bridge, without the new bridge's
> knowledge.
>
> For example, on an mv88e6xxx system, create a snooping bridge and
> immediately add a port to it:
>
> root at infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
> > ip link set dev x3 up master br0
>
> And then destroy the bridge:
>
> root at infix-06-0b-00:~$ ip link del dev br0
> root at infix-06-0b-00:~$ mvls atu
> ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
> DEV:0 Marvell 88E6393X
> 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
> 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
> ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
> root at infix-06-0b-00:~$
>
> The two IPv6 groups remain in the hardware database because the
> port (x3) is notified of the host's membership twice: once via the
> original event and once via a replay. Since only a single delete
> notification is sent, the count remains at 1 when the bridge is
> destroyed.
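>
> Spelled out as a worked trace (a stand-alone illustration of any
> driver that reference counts memberships, such as DSA):
>
>     #include <stdio.h>
>
>     int main(void)
>     {
>         int refcnt = 0;
>
>         refcnt++;   /* original, deferred add notification  */
>         refcnt++;   /* duplicate add from the replay list   */
>         refcnt--;   /* the single del on bridge destruction */
>
>         /* Prints 1: the hardware entry is never removed. */
>         printf("refcnt at teardown: %d\n", refcnt);
>         return 0;
>     }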
>
> Then add the same port (or another port belonging to the same hardware
> domain) to a new bridge, this time with snooping disabled:
>
> root at infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
> > ip link set dev x3 up master br1
>
> All multicast, including the two IPv6 groups from br0, should now be
> flooded, according to the policy of br1. But instead the old
> memberships are still active in the hardware database, causing the
> switch to only forward traffic to those groups towards the CPU (port
> 0).
>
> Eliminate the race in two steps:
>
> 1. Grab the write-side lock of the MDB while generating the replay
> list.
>
> This prevents new memberships from showing up while we are generating
> the replay list. But it leaves the scenario in which a deferred event
> was already generated, but not delivered, before we grabbed the
> lock. Therefore:
>
> 2. Make sure that no deferred version of a replay event is already
> enqueued to the switchdev deferred queue, before adding it to the
> replay list, when replaying additions.
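>
> A minimal user-space sketch of that check (hypothetical names, with a
> flat array standing in for the switchdev deferred queue; the real
> helper is switchdev_port_obj_act_is_deferred() below):
>
>     #include <stdbool.h>
>     #include <stdio.h>
>     #include <string.h>
>
>     enum obj_action { OBJ_ADD, OBJ_DEL };
>
>     struct mdb_obj {
>         unsigned char addr[6];
>         unsigned short vid;
>     };
>
>     struct deferred_item {
>         enum obj_action action;
>         struct mdb_obj obj;
>     };
>
>     /* Events generated but not yet delivered to drivers. */
>     static struct deferred_item deferred[16];
>     static int n_deferred;
>
>     static bool obj_eq(const struct mdb_obj *a,
>                        const struct mdb_obj *b)
>     {
>         return a->vid == b->vid && !memcmp(a->addr, b->addr, 6);
>     }
>
>     /* Is an equivalent action already pending on the queue? */
>     static bool act_is_deferred(enum obj_action action,
>                                 const struct mdb_obj *obj)
>     {
>         int i;
>
>         for (i = 0; i < n_deferred; i++)
>             if (deferred[i].action == action &&
>                 obj_eq(&deferred[i].obj, obj))
>                 return true;
>         return false;
>     }
>
>     static void queue_replay(enum obj_action action,
>                              const struct mdb_obj *obj)
>     {
>         /* Step 2: elide the replay if the original event is
>          * still undelivered, so the driver sees it only once.
>          * Only additions need this; deleted entries are already
>          * gone from the MDB list when deletions are replayed. */
>         if (action == OBJ_ADD && act_is_deferred(action, obj))
>             return;
>         printf("replay %s, vid %u\n",
>                action == OBJ_ADD ? "add" : "del", obj->vid);
>     }
>
>     int main(void)
>     {
>         struct mdb_obj grp = { { 0x33, 0x33, 0, 0, 0, 0x6a }, 1 };
>
>         /* Snooping created a membership: immediately visible in
>          * the MDB, but its notification is still deferred. */
>         deferred[n_deferred++] =
>             (struct deferred_item){ OBJ_ADD, grp };
>
>         queue_replay(OBJ_ADD, &grp);  /* elided: prints nothing */
>         return 0;
>     }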
>
> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
> Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com>
> Reviewed-by: Vladimir Oltean <olteanv at gmail.com>
> Signed-off-by: David S. Miller <davem at davemloft.net>
> (backported from commit dc489f86257cab5056e747344f17a164f63bff4b)
> [agherzan: Upstream, `br_mdb_queue_one` and `br_mdb_replay` were
> moved out of `br_mdb.c`, along with other functions, as part of a
> larger consolidation into `br_switchdev.c`. To avoid a considerable
> restructuring backport, the changes were applied to the original
> file, `br_mdb.c`. This patch also depends on the support for
> differentiating new VLANs from changed ones, but since that
> dependency only matters for comparing two `switchdev_obj_port_vlan`
> objects, the patch was adapted to ignore the `changed` struct
> member.]
> CVE-2024-26837
> Signed-off-by: Andrei Gherzan <andrei.gherzan at canonical.com>
> ---
> include/net/switchdev.h | 3 ++
> net/bridge/br_mdb.c | 78 ++++++++++++++++++++++++---------------
> net/switchdev/switchdev.c | 72 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 123 insertions(+), 30 deletions(-)
>
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index 60d806b6a5ae1..161bdd1d96450 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -277,6 +277,9 @@ void switchdev_deferred_process(void);
> int switchdev_port_attr_set(struct net_device *dev,
> const struct switchdev_attr *attr,
> struct netlink_ext_ack *extack);
> +bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
> + enum switchdev_notifier_type nt,
> + const struct switchdev_obj *obj);
> int switchdev_port_obj_add(struct net_device *dev,
> const struct switchdev_obj *obj,
> struct netlink_ext_ack *extack);
> diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
> index 0281453f77663..f61302e9e88c1 100644
> --- a/net/bridge/br_mdb.c
> +++ b/net/bridge/br_mdb.c
> @@ -620,21 +620,40 @@ static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
> }
>
> static int br_mdb_queue_one(struct list_head *mdb_list,
> + struct net_device *dev,
> + unsigned long action,
> enum switchdev_obj_id id,
> const struct net_bridge_mdb_entry *mp,
> struct net_device *orig_dev)
> {
> - struct switchdev_obj_port_mdb *mdb;
> + struct switchdev_obj_port_mdb mdb = {
> + .obj = {
> + .id = id,
> + .orig_dev = orig_dev,
> + },
> + };
> + struct switchdev_obj_port_mdb *pmdb;
>
> - mdb = kzalloc(sizeof(*mdb), GFP_ATOMIC);
> - if (!mdb)
> - return -ENOMEM;
> + br_switchdev_mdb_populate(&mdb, mp);
> +
> + if (action == SWITCHDEV_PORT_OBJ_ADD &&
> + switchdev_port_obj_act_is_deferred(dev, action, &mdb.obj)) {
> + /* This event is already in the deferred queue of
> + * events, so this replay must be elided, lest the
> + * driver receives duplicate events for it. This can
> + * only happen when replaying additions, since
> + * modifications are always immediately visible in
> + * br->mdb_list, whereas actual event delivery may be
> + * delayed.
> + */
> + return 0;
> + }
>
> - mdb->obj.id = id;
> - mdb->obj.orig_dev = orig_dev;
> - br_switchdev_mdb_populate(mdb, mp);
> - list_add_tail(&mdb->obj.list, mdb_list);
> + pmdb = kmemdup(&mdb, sizeof(mdb), GFP_ATOMIC);
> + if (!pmdb)
> + return -ENOMEM;
>
> + list_add_tail(&pmdb->obj.list, mdb_list);
> return 0;
> }
>
> @@ -662,51 +681,50 @@ int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,
> if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
> return 0;
>
> - /* We cannot walk over br->mdb_list protected just by the rtnl_mutex,
> - * because the write-side protection is br->multicast_lock. But we
> - * need to emulate the [ blocking ] calling context of a regular
> - * switchdev event, so since both br->multicast_lock and RCU read side
> - * critical sections are atomic, we have no choice but to pick the RCU
> - * read side lock, queue up all our events, leave the critical section
> - * and notify switchdev from blocking context.
> + if (adding)
> + action = SWITCHDEV_PORT_OBJ_ADD;
> + else
> + action = SWITCHDEV_PORT_OBJ_DEL;
> +
> + /* br_mdb_queue_one() will take care to not queue a
> + * replay of an event that is already pending in the switchdev
> + * deferred queue. In order to safely determine that, there
> + * must be no new deferred MDB notifications enqueued for the
> + * duration of the MDB scan. Therefore, grab the write-side
> + * lock to avoid racing with any concurrent IGMP/MLD snooping.
> */
> - rcu_read_lock();
> + spin_lock_bh(&br->multicast_lock);
>
> - hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) {
> + hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
> struct net_bridge_port_group __rcu * const *pp;
> const struct net_bridge_port_group *p;
>
> if (mp->host_joined) {
> - err = br_mdb_queue_one(&mdb_list,
> + err = br_mdb_queue_one(&mdb_list, dev, action,
> SWITCHDEV_OBJ_ID_HOST_MDB,
> mp, br_dev);
> if (err) {
> - rcu_read_unlock();
> + spin_unlock_bh(&br->multicast_lock);
> goto out_free_mdb;
> }
> }
>
> - for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
> + for (pp = &mp->ports; (p = mlock_dereference(*pp, br)) != NULL;
> pp = &p->next) {
> if (p->key.port->dev != dev)
> continue;
>
> - err = br_mdb_queue_one(&mdb_list,
> - SWITCHDEV_OBJ_ID_PORT_MDB,
> - mp, dev);
> + err = br_mdb_queue_one(&mdb_list, dev, action,
> + SWITCHDEV_OBJ_ID_PORT_MDB,
> + mp, dev);
> if (err) {
> - rcu_read_unlock();
> + spin_unlock_bh(&br->multicast_lock);
> goto out_free_mdb;
> }
> }
> }
>
> - rcu_read_unlock();
> -
> - if (adding)
> - action = SWITCHDEV_PORT_OBJ_ADD;
> - else
> - action = SWITCHDEV_PORT_OBJ_DEL;
> + spin_unlock_bh(&br->multicast_lock);
>
> list_for_each_entry(obj, &mdb_list, list) {
> err = br_mdb_replay_one(nb, dev, SWITCHDEV_OBJ_PORT_MDB(obj),
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 0b2c18efc0791..d8a6742158ddd 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -19,6 +19,34 @@
> #include <linux/rtnetlink.h>
> #include <net/switchdev.h>
>
> +static bool switchdev_obj_eq(const struct switchdev_obj *a,
> + const struct switchdev_obj *b)
> +{
> + const struct switchdev_obj_port_vlan *va, *vb;
> + const struct switchdev_obj_port_mdb *ma, *mb;
> +
> + if (a->id != b->id || a->orig_dev != b->orig_dev)
> + return false;
> +
> + switch (a->id) {
> + case SWITCHDEV_OBJ_ID_PORT_VLAN:
> + va = SWITCHDEV_OBJ_PORT_VLAN(a);
> + vb = SWITCHDEV_OBJ_PORT_VLAN(b);
> + return va->flags == vb->flags &&
> + va->vid == vb->vid;
> + case SWITCHDEV_OBJ_ID_PORT_MDB:
> + case SWITCHDEV_OBJ_ID_HOST_MDB:
> + ma = SWITCHDEV_OBJ_PORT_MDB(a);
> + mb = SWITCHDEV_OBJ_PORT_MDB(b);
> + return ma->vid == mb->vid &&
> + ether_addr_equal(ma->addr, mb->addr);
> + default:
> + break;
> + }
> +
> + BUG();
> +}
> +
> static LIST_HEAD(deferred);
> static DEFINE_SPINLOCK(deferred_lock);
>
> @@ -306,6 +334,50 @@ int switchdev_port_obj_del(struct net_device *dev,
> }
> EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
>
> +/**
> + * switchdev_port_obj_act_is_deferred - Is object action pending?
> + *
> + * @dev: port device
> + * @nt: type of action; add or delete
> + * @obj: object to test
> + *
> + * Returns true if a deferred item is pending, which is
> + * equivalent to the action @nt on an object @obj.
> + *
> + * rtnl_lock must be held.
> + */
> +bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
> + enum switchdev_notifier_type nt,
> + const struct switchdev_obj *obj)
> +{
> + struct switchdev_deferred_item *dfitem;
> + bool found = false;
> +
> + ASSERT_RTNL();
> +
> + spin_lock_bh(&deferred_lock);
> +
> + list_for_each_entry(dfitem, &deferred, list) {
> + if (dfitem->dev != dev)
> + continue;
> +
> + if ((dfitem->func == switchdev_port_obj_add_deferred &&
> + nt == SWITCHDEV_PORT_OBJ_ADD) ||
> + (dfitem->func == switchdev_port_obj_del_deferred &&
> + nt == SWITCHDEV_PORT_OBJ_DEL)) {
> + if (switchdev_obj_eq((const void *)dfitem->data, obj)) {
> + found = true;
> + break;
> + }
> + }
> + }
> +
> + spin_unlock_bh(&deferred_lock);
> +
> + return found;
> +}
> +EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred);
> +
> static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain);
> static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain);
>
Acked-by: Koichiro Den <koichiro.den at canonical.com>