[SRU][F][PATCH 0/2] Kernel panic with "refcount_t: underflow" in mlx5 driver (LP: 2019011)

frank.heimes at canonical.com
Wed Jun 28 10:04:05 UTC 2023


BugLink: https://bugs.launchpad.net/bugs/2019011

SRU Justification:

[ Impact ]

 * The mlx5 driver is causing a Kernel panic with
   "refcount_t: underflow".

 * This issue occurs during a recovery when the PCI device
   is isolated and thus doesn't respond.

[ Fix ]

 * This issue got solved upstream with
   aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
   "net/mlx5: Fix handling of entry refcount when command
   is not issued to FW" (upstream since 6.1-rc1)

 * Applying aaf2e65cac7f additionally requires a backport
   of its prerequisite b898ce7bccf1
   b898ce7bccf13087719c021d829dab607c175246
   "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
   not accessible" (upstream since 5.10)

[ Test Plan ]

 * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
   is needed that has Mellanox cards (RoCE Express 2.1)
   assigned, configured and enabled and that runs a 5.4
   kernel with mlx5 driver.

 * Create some network traffic on (one of) the RoCE devices
   (interface ens???[d?]) for testing (e.g. with stress-ng).

 * Make sure the module/driver mlx5 is loaded and in use.

 * Trigger a recovery (via the Support Element)
   that will render the adapter (ports) unresponsive
   for a moment and should provoke a similar situation.

 * Alternatively, the interface itself can be removed for
   a moment and then re-added (but this may break further
   things on top).

 * Due to the lack of RoCE Express 2.1 hardware,
   the verification needs to be done by IBM.
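
The "module is loaded and in use" step of the test plan can
be checked by reading /proc/modules. The following is only a
minimal sketch: the helper name module_use_count is made up
for illustration, and it assumes the driver shows up under
the module name mlx5_core.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any existing tool).
 * Scans a stream in /proc/modules format, where each line starts with
 * "<name> <size> <use count> ...", and returns the use count of the
 * module called `name`, or -1 if the module is not listed.
 */
static long module_use_count(FILE *fp, const char *name)
{
	char line[512];

	assert(fp != NULL && name != NULL);

	while (fgets(line, sizeof(line), fp)) {
		char mod[128];
		unsigned long size;
		long refs;

		/* Only the first three fields are needed here. */
		if (sscanf(line, "%127s %lu %ld", mod, &size, &refs) == 3 &&
		    strcmp(mod, name) == 0)
			return refs;
	}
	return -1;
}
```

Called with a stream opened via fopen("/proc/modules", "r")
and the name "mlx5_core" (an assumption, see above), a
result of -1 means the module is not loaded, and a result
greater than 0 means other modules or users hold a
reference to it.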

[ Where problems could occur ]

 * The modifications are limited to the Mellanox mlx5 driver
   only - no other network driver is affected.

 * The prerequisite commit (b898ce7bccf1) can have a bad
   impact on (re-)claiming pages if FW is not accessible,
   which could cause page leaks if done wrong.
   But this commit is considered safe, since it has been
   upstream since v5.10.

 * The fix itself (aaf2e65cac7f) mainly changes the
   cmd_work_handler and mlx5_cmd_comp_handler functions
   so that they use mlx5_cmd_is_down (introduced by
   b898ce7bccf1) instead of pci_channel_offline.

 * Actually b898ce7bccf1 already started changing from
   pci_channel_offline to mlx5_cmd_is_down,
   but it looks like a few cases
   (in the area of refcount increase/decrease) were missed,
   which are now covered by aaf2e65cac7f.

 * On top of that, refcounts are now always properly
   incremented and decremented to achieve a symmetric state
   for all flows.

 * These changes may have an impact on all cases where the
   mlx5 device is not responding, which can happen in case
   of an offline channel, interface down, reset or recovery.
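
The symmetric refcount handling described above can be
illustrated with a small userspace model. This is only a
sketch and not the driver code: all model_* names are
hypothetical, and the exact conditions checked by the
"command interface is down" predicate (beyond the offline
PCI channel) are an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; these types do not exist in the kernel. */
struct model_dev {
	bool pci_channel_offline; /* stands in for pci_channel_offline() */
	bool cmdif_up;            /* assumed: command interface state */
	bool internal_error;      /* assumed: device health state */
};

struct model_ent {
	int  refcnt;    /* stands in for a refcount_t */
	bool underflow; /* set when a put has no matching get */
};

/* Assumed shape of a broader "command path is down" check. */
static bool model_cmd_is_down(const struct model_dev *dev)
{
	return dev->pci_channel_offline || !dev->cmdif_up ||
	       dev->internal_error;
}

static void model_ent_get(struct model_ent *ent)
{
	ent->refcnt++;
}

static void model_ent_put(struct model_ent *ent)
{
	if (ent->refcnt == 0) {
		ent->underflow = true; /* the situation the panic reports */
		return;
	}
	ent->refcnt--;
}

/* Returns 0 if the command was issued to FW, -1 otherwise. */
static int model_issue(const struct model_dev *dev, struct model_ent *ent)
{
	model_ent_get(ent); /* reference for the completion path */
	if (model_cmd_is_down(dev)) {
		/* No completion will arrive, so drop that reference here. */
		model_ent_put(ent);
		return -1;
	}
	return 0;
}

/* Completion path: releases the reference taken in model_issue(). */
static void model_complete(struct model_ent *ent)
{
	model_ent_put(ent);
}

/* Runs one command; returns the final refcount, or -1 on underflow. */
static int model_run(const struct model_dev *dev)
{
	struct model_ent ent = { .refcnt = 1, .underflow = false };

	if (model_issue(dev, &ent) == 0)
		model_complete(&ent);
	model_ent_put(&ent); /* the caller drops its own reference */

	assert(ent.refcnt >= 0);
	return ent.underflow ? -1 : ent.refcnt;
}
```

In this model every get has exactly one matching put,
regardless of whether the command reaches FW, so
model_run() ends with a refcount of 0 for both a
responsive and an isolated device. An underflow would only
appear if one flow dropped a reference that another flow
had never taken.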

[ Other Info ]

 * A look at the master-next git trees for jammy, kinetic
   and lunar showed that both fixes are already included,
   hence only focal is affected.

Moshe Shemesh (1):
  net/mlx5: Fix handling of entry refcount when command is not issued to
    FW

Saeed Mahameed (1):
  net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 23 ++++++++++---------
 .../ethernet/mellanox/mlx5/core/pagealloc.c   |  2 +-
 include/linux/mlx5/driver.h                   |  1 +
 3 files changed, 14 insertions(+), 12 deletions(-)

-- 
2.25.1