APPLIED: [SRU][F][PATCH 0/2] Kernel panic with "refcount_t: underflow" in mlx5 driver (LP: 2019011)

Roxana Nicolescu roxana.nicolescu at canonical.com
Fri Jul 7 13:52:44 UTC 2023


On 28/06/2023 12:04, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2019011
>
> SRU Justification:
>
> [ Impact ]
>
>   * The mlx5 driver is causing a Kernel panic with
>     "refcount_t: underflow".
>
>   * This issue occurs during a recovery when the PCI device
>     is isolated and thus doesn't respond.
>
> [ Fix ]
>
>   * This issue got solved upstream with
>     aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
>     "net/mlx5: Fix handling of entry refcount when command
>     is not issued to FW" (upstream since 6.1-rc1)
>
>   * But to get aaf2e65cac7f a backport of b898ce7bccf1
>     b898ce7bccf13087719c021d829dab607c175246
>     "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
>     not accessible" is required on top (upstream since 5.10)
>
> [ Test Plan ]
>
>   * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
>     is needed that has Mellanox cards (RoCE Express 2.1)
>     assigned, configured and enabled and that runs a 5.4
>     kernel with mlx5 driver.
>
>   * Create some network traffic on (one of) the RoCE devices
>     (interface ens???[d?]) for testing (e.g. with stress-ng).
>
>   * Make sure the module/driver mlx5 is loaded and in use.
>
>   * Trigger a recovery (via the Support Element)
>     that will render the adapter (ports) unresponsive
>     for a moment and should provoke a similar situation.
>
>   * Alternatively the interface itself can be removed for
>     a moment and re-added again (but this may break further
>     things on top).
>
>   * Due to the lack of RoCE Express 2.1 hardware,
>     the verification has to be done by IBM.
>
> [ Where problems could occur ]
>
>   * The modifications are limited to the Mellanox mlx5 driver
>     only - no other network driver is affected.
>
>   * The pre-required commit (b898ce7bccf1) can have a bad
>     impact on (re-)claiming pages if FW is not accessible,
>     which could cause page leaks if done wrong.
>     But this commit is pretty safe since it has been upstream
>     since v5.10.
>
>   * The fix itself (aaf2e65cac7f) mainly changes the
>     cmd_work_handler and mlx5_cmd_comp_handler functions
>     so that they use mlx5_cmd_is_down (introduced by
>     b898ce7bccf1) instead of pci_channel_offline.
>
>   * Actually b898ce7bccf1 started the change from
>     pci_channel_offline to mlx5_cmd_is_down,
>     but it looks like a few cases
>     (in the area of refcount increase/decrease) were missed,
>     which are now covered by aaf2e65cac7f.
>
>   * On top of that, it makes sure that refcounts are now
>     always properly incremented and decremented to achieve
>     a symmetric state for all flows.
>
>   * These changes may have an impact on all cases where the
>     mlx5 device is not responding, which can happen in case
>     of an offline channel, interface down, reset or recovery.
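
The symmetric refcount handling described above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions, not
the driver code: struct cmd_entry, struct cmd_dev, cmd_is_down(),
work_handler(), comp_handler() and run_flow() are hypothetical stand-ins, and
the exact conditions checked by the real mlx5_cmd_is_down are not reproduced
here. The point it shows is that the reference for a pending completion is
taken and dropped under the same "is the command interface down" decision,
so the count cannot underflow when the command is never issued to FW.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the mlx5 driver's real types. */
struct cmd_entry {
    int refcnt;   /* models the command entry's reference count */
    bool issued;  /* true only if the command was handed to "firmware" */
};

struct cmd_dev {
    bool pci_offline; /* models what pci_channel_offline would report */
    bool other_down;  /* assumed: any further "interface down" condition */
};

/* One broad check used by every flow (the idea behind mlx5_cmd_is_down). */
static bool cmd_is_down(const struct cmd_dev *dev)
{
    return dev->pci_offline || dev->other_down;
}

static void ent_get(struct cmd_entry *ent)
{
    ent->refcnt++;
}

static void ent_put(struct cmd_entry *ent)
{
    assert(ent->refcnt > 0); /* would be the "refcount_t: underflow" case */
    ent->refcnt--;
}

/* Takes a completion reference only if the command is really issued. */
static void work_handler(const struct cmd_dev *dev, struct cmd_entry *ent)
{
    if (cmd_is_down(dev)) {
        ent->issued = false; /* not issued: no extra reference taken */
        return;
    }
    ent_get(ent);
    ent->issued = true;
}

/* Drops the completion reference only if one was taken above. */
static void comp_handler(struct cmd_entry *ent)
{
    if (ent->issued)
        ent_put(ent);
}

/* Runs one command flow and returns the final reference count. */
int run_flow(bool pci_offline, bool other_down)
{
    struct cmd_dev dev = { pci_offline, other_down };
    struct cmd_entry ent = { 1, false }; /* caller holds one reference */

    work_handler(&dev, &ent);
    comp_handler(&ent);
    ent_put(&ent); /* caller drops its own reference */
    return ent.refcnt;
}
```

In this model run_flow() ends with a count of zero for every combination of
the two flags. If the two handlers used different checks (one only looking at
pci_offline, the other at the broader condition), a put without a matching
get would become possible, which is the kind of asymmetry the fix addresses.
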
>
> [ Other Info ]
>
>   * A look at the master-next git trees for jammy, kinetic
>     and lunar showed that both fixes are already included,
>     hence only focal is affected.
>
> Moshe Shemesh (1):
>    net/mlx5: Fix handling of entry refcount when command is not issued to
>      FW
>
> Saeed Mahameed (1):
>    net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible
>
>   drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 23 ++++++++++---------
>   .../ethernet/mellanox/mlx5/core/pagealloc.c   |  2 +-
>   include/linux/mlx5/driver.h                   |  1 +
>   3 files changed, 14 insertions(+), 12 deletions(-)
>
Applied to focal:master-next. Thanks!

Roxana


