[SRU][F][PATCH 0/2] Kernel panic with "refcount_t: underflow" in mlx5 driver (LP: 2019011)
frank.heimes at canonical.com
Wed Jun 28 10:04:05 UTC 2023
BugLink: https://bugs.launchpad.net/bugs/2019011
SRU Justification:
[ Impact ]
* The mlx5 driver is causing a Kernel panic with
"refcount_t: underflow".
* This issue occurs during a recovery when the PCI device
is isolated and thus doesn't respond.
[ Fix ]
* This issue got solved upstream with
aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
"net/mlx5: Fix handling of entry refcount when command
is not issued to FW" (upstream since 6.1-rc1)
* Applying aaf2e65cac7f additionally requires a backport of
b898ce7bccf1 b898ce7bccf13087719c021d829dab607c175246
"net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
not accessible" as a prerequisite (upstream since 5.10)
[ Test Plan ]
* An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
is needed that has Mellanox cards (RoCE Express 2.1)
assigned, configured and enabled and that runs a 5.4
kernel with mlx5 driver.
* Create some network traffic on (one of) the RoCE devices
(interface ens???[d?]) for testing (e.g. with stress-ng).
* Make sure the module/driver mlx5 is loaded and in use.
* Trigger a recovery (via the Support Element)
that will render the adapter (ports) unresponsive
for a moment and should provoke a similar situation.
* Alternatively the interface itself can be removed for
a moment and re-added again (but this may break further
things on top).
* Due to the lack of RoCE Express 2.1 hardware,
the verification needs to be done by IBM.
[ Where problems could occur ]
* The modifications are limited to the Mellanox mlx5 driver
only - no other network driver is affected.
* The prerequisite commit (b898ce7bccf1) could have a bad
impact on (re-)claiming pages if FW is not accessible,
which could cause page leaks if done wrong.
But this commit is pretty safe, since it has been upstream
since v5.10.
* The fix itself (aaf2e65cac7f) mainly changes the
cmd_work_handler and mlx5_cmd_comp_handler functions
so that they use mlx5_cmd_is_down (introduced by
b898ce7bccf1) instead of pci_channel_offline.
* Actually b898ce7bccf1 started the change from
pci_channel_offline to mlx5_cmd_is_down,
but it looks like a few cases
(in the area of refcount increase/decrease) were missed,
which are now covered by aaf2e65cac7f.
* On top of that, it ensures that refcounts are now always
properly incremented and decremented, to achieve a
symmetric state for all flows.
* These changes may have an impact on all cases where the
mlx5 device is not responding, which can happen in case
of an offline channel, interface down, reset or recovery.
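The "symmetric state" mentioned above can be illustrated with a small
user-space sketch. This is not the driver code: struct cmd_entry,
entry_get, entry_put and simulate_command are hypothetical names made up
for this example, and a plain int stands in for the kernel's refcount_t.
The unbalanced branch only shows how a generic get/put mismatch on the
"device is down" path ends in an underflow; it does not claim to
reproduce the exact pre-fix code path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a command entry (not the mlx5 struct). */
struct cmd_entry {
    int refcnt;     /* plain int here; the kernel uses refcount_t */
    int status;     /* 0 on success, negative errno on failure */
    int frees;      /* how often the entry was released */
    int underflows; /* how often a put hit a zero refcount */
};

static void entry_get(struct cmd_entry *ent)
{
    ent->refcnt++;
}

static void entry_put(struct cmd_entry *ent)
{
    if (ent->refcnt <= 0) {
        /* This is where refcount_t would warn about an underflow. */
        ent->underflows++;
        return;
    }
    if (--ent->refcnt == 0)
        ent->frees++;
}

/* Completion path: always drops the reference held for completion. */
static void complete_command(struct cmd_entry *ent)
{
    entry_put(ent);
}

/*
 * Simulate one command. With balanced == true the handler takes the
 * completion reference on every path, including "device is down".
 * With balanced == false the down path skips the get, but completion
 * still does the put (assumed mismatch, for illustration only).
 */
static struct cmd_entry simulate_command(bool device_down, bool balanced)
{
    struct cmd_entry ent = { .refcnt = 1 }; /* caller's reference */

    if (!device_down || balanced)
        entry_get(&ent);        /* reference for the completion path */

    if (device_down)
        ent.status = -EIO;      /* command is never issued to FW */

    complete_command(&ent);     /* drops the completion reference */
    entry_put(&ent);            /* caller drops its own reference */
    return ent;
}
```

In this sketch, simulate_command(true, true) releases the entry exactly
once with no underflow, while simulate_command(true, false) releases it
in the completion path and then underflows on the caller's put.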
[ Other Info ]
* A look at the master-next git trees for jammy, kinetic
and lunar showed that both fixes are already included,
hence only focal is affected.
Moshe Shemesh (1):
net/mlx5: Fix handling of entry refcount when command is not issued to
FW
Saeed Mahameed (1):
net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 23 ++++++++++---------
.../ethernet/mellanox/mlx5/core/pagealloc.c | 2 +-
include/linux/mlx5/driver.h | 1 +
3 files changed, 14 insertions(+), 12 deletions(-)
--
2.25.1