[PATCH 0/1] [SRU][X/B] i40e PF reset due to incorrect MDD event
Stefan Bader
stefan.bader at canonical.com
Fri Mar 5 08:40:44 UTC 2021
On 04.03.21 20:51, Heitor Alves de Siqueira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1772675
>
> [Impact]
> The i40e driver sometimes triggers a "malicious driver detection" (MDD) event
> in the firmware, which causes the firmware to reset the NIC. The reset
> interrupts network traffic at least temporarily and can cause further
> problems, e.g. if the interface is part of a bond.
>
> [Fix]
> MDD events issued for the PF are usually the result of a misconfigured TX
> descriptor and not of "bad" actions in the VFs. We don't need to reset the
> whole NIC for these; the TX hang checks should handle them if necessary.
>
> [Test Case]
> The bug is unfortunately difficult to reproduce, as there's no detailed
> documentation on how the i40e firmware detects and raises MDDs. We have seen
> reports of this happening in Xenial and Bionic, for workloads stressing i40e
> bonds in LACP mode.
> A successful reproduction is easy to detect, as the network traffic will be
> interrupted and the system logs will contain a message like:
> i40e 0000:02:00.1: TX driver issue detected, PF reset issued
>
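For anyone checking a test machine for the symptom described in the test case,
here is a minimal sketch in Python that scans kernel log lines for the message
quoted above. The function name find_pf_resets and the regular expression are
illustrative assumptions and not part of the patch; the pattern is modelled
only on the single example line in the cover letter, so other kernel versions
may word the message differently.

```python
import re

# Assumption: the message has the shape of the example quoted in the test case,
# "i40e <domain:bus:dev.fn>: TX driver issue detected, PF reset issued".
MDD_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(log_lines):
    """Return the PCI address from every line that reports an MDD-triggered PF reset."""
    hits = []
    for line in log_lines:
        match = MDD_RESET_RE.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits
```

The function takes any iterable of strings, so it can be fed the lines of a
saved dmesg or syslog capture. An empty result only means the message was not
logged in that capture, not that no MDD event occurred.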
> [Regression Potential]
> Since we're removing resets for the NIC, regressions could show up as issues in
> connectivity after the MDD events are raised. If the firmware expects the whole
> NIC to reset, we could see TX/RX hangs and general unresponsiveness in
> networking. The potential for this should however be fairly low, as this patch
> has been present since kernel 5.2 and hasn't seen any fixes or regressions
> upstream. Basic smoke tests also showed that the driver continues working as
> expected.
>
> Carolyn Wyborny (1):
> i40e: change behavior on PF in response to MDD event
>
> drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++----------
> 1 file changed, 2 insertions(+), 10 deletions(-)
>
The change on its own is probably hard to judge. There could be other changes
which look unrelated but are needed to make it work when put together. For that
reason I would hope to see feedback on the test kernel you seem to have prepared
before going ahead.
Minor note: reading the current impact, I would classify this as "medium"
rather than "high". For "high", imo, the system would have to show complete
hangs or crashes.
-Stefan