[PATCH 0/1] [SRU][X/B] i40e PF reset due to incorrect MDD event

Heitor Alves de Siqueira halves at canonical.com
Thu Mar 4 19:51:38 UTC 2021


BugLink: https://bugs.launchpad.net/bugs/1772675

[Impact]
The i40e driver sometimes triggers a "malicious driver detection" (MDD) event
in the firmware, which results in a reset of the NIC. The reset interrupts
network traffic at least temporarily, and can cause further problems, e.g. if
the interface is part of a bond.

[Fix]
MDD events issued for the PF are usually the result of a misconfigured TX
descriptor, not of "bad" actions in the VFs. There is no need to reset the
whole NIC in that case; the TX hang checks should handle any resulting
problems if necessary.

[Test Case]
The bug is unfortunately difficult to reproduce, as there's no detailed
documentation on how the i40e firmware detects and raises MDDs. We have seen
reports of this happening in Xenial and Bionic, for workloads stressing i40e
bonds in LACP mode.
A successful reproduction is easy to detect: network traffic will be
interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
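
When checking test systems for this, the message above can be searched for in
collected kernel logs. The following is only a minimal sketch, not part of the
patch: the function name find_mdd_resets and the regular expression are
illustrative assumptions, and the pattern assumes the message text is exactly
as quoted above, with only the PCI address varying between systems.

```python
import re

# Message logged by the affected driver when it resets the PF after an MDD
# event. The PCI address differs per system, so it is matched generically.
# (Pattern is an assumption based on the single log line quoted above.)
MDD_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_mdd_resets(log_lines):
    """Return the PCI address from every matching log line, in order."""
    hits = []
    for line in log_lines:
        match = MDD_RESET_RE.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits


# Example with the line from this report plus an unrelated, made-up line.
sample = [
    "i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
    "some unrelated kernel message",
]
print(find_mdd_resets(sample))
```

An empty result on a kernel with the fix applied does not by itself prove that
no MDD event occurred, only that no PF reset was logged with this message.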

[Regression Potential]
Since we're removing resets for the NIC, regressions could show up as
connectivity issues after MDD events are raised. If the firmware expects the
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in
networking. The potential for this should, however, be fairly low, as this
patch has been present since kernel 5.2 and hasn't seen any fixes or
regressions upstream. Basic smoke tests also showed that the driver continues
working as expected.

Carolyn Wyborny (1):
  i40e: change behavior on PF in response to MDD event

 drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

-- 
2.30.1



