APPLIED: [PATCH v2 0/1] [SRU][X/B] i40e PF reset due to incorrect MDD event

Kelsey Skunberg kelsey.skunberg at canonical.com
Wed Mar 10 23:48:27 UTC 2021


Applied to X/B master-next. Thank you!

-Kelsey

On 2021-03-10 16:57:01 , Heitor Alves de Siqueira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1772675
> 
> This v2 has an updated test procedure and backport notes on the Xenial patch.
> Bionic is now correctly marked as a cherry pick.
> 
> [Impact]
> The i40e driver sometimes triggers a Malicious Driver Detection (MDD) event in
> the firmware, which responds by resetting the NIC. The reset causes at least a
> temporary interruption in network traffic and can lead to further problems,
> e.g. if the interface is part of a bond.
> 
> [Fix]
> MDD events issued for the PF are usually the result of a misconfigured TX
> descriptor and not due to "bad" actions in the VFs. We don't need to reset the
> whole NIC for these events; the TX hang checks should handle them if necessary.
> 
> [Test Procedure]
> The bug is unfortunately difficult to reproduce, as there's no detailed
> documentation on how the i40e firmware detects and raises MDDs. We have seen
> reports of this happening in Xenial and Bionic, for workloads stressing i40e
> bonds in LACP mode.
> A successful reproduction is easy to detect, as the network traffic will be
> interrupted and the system logs will contain a message like:
> i40e 0000:02:00.1: TX driver issue detected, PF reset issued
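
Checking for that message can be scripted. The following is only a minimal
sketch: the function name find_pf_resets is hypothetical, and the pattern
assumes the message text and the PCI address layout (domain:bus:device.function)
match the example line quoted above.

```python
import re

# Pattern built from the example log line in the cover letter. The exact
# wording may differ between kernel versions, so treat it as an assumption.
MDD_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(log_lines):
    """Return the PCI addresses of i40e devices that logged a PF reset."""
    hits = []
    for line in log_lines:
        match = MDD_RESET_RE.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits
```

Feeding it the lines of the kernel log (for example, the output of dmesg split
into lines) gives one entry per reset message, so an empty result after a test
run suggests no PF reset was issued.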
> 
> An alternative test procedure makes use of the kprobes attached to the LP bug.
> The test setup is as follows:
> - Create 2 VFs on primary NIC
> - Passthrough VF 1 to a Bionic VM
> - Start iperf3 client on VM, going through i40evf interface
> - Start another iperf3 client on host, going through i40e interface
> Both iperf3 clients should be using an external server located on a separate
> host. By loading the kprobe module while iperf3 is running, we should be able to
> raise MDDs more consistently. MDD behaviour can change according to firmware
> version, so we may need to try with different sets of probes. The one with the
> most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the
> cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified
> of new data.
> 
> [Regression Potential]
> Since we're removing resets for the NIC, regressions could show up as issues in
> connectivity after the MDD events are raised. If the firmware expects the whole
> NIC to reset, we could see TX/RX hangs and general unresponsiveness in
> networking. The potential for this should be fairly low, however, as this patch
> has been upstream since kernel 5.2 and hasn't seen any follow-up fixes or
> regression reports. Basic smoke tests also showed that the driver continues
> working as expected, and that necessary PF resets will be issued by the netdev
> watchdog in case of any hung queues.
> 
> Carolyn Wyborny (1):
>   i40e: change behavior on PF in response to MDD event
> 
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> -- 
> 2.30.1