[PATCH 0/1] [SRU][X/B] i40e PF reset due to incorrect MDD event

Tim Gardner tim.gardner at canonical.com
Fri Mar 5 13:08:52 UTC 2021



On 3/5/21 1:40 AM, Stefan Bader wrote:
> On 04.03.21 20:51, Heitor Alves de Siqueira wrote:
>> BugLink: https://bugs.launchpad.net/bugs/1772675
>>
>> [Impact]
>> The i40e driver sometimes causes a "malicious device" event that the firmware
>> detects, which causes the firmware to reset the NIC, causing an interruption in
>> the network connection - which can cause further problems, e.g. if the interface
>> is in a bond; the reset will at least cause a temporary interruption in network
>> traffic.
>>
>> [Fix]
>> In the case of MDD events issued for the PF, they are usually the result of a
>> misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't
>> need to issue a reset to the whole NIC, TX hang checks should handle those if
>> necessary.
>>
>> [Test Case]
>> The bug is unfortunately difficult to reproduce, as there's no detailed
>> documentation on how the i40e firmware detects and raises MDDs. We have seen
>> reports of this happening in Xenial and Bionic, for workloads stressing i40e
>> bonds in LACP mode.
>> Reproducing is easily detected, as the network traffic will be interrupted and
>> the system logs will contain a message like:
>> i40e 0000:02:00.1: TX driver issue detected, PF reset issued
>>
>> [Regression Potential]
>> Since we're removing resets for the NIC, regressions could show up as issues in
>> connectivity after the MDD events are raised. If the firmware expects the whole
>> NIC to reset, we could see TX/RX hangs and general unresponsiveness in
>> networking. The potential for this should however be fairly low, as this patch
>> has been present since kernel 5.2 and hasn't seen any fixes or regressions
>> upstream. Basic smoke tests also showed that the driver continues working as
>> expected.
>>
>> Carolyn Wyborny (1):
>>    i40e: change behavior on PF in response to MDD event
>>
>>   drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++----------
>>   1 file changed, 2 insertions(+), 10 deletions(-)
>>
> The change on its own is probably hard to judge. There could be other 
> changes which look unrelated but somehow make it work when put together. 
> For that reason I would hope to see feedback to the test kernel you seem 
> to have prepared before going ahead.
> Minor note, reading the current impact I would classify this rather as 
> "medium". For "high" imo the system would have to show complete hangs or 
> crashes.
> 
> -Stefan
> 
> 

I'm with Stefan on this one. The driver in 5.12-rcX takes a pretty big 
hammer to the firmware when one of these TX/RX events is detected, which 
is a rather different approach than this patch takes. In the upstream 
driver a state bit is being set that looks like it eventually forces a 
firmware reset. So, upstream clearly was not happy with the results of 
this patch. I think testing is required.
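
For whoever ends up running the test kernel, here is a minimal Python 
sketch for scanning saved kernel logs for the reset message quoted in 
the test case. The function name and the regex are illustrative only, 
and the PCI address pattern is an assumption based on the single 
example line above.

```python
import re

# Message text taken from the [Test Case] section quoted above.
# The PCI address format (domain:bus:device.function) is assumed
# from that one example line.
PF_RESET = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(lines):
    """Return (line_number, pci_address) for each matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PF_RESET.search(line)
        if match:
            hits.append((number, match.group("pci")))
    return hits
```

Feeding it the lines of a saved dmesg or syslog capture from before and 
after installing the test kernel should show whether the PF resets are 
still being logged.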

rtg

-----------
Tim Gardner
Canonical, Inc


