[PATCH 0/1] [SRU][X/B] i40e PF reset due to incorrect MDD event

Heitor Alves de Siqueira halves at canonical.com
Fri Mar 5 14:07:29 UTC 2021


Thanks for the detailed feedback, Tim, Stefan!

I understand the concerns, and think it's pretty reasonable to expect more
testing on this patchset. I did some basic smoke testing with VF passthrough,
but it's true that this doesn't really stress the MDD-related areas of the
driver.

I'm actively working with some Ubuntu users that have reported issues with MDD
events on Xenial, so hopefully we'll be able to do more thorough testing before
the next release cycle starts on the 10th. I'll also re-evaluate the patchset in
light of the changes mentioned in 5.12-rc, so that we have more data to
understand whether this should be the proper solution to include in our kernels.
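
For anyone helping with the testing, here is a minimal sketch of how the reset
could be spotted in collected logs. It only assumes the message format quoted in
the [Test Case] section below; the function name find_pf_resets is made up for
illustration and is not part of any existing tool.

```python
import re
from typing import Iterable, List

# Pattern derived from the log message quoted in the [Test Case] section.
# The captured group is the PCI address of the affected device.
PF_RESET_RE = re.compile(r"i40e (\S+): TX driver issue detected, PF reset issued")


def find_pf_resets(lines: Iterable[str]) -> List[str]:
    """Return the PCI address from each line that reports a PF reset."""
    hits = []
    for line in lines:
        match = PF_RESET_RE.search(line)
        if match:
            hits.append(match.group(1))
    return hits
```

It can be fed any iterable of log lines, e.g. an open file object for a saved
dmesg or syslog capture. An empty result after a stress run on a patched kernel
does not prove the fix, but a non-empty one confirms a reproduction.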

Thanks,
Heitor

On Fri, Mar 5, 2021 at 10:08 AM Tim Gardner <tim.gardner at canonical.com> wrote:
>
>
>
> On 3/5/21 1:40 AM, Stefan Bader wrote:
> > On 04.03.21 20:51, Heitor Alves de Siqueira wrote:
> >> BugLink: https://bugs.launchpad.net/bugs/1772675
> >>
> >> [Impact]
> >> The i40e driver sometimes causes a "malicious device" event that the
> >> firmware detects, which causes the firmware to reset the NIC, causing
> >> an interruption in the network connection - which can cause further
> >> problems, e.g. if the interface is in a bond; the reset will at least
> >> cause a temporary interruption in network traffic.
> >>
> >> [Fix]
> >> In the case of MDD events issued for the PF, they are usually the
> >> result of a misconfigured TX descriptor and not due to "bad" actions in
> >> the VFs. We don't need to issue a reset to the whole NIC, TX hang
> >> checks should handle those if necessary.
> >>
> >> [Test Case]
> >> The bug is unfortunately difficult to reproduce, as there's no detailed
> >> documentation on how the i40e firmware detects and raises MDDs. We have
> >> seen reports of this happening in Xenial and Bionic, for workloads
> >> stressing i40e bonds in LACP mode.
> >> A successful reproduction is easy to detect, as the network traffic will
> >> be interrupted and the system logs will contain a message like:
> >> i40e 0000:02:00.1: TX driver issue detected, PF reset issued
> >>
> >> [Regression Potential]
> >> Since we're removing resets for the NIC, regressions could show up as
> >> issues in connectivity after the MDD events are raised. If the firmware
> >> expects the whole NIC to reset, we could see TX/RX hangs and general
> >> unresponsiveness in networking. The potential for this should however
> >> be fairly low, as this patch has been present since kernel 5.2 and
> >> hasn't seen any fixes or regressions upstream. Basic smoke tests also
> >> showed that the driver continues working as expected.
> >>
> >> Carolyn Wyborny (1):
> >>    i40e: change behavior on PF in response to MDD event
> >>
> >>   drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++----------
> >>   1 file changed, 2 insertions(+), 10 deletions(-)
> >>
> > The change on its own is probably hard to judge. There could be other
> > changes which look unrelated but somehow make it work when put together.
> > For that reason I would hope to see feedback to the test kernel you seem
> > to have prepared before going ahead.
> > Minor note, reading the current impact I would classify this rather as
> > "medium". For "high" imo the system would have to show complete hangs or
> > crashes.
> >
> > -Stefan
> >
> >
>
> I'm with Stefan on this one. The driver in 5.12-rcX takes a pretty big
> hammer to the firmware when one of these TX/RX events is detected, which
> is a somewhat different approach than this patch takes. In the upstream
> driver a state bit is being set that looks like it eventually forces a
> firmware reset. So, upstream clearly was not happy with the results of
> this patch. I think testing is required.
>
> rtg
>
> -----------
> Tim Gardner
> Canonical, Inc


