[PATCH 0/2][SRU][B][C] i40e: Fix DCB and overlapping tx timeout issues

Nivedita Singhvi nivedita.singhvi at canonical.com
Tue Mar 19 15:11:55 UTC 2019


BugLink: https://bugs.launchpad.net/bugs/1779756

[Impact]
The i40e driver can stall on tx timeouts. One trigger is DCB being
enabled on the connected switch. A second, related problem arises
when a new tx timeout occurs before recovery from a previous
timeout has completed because of CPU load; the driver does not
handle this overlap correctly. The result is networking delays,
packet drops, application timeouts and hangs. Note that the
DCB-triggered tx timeout is just one of the ways to end up in the
second situation.

This issue was seen on a heavily loaded Kafka broker node running
the 4.15.0-38-generic kernel on Xenial.

Symptoms include messages in the kernel log of the form:

---
[4733544.982116] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
[4733544.982119] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
---
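
For triage, lines of this form can be picked out of a saved kernel
log mechanically. The following is a minimal userspace sketch in C,
assuming the exact field layout of the sample above;
parse_tx_timeout() and struct tx_timeout_info are hypothetical
names used for illustration and are not part of the driver.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Fields printed in the tx_timeout line; member names mirror the
 * labels in the log message (hypothetical struct, illustration only). */
struct tx_timeout_info {
    unsigned int vsi_seid;
    unsigned int queue;
    unsigned int ntc, hwb, ntu, tail, intr;
};

/* Returns true and fills *out if the line contains a tx_timeout report
 * in the format of the sample above; returns false otherwise. */
bool parse_tx_timeout(const char *line, struct tx_timeout_info *out)
{
    /* Skip the timestamp, PCI address and interface name. */
    const char *p = strstr(line, "tx_timeout: VSI_seid:");

    if (p == NULL)
        return false;

    /* %x accepts the leading "0x" printed by the driver. */
    return sscanf(p,
                  "tx_timeout: VSI_seid: %u, Q %u, NTC: %x, HWB: %x, "
                  "NTU: %x, TAIL: %x, INT: %x",
                  &out->vsi_seid, &out->queue, &out->ntc, &out->hwb,
                  &out->ntu, &out->tail, &out->intr) == 7;
}
```

Only the first of the two sample lines matches; the "recovery level"
line uses a different layout and is rejected by this sketch.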

[Fix]
A test kernel provided in this LP bug, built with the following
two commits, has been running successfully for several months and
the problem has not been seen again:

"i40e: prevent overlapping tx_timeout recover"
Commit: d5585b7b6846a6d0f9517afe57be3843150719da

"i40e: Fix for Tx timeouts when interface is brought up if
 DCB is enabled"
Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

* The first commit is already in Disco
* The second commit is already in Disco, Cosmic

So Bionic needs both patches and Cosmic needs only the first.
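
The overlapping-timeout case described under [Impact] can be
pictured as a missing guard: a second timeout arriving while a
recovery is still in flight should be ignored rather than start
another recovery. The following is a minimal userspace sketch of
such a guard in C, assuming a single "recovery pending" flag. All
names (struct fake_pf, fake_tx_timeout() and so on) are hypothetical,
and this is an illustration of the idea only, not the code from the
commits above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device driver state; not i40e code. */
struct fake_pf {
    atomic_bool recovery_pending;  /* true while a recovery is in flight */
    int recovery_level;            /* escalates with each new recovery */
};

void fake_pf_init(struct fake_pf *pf)
{
    atomic_init(&pf->recovery_pending, false);
    pf->recovery_level = 0;
}

/* Called when a tx timeout is reported. Returns true if a new recovery
 * was started, false if one was already pending and the timeout was
 * ignored. */
bool fake_tx_timeout(struct fake_pf *pf)
{
    /* Atomically set the flag and fetch its previous value, so two
     * concurrent callers cannot both start a recovery. */
    if (atomic_exchange(&pf->recovery_pending, true))
        return false;

    pf->recovery_level++;
    return true;
}

/* Called once the recovery has finished. */
void fake_recovery_done(struct fake_pf *pf)
{
    atomic_store(&pf->recovery_pending, false);
}
```

Without the flag check, each timeout that arrives during an
unfinished recovery would escalate the recovery level again.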

[Test Case]
* We are considering the case of both issues above occurring.
* Seen by reporter on a Kafka broker node with heavy traffic.
* Not easy to reproduce as it requires something like the
  following example environment and heavy load:

  Kernel: 4.15.0-38-generic
  Network driver: i40e
        version: 2.1.14-k
        firmware-version: 6.00 0x800034e6 18.3.6
  NIC: Intel 40Gb XL710
  DCB enabled

[Regression Potential]
Low. Both changes are confined to the i40e driver, the DCB fix only
affects i40e DCB environments, and the patched kernel has been
running successfully in production-load testing for several months.

Note: Only the first patch should be applied to Cosmic.

Alan Brady (1):
  i40e: prevent overlapping tx_timeout recover

Martyna Szapar (1):
  i40e: Fix for Tx timeouts when interface is brought up if DCB is
    enabled

 drivers/net/ethernet/intel/i40e/i40e.h      |  1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 20 +++++++++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)

-- 
2.17.1



