ACK/cmnt: [PATCH 0/2][SRU][B][C] i40e: Fix DCB and overlapping tx timeout issues
Kleber Souza
kleber.souza at canonical.com
Tue Mar 26 11:29:49 UTC 2019
On 3/19/19 4:11 PM, Nivedita Singhvi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1779756
>
> [Impact]
> The i40e driver can get stalled on tx timeouts. This can happen when
> DCB is enabled on the connected switch. It can also trigger a second
> problem: under CPU load, a new tx timeout can occur before recovery
> from a previous timeout has completed, and this overlap is not
> handled correctly. This leads to networking delays, drops, and
> application timeouts and hangs. Note that the first tx timeout
> cause is just one of the ways to end up in the second situation.
>
> This issue was seen on a heavily loaded Kafka broker node running
> the 4.15.0-38-generic kernel on Xenial.
>
> Symptoms include messages in the kernel log of the form:
>
> ---
> [4733544.982116] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
> [4733544.982119] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
> ---
>
> [Fix]
> With the test kernel provided in this LP bug, which had these
> two commits compiled in, the problem has not been seen again,
> and the system has been running successfully for several months:
>
> "i40e: prevent overlapping tx_timeout recover"
> Commit: d5585b7b6846a6d0f9517afe57be3843150719da
>
> "i40e: Fix for Tx timeouts when interface is brought up if
> DCB is enabled"
> Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee
>
> * The first commit is already in Disco
> * The second commit is already in Disco, Cosmic
>
> So Bionic needs both patches and Cosmic only needs the first.
>
> [Test Case]
> * We are considering the case of both issues above occurring.
> * Seen by reporter on a Kafka broker node with heavy traffic.
> * Not easy to reproduce as it requires something like the
> following example environment and heavy load:
>
> Kernel: 4.15.0-38-generic
> Network driver: i40e
> version: 2.1.14-k
> firmware-version: 6.00 0x800034e6 18.3.6
> NIC: Intel 40Gb XL710
> DCB enabled
>
> [Regression Potential]
> Low, as the first only impacts i40e DCB environment, and has
> been running for several months in production-load testing
> successfully.
>
> Note: The first patch should be applied only to Cosmic.
>
> Alan Brady (1):
> i40e: prevent overlapping tx_timeout recover
>
> Martyna Szapar (1):
> i40e: Fix for Tx timeouts when interface is brought up if DCB is
> enabled
>
> drivers/net/ethernet/intel/i40e/i40e.h | 1 +
> drivers/net/ethernet/intel/i40e/i40e_main.c | 20 +++++++++++++-------
> 2 files changed, 14 insertions(+), 7 deletions(-)
>
The correct cherry pick provenance line is
"(cherry picked from commit ...)"
without the dash "-", as added by "git cherry-pick -x". This can be fixed
when applying.
Apart from that it looks good. Clean cherry pick and extensively tested.
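
For anyone wanting to check the provenance line before sending, here is a
minimal sketch in Python. It assumes the line "git cherry-pick -x" appends
has the form "(cherry picked from commit <40-hex-digit SHA-1>)". The helper
name has_valid_provenance is hypothetical and not part of any kernel-team
tooling. It rejects variants such as a hyphenated "cherry-picked".

```python
import re

# Shape of the line appended by "git cherry-pick -x": parenthesised,
# no hyphen in "cherry picked", followed by a full 40-hex-digit SHA-1.
PROVENANCE_RE = re.compile(r"^\(cherry picked from commit [0-9a-f]{40}\)$")


def has_valid_provenance(commit_message: str) -> bool:
    """Return True if any line of the message is a well-formed provenance line.

    Hypothetical helper for illustration only.
    """
    return any(
        PROVENANCE_RE.match(line.strip()) is not None
        for line in commit_message.splitlines()
    )


if __name__ == "__main__":
    good = (
        "i40e: prevent overlapping tx_timeout recover\n\n"
        "(cherry picked from commit d5585b7b6846a6d0f9517afe57be3843150719da)\n"
    )
    bad = good.replace("cherry picked", "cherry-picked")
    print(has_valid_provenance(good), has_valid_provenance(bad))
```

The check is deliberately strict (full SHA, exact wording), so a malformed
line is flagged rather than silently accepted.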
Acked-by: Kleber Sacilotto de Souza <kleber.souza at canonical.com>