[PATCH 0/1][SRU X] UBUNTU: SAUCE: bnxt_en_bpo: Fix TX timeout during netpoll
nivedita.singhvi at canonical.com
Sun Mar 3 19:06:48 UTC 2019
The bnxt_en_bpo driver hit TX timeouts, causing network stalls on the system
and failures to send data and heartbeat packets.
The following 25Gb Broadcom NIC error was seen (just once) on Xenial running
the 4.4.0-141-generic kernel on an amd64 host under moderate-to-heavy network
traffic:
* The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the
Netdev Watchdog timer under load.
* From kernel log:
"NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
See attached kern.log excerpt file for full excerpt of error log.
* Release = Xenial
Kernel = 4.4.0-141-generic #167
eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
* This caused the driver to reset in order to recover:
"bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"
source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
* The loss of connectivity and softirq stall caused other cascading failures
on the system.
* The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support
newer Broadcom HW (specific boards), while the bnxt_en module continues to
support the older HW. The current Linux upstream driver does not compile
easily with the 4.4 kernel (too many changes).
* This upstream and bnxt_en driver fix is a likely solution:
"bnxt_en: Fix TX timeout during netpoll"
This fix has not been applied to the bnxt_en_bpo driver version, but review of
the code indicates that it is susceptible to the bug, and the fix would be
* Unfortunately, this is not easy to reproduce. Also, it is only seen on
4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver.
* The patch is restricted to the bpo driver, with very constrained scope
- just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
* The patch is very small and backport is fairly minimal and simple.
* The fix has been running on the in-tree driver in upstream mainline as well
as in the Ubuntu Linux in-tree driver. Although the Broadcom driver has a lot
of lower-level code that is different, this piece is still the same.
Michael Chan (1):
The current netpoll implementation in the bnxt_en driver has problems
that may miss TX completion events. bnxt_poll_work() in effect is
only handling at most 1 TX packet before exiting. In addition,
there may be in flight TX completions that ->poll() may miss even
after we fix bnxt_poll_work() to handle all visible TX completions.
netpoll may not call ->poll() again and HW may not generate IRQ
because the driver does not ARM the IRQ when the budget (0 for
netpoll) is reached.
ubuntu/bnxt/bnxt.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
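
To make the failure mode in the commit message above easier to follow, here is
a small user-space model of it. This is only a sketch under stated assumptions,
not the driver code: struct ring_model, poll_work_model() and poll_model() are
hypothetical names, the ring carries TX completions only, and the "fixed" flag
stands in for the two changes the commit message describes (do not exit the
work loop on a zero budget, and arm the IRQ when polled with budget 0).

```c
#include <stdbool.h>

/* Hypothetical stand-in for a completion ring that holds only TX completions. */
struct ring_model {
    int pending_tx;   /* TX completions currently visible to the poller */
    bool irq_armed;   /* whether the HW is allowed to raise another IRQ */
};

/*
 * Model of a work loop whose exit test compares RX work against the budget.
 * With budget == 0 (netpoll) and no RX traffic, "rx_pkts == budget" is true
 * right away, so the unfixed variant stops after a single TX completion.
 */
static int poll_work_model(struct ring_model *r, int budget, bool fixed)
{
    int rx_pkts = 0;  /* stays 0: this model carries no RX traffic */
    int tx_done = 0;

    while (r->pending_tx > 0) {
        r->pending_tx--;
        tx_done++;

        if (fixed) {
            /* Assumed shape of the fix: only break on a non-zero RX count. */
            if (rx_pkts && rx_pkts == budget)
                break;
        } else if (rx_pkts == budget) {
            break;
        }
    }
    return tx_done;
}

/* Model of ->poll(): returns how many TX completions were handled. */
static int poll_model(struct ring_model *r, int budget, bool fixed)
{
    int work_done = 0;  /* RX work only; always 0 in this model */
    int tx_done;

    r->irq_armed = false;  /* IRQ stays disarmed while polling */
    tx_done = poll_work_model(r, budget, fixed);

    if (work_done >= budget) {
        /* Budget reached: the unfixed variant leaves the IRQ disarmed. */
        if (fixed && budget == 0)
            r->irq_armed = true;
        return tx_done;
    }

    r->irq_armed = true;  /* under budget: polling completes and re-arms */
    return tx_done;
}
```

In this model, calling poll_model() with budget 0 on a ring holding five TX
completions handles one of them and leaves the IRQ disarmed in the unfixed
variant, which matches the stall described above: nothing triggers another
->poll(), so the remaining completions sit there until the watchdog fires.
The fixed variant drains all five and re-arms the IRQ.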