APPLIED/cmt: [PATCH 0/1][SRU X] UBUNTU: SAUCE: bnxt_en_bpo: Fix TX timeout during netpoll
Khaled Elmously
khalid.elmously at canonical.com
Sun Mar 3 22:45:08 UTC 2019
Thanks for the re-send Nivedita :)
On 2019-03-04 00:36:48 , Nivedita Singhvi wrote:
> BugLink: http://bugs.launchpad.net/bugs/1814095
>
>
> [Impact]
>
> The bnxt_en_bpo driver hit TX timeouts, causing network stalls on the system
> and failures to send data and heartbeat packets.
>
> The following 25Gb Broadcom NIC error was seen on Xenial running the
> 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network
> traffic (just once):
>
> * The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the
> Netdev Watchdog timer under load.
>
> * From kernel log:
> "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
> See attached kern.log excerpt file for full excerpt of error log.
>
> * Release = Xenial
> Kernel = 4.4.0-141-generic #167
> eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
>
> * This caused the driver to reset in order to recover:
>
> "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"
>
> driver: bnxt_en_bpo
> version: 1.8.1
> source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
>
> * The loss of connectivity and softirq stall caused other cascading failures
> on the system.
>
> * The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support
> newer Broadcom HW (specific boards) while the bnxt_en module continues to
> support the older HW. The current Linux upstream driver does not compile
> easily with the 4.4 kernel (too many changes).
>
> * This upstream bnxt_en driver fix is a likely solution:
> "bnxt_en: Fix TX timeout during netpoll"
> commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
>
> This fix has not been applied to the bnxt_en_bpo driver version, but review of
> the code indicates that it is susceptible to the bug, and the fix would be
> reasonable.
>
>
> [Test Case]
>
> * Unfortunately, this is not easy to reproduce. Also, it is only seen on
> 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver.
>
>
> [Regression Potential]
>
> * The patch is restricted to the bpo driver, with very constrained scope
> - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
> opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
> driver).
>
> * The patch is very small and backport is fairly minimal and simple.
>
> * The fix has been running on the in-tree driver in upstream mainline as well
> as the Ubuntu Linux in-tree driver. Although the Broadcom driver has a lot of
> lower level code that is different, this piece is still the same.
>
>
> Michael Chan (1):
> The current netpoll implementation in the bnxt_en driver has problems
> that may miss TX completion events. bnxt_poll_work() in effect is
> only handling at most 1 TX packet before exiting. In addition,
> there may be in flight TX completions that ->poll() may miss even
> after we fix bnxt_poll_work() to handle all visible TX completions.
> netpoll may not call ->poll() again and HW may not generate IRQ
> because the driver does not ARM the IRQ when the budget (0 for
> netpoll) is reached.
>
> ubuntu/bnxt/bnxt.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> --
> 2.17.1
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
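
For readers following the commit description quoted above, here is a minimal
user-space sketch in C of the budget check it describes. This is not the driver
code: the toy_ring structure and both poll_work_* functions are hypothetical
names made up for illustration. The "fixed" variant only models the idea of
draining all visible TX completions when the budget is 0 (the netpoll case);
it does not model re-arming the IRQ.

```c
#include <stdio.h>

/* Toy completion types; hypothetical, not the driver's real definitions. */
enum cmpl_type { CMPL_TX, CMPL_RX };

/* Hypothetical stand-in for a completion ring. */
struct toy_ring {
    const enum cmpl_type *events; /* pending completions, oldest first */
    int count;                    /* number of entries in events */
    int cons;                     /* consumer index */
};

/*
 * Naive loop: exits as soon as rx_pkts == budget. With budget 0 that is
 * already true after the first TX completion, so at most one TX event is
 * handled per call.
 */
static int poll_work_naive(struct toy_ring *r, int budget, int *tx_done)
{
    int rx_pkts = 0;

    while (r->cons < r->count) {
        if (r->events[r->cons] == CMPL_TX) {
            (*tx_done)++;
        } else {
            if (rx_pkts >= budget)
                break;            /* leave RX work for a later poll */
            rx_pkts++;
        }
        r->cons++;
        if (rx_pkts == budget)
            break;
    }
    return rx_pkts;
}

/*
 * Sketch of the idea in the commit text: only stop on the budget once RX
 * packets have actually been counted, so TX completions keep draining
 * when budget is 0. The loop still stops at the first RX entry it cannot
 * afford.
 */
static int poll_work_fixed(struct toy_ring *r, int budget, int *tx_done)
{
    int rx_pkts = 0;

    while (r->cons < r->count) {
        if (r->events[r->cons] == CMPL_TX) {
            (*tx_done)++;
        } else {
            if (rx_pkts >= budget)
                break;
            rx_pkts++;
        }
        r->cons++;
        if (rx_pkts && rx_pkts == budget)
            break;
    }
    return rx_pkts;
}
```

Calling both functions with a budget of 0 on a ring holding only TX completions
shows the difference: the naive loop returns after one entry, while the second
variant consumes every entry.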