APPLIED/cmt: [PATCH 0/1][SRU X] UBUNTU: SAUCE: bnxt_en_bpo: Fix TX timeout during netpoll
Stefan Bader
stefan.bader at canonical.com
Thu Mar 14 08:58:21 UTC 2019
On 03.03.19 23:45, Khaled Elmously wrote:
> Thanks for the re-send Nivedita :)
>
> On 2019-03-04 00:36:48 , Nivedita Singhvi wrote:
>> BugLink: http://bugs.launchpad.net/bugs/1814095
The patch itself was missing the BugLink line. I fixed it up now while cranking.
-Stefan
>>
>>
>> [Impact]
>>
>> The bnxt_en_bpo driver hit TX timeouts, causing network stalls on the
>> system and failures to send data and heartbeat packets.
>>
>> The following 25Gb Broadcom NIC error was seen on Xenial running the
>> 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network
>> traffic (just once):
>>
>> * The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the
>> Netdev Watchdog timer under load.
>>
>> * From kernel log:
>> "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
>> See attached kern.log excerpt file for full excerpt of error log.
>>
>> * Release = Xenial
>> Kernel = 4.4.0-141-generic #167
>> eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
>>
>> * This caused the driver to reset in order to recover:
>>
>> "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"
>>
>> driver: bnxt_en_bpo
>> version: 1.8.1
>> source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
>>
>> * The loss of connectivity and softirq stall caused other cascading failures
>> on the system.
>>
>> * The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support
>> newer Broadcom HW (specific boards) while the bnxt_en module continues to
>> support the older HW. The current Linux upstream driver does not compile
>> easily with the 4.4 kernel (too many changes).
>>
>> * This upstream bnxt_en driver fix is a likely solution:
>> "bnxt_en: Fix TX timeout during netpoll"
>> commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
>>
>> This fix has not been applied to the bnxt_en_bpo driver version, but review of
>> the code indicates that it is susceptible to the bug, and the fix would be
>> reasonable.
>>
>>
>> [Test Case]
>>
>> * Unfortunately, this is not easy to reproduce. Also, it is only seen on
>> 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver.
>>
>>
>> [Regression Potential]
>>
>> * The patch is restricted to the bpo driver, with very constrained scope
>> - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
>> opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
>> driver).
>>
>> * The patch is very small and backport is fairly minimal and simple.
>>
>> * The fix has been running on the in-tree driver in upstream mainline as well
>> as the Ubuntu Linux in-tree driver. Although the Broadcom driver has a lot of
>> lower-level code that is different, this piece is still the same.
>>
>>
>> Michael Chan (1):
>> The current netpoll implementation in the bnxt_en driver has problems
>> that may miss TX completion events. bnxt_poll_work() in effect is
>> only handling at most 1 TX packet before exiting. In addition,
>> there may be in flight TX completions that ->poll() may miss even
>> after we fix bnxt_poll_work() to handle all visible TX completions.
>> netpoll may not call ->poll() again and HW may not generate IRQ
>> because the driver does not ARM the IRQ when the budget (0 for
>> netpoll) is reached.
>>
>> ubuntu/bnxt/bnxt.c | 13 ++++++++++---
>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> --
>> 2.17.1
>>
>>
>> --
>> kernel-team mailing list
>> kernel-team at lists.ubuntu.com
>> https://lists.ubuntu.com/mailman/listinfo/kernel-team
>