ACK/Cmnt: [SRU][F/J:linux-bluefield][PATCH v1 0/1] UBUNTU: SAUCE: mlxbf-gige: Fix kernel panic at shutdown
Bartlomiej Zolnierkiewicz
bartlomiej.zolnierkiewicz at canonical.com
Thu Jun 22 12:04:09 UTC 2023
The ->shutdown operation is generally not the same as ->remove, since the
device instance shouldn't be removed from the system. However, I'm not aware
of any problems with doing it this way for the mlxbf-gige driver, and
a few other drivers already do the same thing, so:
Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz at canonical.com>
On Fri, Jun 2, 2023 at 7:05 PM Asmaa Mnebhi <asmaa at nvidia.com> wrote:
>
> BugLink: https://bugs.launchpad.net/bugs/2022370
>
> SRU Justification:
>
> [Impact]
>
> We occasionally see a race condition (once every 350 reboots) where napi is still
> running (mlxbf_gige_poll) while a shutdown has been initiated through "reboot".
> Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as
> a result causes a kernel panic.
>
> [Fix]
>
> The fix is to explicitly disable napi and dequeue it during shutdown.
> mlxbf_gige_remove already calls:
> unregister_netdev->unregister_netdevice->unregister_netdevice_queue->
> rollback_registered->rollback_registered_many->dev_close_many->
> __dev_close_many->ndo_stop->mlxbf_gige_stop which stops napi
>
> So use mlxbf_gige_remove in place of the existing shutdown logic.
>
> [Test Case]
>
> * Issue at least 1000 reboots from Linux and make sure there is no panic caused by the mlxbf-gige driver.
>
> [Regression Potential]
>
> * Since this issue is hard to reproduce, the fix has not been tested thoroughly yet, so it needs several reboot loops to validate it.
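
The ordering problem described under [Impact] and [Fix] can be illustrated with a minimal user-space sketch. This is not the driver code: every toy_* name is hypothetical, the single-threaded model only mimics the order of events rather than the real race, and the shape of the old shutdown path (resources torn down while polling stays enabled) is an assumption based on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; none of these names are real kernel or driver APIs. */
struct toy_dev {
    int *rx_ring;       /* stands in for the resources the poll routine touches */
    bool napi_enabled;  /* true while the poll routine may still run */
};

static int ring_storage[4];

/* Models bringing the interface up: resources allocated, polling enabled. */
static void toy_open(struct toy_dev *dev)
{
    assert(dev != NULL);
    dev->rx_ring = ring_storage;
    dev->napi_enabled = true;
}

/*
 * Models the poll routine. Returns 0 if polling was skipped, 1 if it ran
 * safely, and -1 where real code would dereference NULL and panic.
 */
static int toy_poll(struct toy_dev *dev)
{
    if (!dev->napi_enabled)
        return 0;
    if (dev->rx_ring == NULL)
        return -1;
    dev->rx_ring[0]++;
    return 1;
}

/* Assumed shape of the old shutdown: resources go away, polling stays enabled. */
static void toy_shutdown_old(struct toy_dev *dev)
{
    dev->rx_ring = NULL;
}

/* Models the stop routine: polling is disabled before resources are released. */
static void toy_stop(struct toy_dev *dev)
{
    dev->napi_enabled = false;
    dev->rx_ring = NULL;
}

/* Models the remove path, which eventually reaches the stop routine. */
static void toy_remove(struct toy_dev *dev)
{
    toy_stop(dev);
}

/* Models the patched shutdown: reuse the remove path. */
static void toy_shutdown_new(struct toy_dev *dev)
{
    toy_remove(dev);
}
```

In this model, a poll that runs after toy_shutdown_old() hits the NULL case, while a poll after toy_shutdown_new() is skipped because polling was disabled first.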