NAK: [SRU][J:linux-bluefield][PATCH 0/1] Devlink backport: fix race and lock issue
William Tu
witu at nvidia.com
Thu Oct 19 16:10:46 UTC 2023
Hi Tim,
Thanks, I will create a new LP bug.
William
From: Tim Gardner <tim.gardner at canonical.com>
Date: Thursday, October 19, 2023 at 9:07 AM
To: William Tu <witu at nvidia.com>, kernel-team at lists.ubuntu.com <kernel-team at lists.ubuntu.com>
Cc: Majd Dibbiny <majd at nvidia.com>, Bodong Wang <bodong at nvidia.com>, Jiri Pirko <jiri at nvidia.com>, Vladimir Sokolovsky <vlad at nvidia.com>
Subject: Re: NAK: [SRU][J:linux-bluefield][PATCH 0/1] Devlink backport: fix race and lock issue
On 10/19/23 10:03 AM, Tim Gardner wrote:
> On 10/19/23 9:09 AM, William Tu wrote:
>> BugLink: https://bugs.launchpad.net/bugs/2032378
>>
>> The patch is a follow-up from the previous devlink backport series.
>> We've found that devlink reload hangs the system when testing against
>> OFED 2307.
>>
>> [ 1089.747409] INFO: task devlink:8753 blocked for more than 120 seconds.
>> [ 1089.760560] Tainted: G OE 5.15.0-1027-bluefield #29-Ubuntu
>> [ 1089.775086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 1089.790829] task:devlink state:D stack: 0 pid: 8753 ppid: 5090 flags:0x00000004
>> [ 1089.790838] Call trace:
>> [ 1089.790840] __switch_to+0xf8/0x150
>> [ 1089.790857] __schedule+0x2b8/0x790
>> [ 1089.790865] schedule+0x64/0x140
>> [ 1089.790870] schedule_preempt_disabled+0x18/0x24
>> [ 1089.790874] __mutex_lock.constprop.0+0x1a0/0x680
>> [ 1089.790878] __mutex_lock_slowpath+0x40/0x90
>> [ 1089.790883] mutex_lock+0x64/0x70
>> [ 1089.790887] devl_lock+0x1c/0x30
>> [ 1089.790893] mlx5_detach_device+0x58/0x190 [mlx5_core]
>> [ 1089.791055] mlx5_unload_one+0x40/0xe4 [mlx5_core]
>> [ 1089.791177] mlx5_devlink_reload_down+0x184/0x270 [mlx5_core]
>> [ 1089.791318] devlink_reload+0x214/0x290
>>
>> Checking the OFED source code, we found that the missing unlocked devl
>> trap group API (devl_trap_groups_register()) also needs to be backported
>> to avoid the deadlock.
>>
>> void mlx5_detach_device(struct mlx5_core_dev *dev, bool suspend)
>> {
>> ...
>> #ifdef HAVE_DEVL_PORT_REGISTER
>> #ifdef HAVE_DEVL_TRAP_GROUPS_REGISTER
>> devl_assert_locked(priv_to_devlink(dev));
>> #else
>> devl_lock(devlink);
>> #endif /* HAVE_DEVL_TRAP_GROUPS_REGISTER */
>> #endif /* HAVE_DEVL_PORT_REGISTER */
>> mutex_lock(&mlx5_intf_mutex);
>> #ifdef HAVE_DEVL_PORT_REGISTER
>>
>> I'm reusing the same BugLink since this is a related issue.
>>
>> Jiri Pirko (1):
>> net: devlink: add unlocked variants of devling_trap*() functions
>>
>> include/net/devlink.h | 20 +++++
>> net/core/devlink.c | 180 ++++++++++++++++++++++++++++++++++--------
>> 2 files changed, 168 insertions(+), 32 deletions(-)
>>
>
> This needs a new LP bug since 00371808 is already Fix Committed. Also,
> there was no patch or PR attached to this email. What are we supposed to
> do with it?
>
Never mind that last part. It was in my spam folder for some reason.
Nevertheless, you need a new LP bug.
--
-----------
Tim Gardner
Canonical, Inc