[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
Brian Murray
1840348 at bugs.launchpad.net
Tue Oct 26 20:13:47 UTC 2021
Hello Kellen, or anyone else affected,
Accepted ceph into bionic-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.9 in a
few hours, and then in the -proposed repository.
Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us in getting this
update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package, and change the tag from
verification-needed-bionic to verification-done-bionic. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-bionic. In either case, without details of
your testing we will not be able to proceed.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance for helping!
N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.
** Changed in: ceph (Ubuntu Bionic)
Status: In Progress => Fix Committed
** Tags added: verification-needed verification-needed-bionic
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1840348
Title:
Sharded OpWQ drops suicide_grace after waiting for work
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive queens series:
In Progress
Status in Ubuntu Cloud Archive rocky series:
Won't Fix
Status in Ubuntu Cloud Archive stein series:
Won't Fix
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Bionic:
Fix Committed
Status in ceph source package in Eoan:
Won't Fix
Status in ceph source package in Focal:
Fix Released
Bug description:
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled.
After finding work, the original work queue grace/suicide_grace values
are not re-applied. This can result in hung operations that do not
trigger an OSD suicide recovery.
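
As a rough illustration of why a dropped suicide_grace lets a hung op go unnoticed, here is a simplified model of a heartbeat check. It is a sketch, not Ceph code: the function names are hypothetical, it assumes a suicide deadline of 0 means "disabled", and the 150-second grace is only an illustrative value.

```python
def suicide_deadline(now, suicide_grace):
    """Return the absolute suicide deadline, or 0 when suicide_grace is disabled.

    Assumption for this sketch: a grace of 0 disables the suicide timeout.
    """
    return now + suicide_grace if suicide_grace else 0


def suicide_due(now, deadline):
    """A worker is only considered for suicide when a deadline is set and has passed."""
    return deadline != 0 and now > deadline


# A worker that started an op at t=0 and is still stuck one hour later.
stuck_for = 3600

# Grace still applied (illustrative value): the hung op is detected.
print(suicide_due(stuck_for, suicide_deadline(0, 150)))

# Grace dropped after waiting on an empty queue: the hung op is never detected.
print(suicide_due(stuck_for, suicide_deadline(0, 0)))
```

In this model the second check can never fire, no matter how long the op stays hung, which matches the symptom of ops persisting without suicide recovery.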
The missing suicide recovery was observed on Luminous 12.2.11. The
environment was consistently hitting a known authentication race
condition (issue#37778 [0]) due to repeated OSD service restarts on a
node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the
hung ops would persist for hours without suicide recovery.
[Test Case]
I have not identified a reliable reproducer. I am currently testing the fix by exercising I/O.
I recommend letting this bake upstream before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An
OSD suicide will trigger a service restart and repeated restarts
(flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value
consistent with the value applied before entering
`OSD::ShardedOpWQ::_process()`. The fix is also restricted to the
empty queue edge-case that drops the suicide_grace timeout. The
suicide_grace value is only re-applied when work is found after
waiting on an empty queue.
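
The re-apply behaviour described above can be sketched with a small model. This is not the upstream patch: the class, the function names, and the `reapply_on_work` flag are hypothetical, and the timeout values are illustrative only. It assumes a suicide_grace of 0 means the suicide timeout is disabled.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatHandle:
    """Hypothetical stand-in for a worker thread's heartbeat state."""
    grace: int = 0
    suicide_grace: int = 0  # 0 is assumed to mean "disabled"


def reset_timeout(handle, grace, suicide_grace):
    """Apply a grace/suicide_grace pair to the handle."""
    handle.grace = grace
    handle.suicide_grace = suicide_grace


def process_after_empty_wait(handle, wq_grace, wq_suicide_grace,
                             default_timeout, reapply_on_work):
    """Model a worker that finds an empty queue, waits, then finds work."""
    # While waiting: the default timeout is applied and suicide_grace is disabled.
    reset_timeout(handle, default_timeout, 0)
    # ... the worker waits here and is woken up when work arrives ...
    if reapply_on_work:
        # Restore the values that were in effect before the wait.
        reset_timeout(handle, wq_grace, wq_suicide_grace)
    return handle


# Illustrative values only.
before = process_after_empty_wait(HeartbeatHandle(15, 150), 15, 150, 60,
                                  reapply_on_work=False)
after = process_after_empty_wait(HeartbeatHandle(15, 150), 15, 150, 60,
                                 reapply_on_work=True)
print(before)  # suicide_grace stays disabled while the op is processed
print(after)   # original grace and suicide_grace are back in effect
```

Because the restore only happens on the path where the worker actually waited, a worker that finds work immediately is unaffected in this model.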
- In-Progress -
Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2]
[0] https://tracker.ceph.com/issues/37778
[1] https://tracker.ceph.com/issues/45076
[2] https://github.com/ceph/ceph/pull/34575
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1840348/+subscriptions