[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

Dan Hill 1840348 at bugs.launchpad.net
Fri Apr 10 19:15:35 UTC 2020


There are two edge cases in 12.2.11 where a worker thread's suicide_grace value gets dropped:
[0] In the ThreadPool context, ThreadPool::worker() drops suicide_grace while waiting on an empty work queue.
[1] In the ShardedThreadPool context, OSD::ShardedOpWQ::_process() drops suicide_grace while opportunistically waiting for more work (to prevent additional lock contention).

The ThreadPool context always re-assigns suicide_grace before driving
any work. The ShardedThreadPool context does not follow this pattern:
after delaying to find additional work, the default sharded work queue
timeouts are not re-applied (see the sketch below the links).

This oversight exists from Luminous onwards. Mimic and Nautilus have
each reworked the ShardedOpWQ code path, but did not address the
problem.

[0] https://github.com/ceph/ceph/blob/v12.2.11/src/common/WorkQueue.cc#L137
[1] https://github.com/ceph/ceph/blob/v12.2.11/src/osd/OSD.cc#L10476
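
To make the difference concrete, below is a minimal, runnable model of the two
patterns. It is a sketch, not Ceph code: Handle, HeartbeatMap and Shard are
stand-ins for heartbeat_handle_d, ceph::HeartbeatMap and the per-shard op
queue, the condition-variable wait and locking are elided, and the numeric
defaults (osd_op_thread_timeout=15, osd_op_thread_suicide_timeout=150,
threadpool_default_timeout=60) are assumed.

#include <cassert>
#include <cstdint>
#include <deque>

struct Handle {
  uint32_t grace = 0;          // expiring only logs a "timed out" warning
  uint32_t suicide_grace = 0;  // expiring kills the OSD; 0 = disabled
};

struct HeartbeatMap {
  void reset_timeout(Handle &h, uint32_t grace, uint32_t suicide_grace) {
    h.grace = grace;
    h.suicide_grace = suicide_grace;
  }
  void clear_timeout(Handle &h) { h.grace = h.suicide_grace = 0; }
};

struct Shard {
  std::deque<int> queue;                     // stand-in for the shard's pqueue
  uint32_t timeout_interval = 15;            // osd_op_thread_timeout
  uint32_t suicide_interval = 150;           // osd_op_thread_suicide_timeout
  uint32_t threadpool_default_timeout = 60;
};

// [0] ThreadPool::worker() pattern: the work queue's timeouts are re-applied
// right before every item is driven, so relaxing them while idle is harmless.
void threadpool_worker_one_item(HeartbeatMap &hm, Handle &hb, Shard &s) {
  hm.reset_timeout(hb, s.timeout_interval, s.suicide_interval);  // re-armed
  s.queue.pop_front();                                           // drive the item
}

// [1] OSD::ShardedOpWQ::_process() pattern on Luminous: on an empty shard the
// thread relaxes its timeouts, waits briefly for more work, then drives
// whatever arrived without ever re-arming suicide_grace.
void sharded_opwq_process(HeartbeatMap &hm, Handle &hb, Shard &s) {
  if (s.queue.empty()) {
    hm.clear_timeout(hb);    // stop the heartbeat clock while idle
    s.queue.push_back(1);    // ...work arrives during the short wait...
    hm.reset_timeout(hb, s.threadpool_default_timeout, 0);  // suicide_grace left 0
  }
  // Missing here: reset_timeout(hb, s.timeout_interval, s.suicide_interval).
  s.queue.pop_front();       // drive the item; a hang can no longer suicide the OSD
}

int main() {
  HeartbeatMap hm;
  Handle hb;
  Shard s;

  s.queue.push_back(1);
  threadpool_worker_one_item(hm, hb, s);
  assert(hb.suicide_grace == s.suicide_interval);  // [0]: still armed

  sharded_opwq_process(hm, hb, s);
  assert(hb.suicide_grace == 0);                   // [1]: suicide recovery disabled
  return 0;
}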

** Description changed:

- Multiple incidents have been seen where ops were blocked for various
- reasons and the suicide_grace timeout was not observed, meaning that the
- OSD failed to suicide as expected.
+ [Impact]
+ The Sharded OpWQ will opportunistically wait for more work when processing an
+ empty queue. While waiting, the heartbeat timeout and suicide_grace values are
+ modified. On Luminous, the `threadpool_default_timeout` grace is left applied
+ and suicide_grace is left disabled. On later releases both the grace and
+ suicide_grace are left disabled. 
+ 
+ After finding work, the original work queue grace/suicide_grace values are
+ not re-applied. This can result in hung operations that do not trigger an OSD
+ suicide recovery.
+ 
+ The missing suicide recovery was observed on Luminous 12.2.11. The environment
+ was consistently hitting a known authentication race condition (issue#37778
+ [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
+ faulty DIMM. 
+ 
+ The auth race condition would stall PG operations. In some cases, the hung ops
+ would persist for hours without suicide recovery.
+ 
+ [Test Case]
+ - In-Progress -
+ Haven't landed on a reliable reproducer. Currently testing the fix by
+ exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
+ let this bake in the latest release before considering a back-port. 
+ 
+ [Regression Potential]
+ This fix improves suicide_grace coverage of the Sharded OpWQ.
+ 
+ This change is made in a critical code path that drives client I/O. An OSD
+ suicide will trigger a service restart and repeated restarts (flapping) will
+ adversely impact cluster performance. 
+ 
+ The fix mitigates risk by keeping the applied suicide_grace value consistent
+ with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix
+ is also restricted to the empty-queue edge case that drops the suicide_grace
+ timeout. The suicide_grace value is only re-applied when work is found after
+ waiting on an empty queue. 
+ 
+ - In-Progress -
+ The fix will bake upstream on later levels before back-port consideration.
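
For reference, the shape of the fix is sketched below in terms of the model
earlier in this mail. This is an illustration of the behaviour described in
[Regression Potential], not the actual patch: once the empty-queue wait finds
work, the work queue's own grace/suicide_grace are re-armed instead of leaving
(threadpool_default_timeout, 0) in place.

// Illustration only, reusing the Handle/HeartbeatMap/Shard stand-ins above.
void sharded_opwq_process_fixed(HeartbeatMap &hm, Handle &hb, Shard &s) {
  if (s.queue.empty()) {
    hm.clear_timeout(hb);    // timeouts relaxed only while idle
    s.queue.push_back(1);    // ...work arrives during the short wait...
    hm.reset_timeout(hb, s.timeout_interval, s.suicide_interval);  // re-armed
  }
  s.queue.pop_front();       // drive the item with suicide_grace active again
}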

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1840348

Title:
  Sharded OpWQ drops suicide_grace after waiting for work

Status in ceph package in Ubuntu:
  Triaged

Bug description:
  [Impact]
  The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled.

  After finding work, the original work queue grace/suicide_grace values
  are not re-applied. This can result in hung operations that do not
  trigger an OSD suicide recovery.

  The missing suicide recovery was observed on Luminous 12.2.11. The
  environment was consistently hitting a known authentication race
  condition (issue#37778 [0]) due to repeated OSD service restarts on a
  node exhibiting MCEs from a faulty DIMM.

  The auth race condition would stall PG operations. In some cases, the
  hung ops would persist for hours without suicide recovery.

  [Test Case]
  - In-Progress -
  Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.

  [Regression Potential]
  This fix improves suicide_grace coverage of the Sharded OpWQ.

  This change is made in a critical code path that drives client I/O. An
  OSD suicide will trigger a service restart and repeated restarts
  (flapping) will adversely impact cluster performance.

  The fix mitigates risk by keeping the applied suicide_grace value
  consistent with the value applied before entering
  `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the
  empty-queue edge case that drops the suicide_grace timeout. The
  suicide_grace value is only re-applied when work is found after
  waiting on an empty queue.

  - In-Progress -
  The fix will bake upstream on later levels before back-port consideration.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+subscriptions


