[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

Sun Dec 13 20:24:27 UTC 2020

@Corey, yes, I am happy to do the SRU verification when the packages are
available. I've updated the [Test case] section to note a simplified,
functional test.

** Description changed:

- [Impact] 
+ [Impact]
  Ceph upstream implemented a new feature [1] that will check/report those long network ping times between osds, but it introduced an issue that ceph-mgr might be very slow because it needs to dump all the new osd network ping stats [2] for some tasks, this can be bad especially when the cluster has large number of osds.

  Since these kind osd network ping stats doesn't need to be exposed to
  the python mgr module. so, it only makes the mgr doing more work than it
  needs to, it could cause the mgr slow or even hang and could cause the
  cpu usage of mgr process constantly high. The fix is to disable the ping
  time dump for those mgr python modules.

  This resulted in ceph-mgr not responding to commands and/or hanging (and
  had to be restarted) in clusters with a large number of OSDs.

  [0] is the upstreambug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL in upstream. But I want to
  backport to these two releases Ubuntu/UCA.

  The major fix from upstream is here [3], and also I found an improvement
  commit [4] that submitted later in another PR.

  [Test Case]
  Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr dumps the network ping stats regularly, this problem would manifest. This is relatively hard to reproduce as the ceph-mgr may not always get overloaded and thus not hang.
+ 
+ A simpler version could be to deploy a Ceph cluster with as many OSDs as
+ the hardware/system setup allows and drive I/O on the cluster for
+ sometime. Then various queries could be sent to the manager to verify it
+ does report and doesn't get stuck.

  [Regression Potential]
  Fix has been accepted upstream (the changes are here in "sync" with upstream to the extent these old releases match the latest source code) and have been confirmed to work. So the risk is minimal.

  At worst, this could affect modules that consume the stats from ceph-mgr
  (such as prometheus or other monitoring scripts/tools) and thus becomes
  less useful. But still shouldn't cause any problems to the operations of
  the cluster itself.

  [Other Info]
  - In addition to the fix from [1], another commit [4] is also cherry-picked and backported here - this was also accepted upstream.

  - Since the ceph-mgr hangs when affected, this also impact sosreport
  collection - commands time out as the mgr doesn't respond and thus info
  get truncated/not collected in that case. This fix should help avoid
  that problem in sosreports.

  [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

** Description changed:

  [Impact]
  Ceph upstream implemented a new feature [1] that will check/report those long network ping times between osds, but it introduced an issue that ceph-mgr might be very slow because it needs to dump all the new osd network ping stats [2] for some tasks, this can be bad especially when the cluster has large number of osds.

  Since these kind osd network ping stats doesn't need to be exposed to
  the python mgr module. so, it only makes the mgr doing more work than it
  needs to, it could cause the mgr slow or even hang and could cause the
  cpu usage of mgr process constantly high. The fix is to disable the ping
  time dump for those mgr python modules.

  This resulted in ceph-mgr not responding to commands and/or hanging (and
  had to be restarted) in clusters with a large number of OSDs.

  [0] is the upstreambug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL in upstream. But I want to
  backport to these two releases Ubuntu/UCA.

  The major fix from upstream is here [3], and also I found an improvement
  commit [4] that submitted later in another PR.

  [Test Case]
  Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr dumps the network ping stats regularly, this problem would manifest. This is relatively hard to reproduce as the ceph-mgr may not always get overloaded and thus not hang.

  A simpler version could be to deploy a Ceph cluster with as many OSDs as
- the hardware/system setup allows and drive I/O on the cluster for
- sometime. Then various queries could be sent to the manager to verify it
- does report and doesn't get stuck.
+ the hardware/system setup allows (not necessarily 600+) and drive I/O on
+ the cluster for sometime. Then various queries could be sent to the
+ manager to verify it does report and doesn't get stuck.

  [Regression Potential]
  Fix has been accepted upstream (the changes are here in "sync" with upstream to the extent these old releases match the latest source code) and have been confirmed to work. So the risk is minimal.

  At worst, this could affect modules that consume the stats from ceph-mgr
  (such as prometheus or other monitoring scripts/tools) and thus becomes
  less useful. But still shouldn't cause any problems to the operations of
  the cluster itself.

  [Other Info]
  - In addition to the fix from [1], another commit [4] is also cherry-picked and backported here - this was also accepted upstream.

  - Since the ceph-mgr hangs when affected, this also impact sosreport
  collection - commands time out as the mgr doesn't respond and thus info
  get truncated/not collected in that case. This fix should help avoid
  that problem in sosreports.

  [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1906496

Title:
  [SRU] mgr can be very slow in a large ceph cluster

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in Ubuntu Cloud Archive stein series:
  Triaged
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Bionic:
  Triaged
Status in ceph source package in Focal:
  Fix Released
Status in ceph source package in Groovy:
  Fix Released
Status in ceph source package in Hirsute:
  Fix Released

Bug description:
  [Impact]
  Ceph upstream implemented a new feature [1] that will check/report those long network ping times between osds, but it introduced an issue that ceph-mgr might be very slow because it needs to dump all the new osd network ping stats [2] for some tasks, this can be bad especially when the cluster has large number of osds.

  Since these kind osd network ping stats doesn't need to be exposed to
  the python mgr module. so, it only makes the mgr doing more work than
  it needs to, it could cause the mgr slow or even hang and could cause
  the cpu usage of mgr process constantly high. The fix is to disable
  the ping time dump for those mgr python modules.

  This resulted in ceph-mgr not responding to commands and/or hanging
  (and had to be restarted) in clusters with a large number of OSDs.

  [0] is the upstreambug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL in upstream. But I want to
  backport to these two releases Ubuntu/UCA.

  The major fix from upstream is here [3], and also I found an
  improvement commit [4] that submitted later in another PR.

  [Test Case]
  Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr dumps the network ping stats regularly, this problem would manifest. This is relatively hard to reproduce as the ceph-mgr may not always get overloaded and thus not hang.

  A simpler version could be to deploy a Ceph cluster with as many OSDs
  as the hardware/system setup allows (not necessarily 600+) and drive
  I/O on the cluster for sometime. Then various queries could be sent to
  the manager to verify it does report and doesn't get stuck.

  [Regression Potential]
  Fix has been accepted upstream (the changes are here in "sync" with upstream to the extent these old releases match the latest source code) and have been confirmed to work. So the risk is minimal.

  At worst, this could affect modules that consume the stats from ceph-
  mgr  (such as prometheus or other monitoring scripts/tools) and thus
  becomes less useful. But still shouldn't cause any problems to the
  operations of the cluster itself.

  [Other Info]
  - In addition to the fix from [1], another commit [4] is also cherry-picked and backported here - this was also accepted upstream.

  - Since the ceph-mgr hangs when affected, this also impact sosreport
  collection - commands time out as the mgr doesn't respond and thus
  info get truncated/not collected in that case. This fix should help
  avoid that problem in sosreports.

  [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1906496/+subscriptions