[Bug 1911900] Re: [SRU] Active scrub blocks upmap balancer

Mon Apr 12 14:01:25 UTC 2021

Hi Pon, if you still need a Bionic SRU for this one, can you attach a
debdiff for Bionic? Thanks.

** Changed in: ceph (Ubuntu Bionic)
       Status: In Progress => New

** Changed in: cloud-archive
     Assignee: Ponnuvel Palaniyappan (pponnuvel) => (unassigned)

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1911900

Title:
  [SRU] Active scrub blocks upmap balancer

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Bionic:
  New
Status in ceph source package in Focal:
  Fix Released
Status in ceph source package in Groovy:
  Fix Released
Status in ceph source package in Hirsute:
  Fix Released

Bug description:
  [Impact]

  When scrubs are in progress, the balancer stops because of bug [0] and
  shows:

  <timestamp> calc_pg_upmaps abort due to max <= 0

  in the logs.
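
  To confirm that a cluster is hitting this, the logs can be searched
  for the message above. A minimal sketch, assuming the log lines have
  already been read into a list of strings; the function name and the
  sample lines are hypothetical, and only the message text comes from
  this bug:

```python
# Message fragment quoted in the bug description.
ABORT_MARKER = "calc_pg_upmaps abort due to max <= 0"


def find_balancer_aborts(lines):
    """Return the log lines that contain the balancer abort message."""
    return [line.rstrip("\n") for line in lines if ABORT_MARKER in line]


# Made-up sample lines for illustration only; real log lines will look different.
sample_log = [
    "<timestamp> calc_pg_upmaps abort due to max <= 0\n",
    "<timestamp> some unrelated message\n",
]

for hit in find_balancer_aborts(sample_log):
    print(hit)
```

  A non-empty result while scrubs are running suggests the balancer is
  being blocked as described here.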

  Deep-scrubs are typically run in maintenance windows and can take a
  few hours. If balancing is paused for that duration, client I/O
  performance can suffer later, when balancing resumes after the
  deep-scrub is done.

  This bug was introduced in Octopus. We need to backport the fix for
  upstream bug [0] to Octopus only. It has been fixed in the upstream
  master branch [1].

  [Test Case]

  In an Octopus Ceph cluster that has some data (large enough to be able to notice balancing), take down one or more OSDs to introduce "unbalanced" objects.
  Make sure the Ceph balancer module is enabled and active (which should be the default case in Octopus).
  Perform some I/O so that data goes to the rest of the OSDs.

  Then start deep-scrubbing and re-add the previously taken-down OSDs so
  that balancing starts to happen.
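
  The steps above could be laid out as a command plan. This is a rough
  sketch rather than the procedure actually used for verification: it
  only builds and prints the commands and does not run anything. The
  OSD id, pool name, and bench duration are placeholder assumptions,
  and marking an OSD "out" is used here as a stand-in for taking it
  down.

```python
import shlex


def build_test_plan(osd_id, pool, bench_seconds=60):
    """Return the test-case steps as argv lists; nothing is executed."""
    return [
        # Check that the balancer module is enabled and active.
        ["ceph", "balancer", "status"],
        # Assumption: mark one OSD out to introduce imbalance.
        ["ceph", "osd", "out", str(osd_id)],
        # Generate some I/O so data lands on the remaining OSDs.
        ["rados", "bench", "-p", pool, str(bench_seconds), "write", "--no-cleanup"],
        # Start deep-scrubbing the pool's placement groups.
        ["ceph", "osd", "pool", "deep-scrub", pool],
        # Re-add the OSD so the balancer has work to do during the scrub.
        ["ceph", "osd", "in", str(osd_id)],
        # Check the balancer again while scrubs are still running.
        ["ceph", "balancer", "status"],
    ]


# Placeholder values: OSD 3 and a pool named "testpool".
for argv in build_test_plan(3, "testpool"):
    print(shlex.join(argv))
```

  On an affected build, the abort message from [Impact] would be
  expected in the logs during the last step; with the fix, balancing
  should proceed while the scrub runs.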

  [Regression potential]

  Low potential. This is a bug fix for code that was previously correct.

  If anything goes wrong, the balancer module might not function properly,
  leaving the cluster unbalanced and potentially requiring manual balancing.

  [Other Info]

  The fix has been accepted upstream and backported to Octopus; see [0]
  and [1].

  [0] https://tracker.ceph.com/issues/48309
  [1] https://github.com/ceph/ceph/pull/38337

  For Hirsute, James is working on a snapshot of Pacific and that should include the fix for this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1911900/+subscriptions