[Bug 1843085] Re: Backport of zero-length gc chain fixes to Luminous

Dan Hill 1843085 at bugs.launchpad.net
Wed Jan 15 18:08:25 UTC 2020


** Tags removed: verification-needed
** Tags added: verification-done

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1843085

Title:
  Backport of zero-length gc chain fixes to Luminous

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive queens series:
  Fix Committed
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in ceph package in Ubuntu:
  Invalid
Status in ceph source package in Bionic:
  Fix Committed

Bug description:
  [Impact]
  Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. Rados gateway garbage collection does not efficiently process and clean up these zero-length chains.

  A large number of zero-length chains causes rgw processes to spin
  through the garbage collection lists while doing very little useful
  work, which can lead to abnormally high CPU utilization and an
  elevated op load on the gc objects.
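
  A quick way to gauge whether a cluster is affected is to count gc
  entries with an empty chain. This is only a sketch: it assumes the
  Luminous-era JSON output of `radosgw-admin gc list --include-all`,
  where each entry exposes its chain as an "objs" array, and that jq is
  available.
  ```
  # Count gc entries whose chain is empty (the "objs" field name is an
  # assumption; adjust if your release formats the output differently).
  sudo radosgw-admin gc list --include-all | \
    jq '[.[] | select((.objs // []) | length == 0)] | length'
  ```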

  [Test Case]
  Modify garbage collection parameters by editing ceph.conf on the target rgw:
  ```
  rgw enable gc threads = false
  rgw gc obj min wait = 60
  rgw gc processor period = 60
  ```

  Restart the ceph-radosgw service to apply the new configuration:
  `sudo systemctl restart ceph-radosgw@rgw.$HOSTNAME`
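
  Optionally, confirm the running daemon picked up the new values via
  its admin socket. The daemon name used below (client.rgw.$HOSTNAME) is
  an assumption and may differ in your deployment:
  ```
  # Query the live configuration over the admin socket.
  sudo ceph daemon client.rgw.$HOSTNAME config get rgw_gc_obj_min_wait
  sudo ceph daemon client.rgw.$HOSTNAME config get rgw_gc_processor_period
  ```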

  Repeatedly interrupt 512MB object put requests for randomized object names:
  ```
  for i in {0..1000}; do 
    f=$(mktemp); fallocate -l 512M $f
    s3cmd put $f s3://test_bucket --disable-multipart &
    pid=$!
    sleep $((RANDOM % 7 + 3)); kill $pid
    rm $f
  done
  ```
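
  Each interrupted put should leave orphaned tail objects behind in the
  data pool. As a rough sanity check before garbage collection runs, the
  pool's object count can be taken (use credentials that can read the
  pool, e.g. the CEPH_ARGS export shown further below):
  `sudo -E rados -p default.rgw.buckets.data ls | wc -l`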

  Delete all objects in the bucket index:
  ```
  for f in $(s3cmd ls s3://test_bucket | awk '{print $4}'); do
    s3cmd del $f
  done
  ```
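
  The bucket listing should now come back empty:
  `s3cmd ls s3://test_bucket`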

  By default rgw_max_gc_objs is 32, so the garbage collection list is split across 32 shard objects (gc.0 through gc.31).
  Capture omap detail and verify zero-length chains were left over:
  ```
  export CEPH_ARGS="--id=rgw.$HOSTNAME"
  for i in {0..31}; do 
    sudo -E rados -p default.rgw.log --namespace gc listomapvals gc.$i
  done
  ```
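
  For a quicker per-shard summary, the omap keys can simply be counted
  instead of dumping the full values; this is just a convenience variant
  of the loop above:
  ```
  # Count leftover gc omap entries in each shard object.
  for i in {0..31}; do
    echo -n "gc.$i: "
    sudo -E rados -p default.rgw.log --namespace gc listomapkeys gc.$i | wc -l
  done
  ```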

  Confirm the garbage collection list contains expired objects by listing expiration timestamps:
  `sudo -E radosgw-admin gc list | grep time; date`

  Raise the debug level and process the garbage collection list:
  `sudo -E radosgw-admin --debug-rgw=20 --err-to-stderr gc process`
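
  Alternatively, capture the debug output to a file and search it for
  the entry tags seen in the omap dump. The grep pattern below is a
  placeholder, and the exact log wording varies by release:
  ```
  # Capture the debug output for later inspection.
  sudo -E radosgw-admin --debug-rgw=20 --err-to-stderr gc process 2> /tmp/gc_process.log
  # Replace <tag> with an entry tag observed in the omap dump above.
  grep '<tag>' /tmp/gc_process.log
  ```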

  Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up:
  `sudo -E rados -p default.rgw.buckets.data ls`
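
  After a successful run the garbage collection list should be empty as
  well, including unexpired entries:
  `sudo -E radosgw-admin gc list --include-all`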

  [Regression Potential]
  The backport has been accepted into the upstream Luminous stable branch.

  [Other Information]
  This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
  * adds additional logging to make future debugging easier
  * resolves a bug where the truncated flag was not always set correctly in gc_iterate_entries
  * resolves a bug where the marker in RGWGC::process was not advanced
  * resolves a bug in which gc entries with a zero-length chain were not trimmed
  * resolves a bug where the same gc entry tag was added to the deletion list multiple times

  These fixes were slated for backport into Luminous and Mimic, but the
  Luminous work was not completed because of a required dependency: AIO
  GC [2]. This dependency has been resolved upstream and is pending SRU
  verification in the Ubuntu packages [3].

  [0] https://tracker.ceph.com/issues/38454
  [1] https://github.com/ceph/ceph/pull/26601
  [2] https://tracker.ceph.com/issues/23223
  [3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1843085/+subscriptions


