[Bug 1996010] Re: [SRU] the leak in bluestore_cache_other mempool

Lucas Kanashiro 1996010 at bugs.launchpad.net
Thu Jun 15 18:47:10 UTC 2023


As Robie mentioned in comment #8, it is not clear to me if this SRU to
Focal will be handled by the OpenStack team or if you want help to get
this landed. Could you please clarify that? In case the OpenStack team
is going to handle this, please unsubscribe ~ubuntu-sponsors.

I just took a quick look, and your debdiff in comment #1 is outdated; you
need to rebase your changes against the latest version in focal-updates,
which is 15.2.17-0ubuntu0.20.04.4.
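
For reference, refreshing the debdiff could look roughly like the
following (version numbers and file names below are only an example;
adjust to however the original delta was prepared):

  pull-lp-source ceph focal        # or otherwise fetch 15.2.17-0ubuntu0.20.04.4
  cd ceph-15.2.17
  # re-apply your delta, e.g. add the fix under debian/patches with quilt
  dch -i                           # new changelog entry targeting focal
  debuild -S -d -us -uc
  cd ..
  debdiff ceph_15.2.17-0ubuntu0.20.04.4.dsc \
          ceph_15.2.17-0ubuntu0.20.04.5.dsc > ceph_focal.debdiff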

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1996010

Title:
  [SRU] the leak in bluestore_cache_other mempool

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in ceph package in Ubuntu:
  Confirmed
Status in ceph source package in Focal:
  Confirmed
Status in ceph source package in Jammy:
  Fix Released
Status in ceph source package in Kinetic:
  Fix Released
Status in ceph source package in Lunar:
  Fix Released

Bug description:
  [Impact]

  This issue has been observed on Ceph Octopus starting from 15.2.16.
  BlueStore's onode cache can end up completely disabled because of an entry leak in the bluestore_cache_other mempool.

  The log below shows the cache's maximum size had become 0:
  ------
  2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608
  -------

  According to dump_mempools, bluestore_cache_other had consumed the vast majority of the cache due to the leak, while only 3 onodes (2 of them pinned) were in the cache:
  ---------------
  "bluestore_cache_onode": {
  "items": 3,
  "bytes": 1848
  },
  "bluestore_cache_meta": {
  "items": 13973,
  "bytes": 111338
  },
  "bluestore_cache_other": {
  "items": 5601156,
  "bytes": 224152996
  },
  "bluestore_Buffer": {
  "items": 1,
  "bytes": 96
  },
  "bluestore_Extent": {
  "items": 20,
  "bytes": 960
  },
  "bluestore_Blob": {
  "items": 8,
  "bytes": 832
  },
  "bluestore_SharedBlob": {
  "items": 8,
  "bytes": 896
  },
  --------------
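
  As a quick check on a running OSD, something like the following should
  surface the same counters (osd.0 is just an example; the exact JSON
  layout can differ between releases):

     ceph daemon osd.0 dump_mempools | grep -A 2 -E '"bluestore_cache_(onode|other)"'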

  
  This can cause I/O to experience high latency, as a 0-sized cache significantly increases the need to fetch metadata from RocksDB or even from disk.
  Another impact is that it significantly increases the chance of hitting the race condition in Onode::put [2], which will crash OSDs, especially in large clusters.

  [Test Case]

  1. Deploy a 15.2.16 ceph cluster

  2. Create enough rbd images to spread all over the OSDs

  3. Stress them with a fio 4k randwrite workload in parallel until the
  OSDs have accumulated enough onodes in their caches (more than 60k
  onodes; you will see bluestore_cache_other exceed 1 GB):

     fio --name=randwrite --rw=randwrite --ioengine=rbd --bs=4k --direct=1 \
         --numjobs=1 --size=100G --iodepth=16 --clientname=admin \
         --pool=bench --rbdname=test

  4. Shrink pg_num to a very low number so that there is roughly 1 PG
  per OSD, and wait for the shrink to finish.

  5. Enable debug_bluestore=20/20. A 0-sized onode cache can then be
  observed by grepping the OSD log for max_shard_onodes, and the leaked
  bluestore_cache_other mempool via "ceph daemon osd.id dump_mempools"
  (example commands below).
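
  Roughly, the steps above translate into commands like the following
  (pool and image names match the fio invocation; PG counts, the OSD id
  and the log path are only examples):

     # 2. create a pool and a test image
     ceph osd pool create bench 256
     rbd pool init bench
     rbd create bench/test --size 100G

     # 4. shrink the pool so each OSD ends up with roughly one PG
     ceph osd pool set bench pg_num 1

     # 5. raise bluestore debugging, then look for a zero onode allotment
     ceph config set osd debug_bluestore 20/20
     grep max_shard_onodes /var/log/ceph/ceph-osd.0.log

     # and inspect the leaked mempool (run on the OSD host)
     ceph daemon osd.0 dump_mempools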

  [Potential Regression]
  The patch corrects the incorrect AU (allocation unit) accounting for the bluestore_cache_other pool, so it is not expected to introduce any regression.

  [Other Info]
  The patch [1] has been backported to upstream Pacific and Quincy, but not to Octopus.
  Pacific will get it in 16.2.11, which is still pending.
  Quincy already has it in 17.2.4.

  We'll need to backport this fix to Octopus.
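
  Until an Octopus point release picks it up, one plausible route is to
  carry the upstream change as a quilt patch in the packaging; the file
  name below is only an example, and the patch may need context
  adjustments for the Octopus code base:

     # GitHub serves the pull request in patch form
     wget -O fix-bluestore-cache-other-leak.patch \
         https://github.com/ceph/ceph/pull/46911.patch
     quilt import fix-bluestore-cache-other-leak.patch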

  [1]https://github.com/ceph/ceph/pull/46911

  [2]https://tracker.ceph.com/issues/56382

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1996010/+subscriptions




More information about the Ubuntu-openstack-bugs mailing list