[Bug 1959649] [NEW] BlueFS spillover detected for particular OSDs
nikhil kshirsagar
1959649 at bugs.launchpad.net
Tue Feb 1 05:43:34 UTC 2022
Public bug reported:
This is the issue described in https://tracker.ceph.com/issues/38745,
where ceph health detail shows messages like:
sudo ceph health detail
HEALTH_WARN 3 OSD(s) experiencing BlueFS spillover; mon juju-6879b7-6-lxd-1 is low on available space
[WRN] BLUEFS_SPILLOVER: 3 OSD(s) experiencing BlueFS spillover <---
osd.41 spilled over 66 MiB metadata from 'db' device (3.0 GiB used of 29 GiB) to slow device
osd.96 spilled over 461 MiB metadata from 'db' device (3.0 GiB used of 29 GiB) to slow device
osd.105 spilled over 198 MiB metadata from 'db' device (3.0 GiB used of 29 GiB) to slow device
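To see the per-OSD BlueFS usage behind these warnings, the bluefs perf
counters can be inspected on the OSD's host. A minimal sketch, assuming
admin-socket access to the running OSD (the bluefs section reports values
such as db_total_bytes, db_used_bytes and slow_used_bytes):

sudo ceph daemon osd.41 perf dump bluefs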
The BlueFS spillover is very likely caused by RocksDB's level-based
sizing.
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
has a statement about this leveled sizing.
Between versions 15.2.6 and 15.2.10, if the value of
bluestore_volume_selection_policy is not set to use_some_extra, this
issue can be hit in spite of free space being available, because RocksDB
only uses "leveled" space on the NVMe partition. The level sizes are
300MB, 3GB, 30GB and 300GB, and any DB space above the largest level that
fits on the DB device automatically ends up on the slow device. For
example, with a 29 GiB DB device the 30GB level does not fit, so only
about 3GB is effectively used (matching the "3.0 GiB used of 29 GiB"
above) and the remaining metadata spills over.
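To check which policy the OSDs are currently running with, something like
the following should work (a sketch, assuming options are managed through
the mon config database; they may also be set in ceph.conf):

ceph config get osd bluestore_volume_selection_policy
# or, on the OSD's host, for one specific running daemon:
ceph daemon osd.41 config get bluestore_volume_selection_policy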
There is also a discussion at
https://www.mail-archive.com/ceph-users@ceph.io/msg05782.html
Running compaction on the database, i.e. ceph tell osd.XX compact
(replacing XX with the OSD number), can work around the issue
temporarily, but it does not address the root cause.
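For example, for the OSDs reported above (this only reclaims space until
the metadata grows again):

sudo ceph tell osd.41 compact
sudo ceph tell osd.96 compact
sudo ceph tell osd.105 compact
sudo ceph health detail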
I am also pasting some notes Dongdong mentions on SF case 00326782,
where the proper fix is to either:
A. Redeploy the OSDs with a larger DB lvm/partition.
OR
B. Migrate to a new, larger DB lvm/partition. This can be done offline
with ceph-volume lvm migrate (see
https://docs.ceph.com/en/octopus/ceph-volume/lvm/migrate/), but it
requires upgrading the cluster to 15.2.14 first; a sketch of the
commands follows the next paragraph.
Option A is much safer but more time-consuming. Option B is much faster,
but it is recommended to do it on one node first and wait/monitor for a
couple of weeks before moving forward.
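A minimal sketch of option B for one OSD, under the assumption that a
new, larger LV has already been created (the VG/LV names and OSD id
below are placeholders; the exact --from arguments should be checked
against the ceph-volume lvm migrate documentation linked above):

systemctl stop ceph-osd@41
# Move the existing DB, plus anything that spilled onto the main device,
# to the new LV. The OSD fsid can be found with 'ceph osd metadata 41'
# or 'ceph-volume lvm list'.
ceph-volume lvm migrate --osd-id 41 --osd-fsid <osd-fsid> --from data db --target new-db-vg/new-db-lv
systemctl start ceph-osd@41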
As mentioned above, to avoid running into the issue even with free space
available, the value of bluestore_volume_selection_policy should be set
to use_some_extra for all OSDs. 15.2.6 already has
bluestore_volume_selection_policy, but the default was only changed to
use_some_extra from 15.2.11 onwards
(https://tracker.ceph.com/issues/47053).
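A sketch of setting this through the mon config database (the option is
read when the OSD starts, so an OSD restart is most likely needed for it
to take effect; adjust to however the cluster's configuration is
managed):

ceph config set osd bluestore_volume_selection_policy use_some_extra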
** Affects: ceph (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1959649
Title:
BlueFS spillover detected for particular OSDs
Status in ceph package in Ubuntu:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1959649/+subscriptions