[Bug 1804261] Re: Ceph OSD units requires reboot if they boot before vault (and if not unsealed with 150s)

Drew Freiberger 1804261 at bugs.launchpad.net
Tue Sep 29 17:04:15 UTC 2020


The bionic-ussuri package has the retries set to 10000.  In my case,
about 18 hours passed between machine start and vault unseal.  We should
set this high enough to self-heal for up to 5 days after machine start.
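
For a rough sense of scale, here is a small sketch of the arithmetic.
It assumes that 10000 is the number of tries and that there are 5
seconds between tries; the 5-second interval is my assumption (it
would match the 150s in the bug title if the default were 30 tries),
so adjust it to whatever the package actually uses.

```shell
#!/bin/sh
# Assumed values: the interval is NOT taken from the package, it is a guess.
interval=5        # seconds between tries (assumption)
tries=10000       # retry count mentioned above

# Whole hours covered by the current retry count.
covered_hours=$(( tries * interval / 3600 ))

# Tries needed to keep retrying for 5 days at the same interval.
target_seconds=$(( 5 * 24 * 3600 ))
needed_tries=$(( target_seconds / interval ))

echo "current coverage: about ${covered_hours} hours"
echo "tries needed for 5 days: ${needed_tries}"
```

Under those assumptions the current setting covers roughly 13 hours,
which is short of the 18 hours seen here.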

I'm wondering whether vaultlocker-decrypt needs its retries increased
as well.

Here's a workaround I've found for anyone experiencing this
operationally:

After unsealing the vault, run the following two loops on each ceph-osd
unit to decrypt the LVM volumes and then start them, so that the
ceph-osd services can start up:

# Decrypt the volumes first, then activate them; the unit names are the symlink basenames.
for u in /etc/systemd/system/multi-user.target.wants/vaultlocker-decrypt@*; do sudo systemctl start "$(basename "$u")"; done
for u in /etc/systemd/system/multi-user.target.wants/ceph-volume@*; do sudo systemctl start "$(basename "$u")"; done
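
To check the result afterwards, something like the following sketch
could report any of those units that ended up in the failed state.
The list_units helper is a hypothetical name of mine, not part of any
package; it only collects the same unit names the workaround loops
use.

```shell
#!/bin/sh
# list_units DIR PREFIX: print the unit names for entries named PREFIX@* in DIR.
# (Hypothetical helper, for illustration only.)
list_units() {
    dir=$1
    prefix=$2
    for path in "$dir"/"$prefix"@*; do
        # Skip the literal pattern when nothing matches; keep dangling symlinks.
        [ -e "$path" ] || [ -L "$path" ] || continue
        basename "$path"
    done
}

wants=/etc/systemd/system/multi-user.target.wants
for unit in $(list_units "$wants" vaultlocker-decrypt) $(list_units "$wants" ceph-volume); do
    # "systemctl is-failed" exits 0 when the unit is in the failed state.
    if systemctl is-failed --quiet "$unit"; then
        echo "still failed: $unit"
    fi
done
```

No output means none of the units is marked failed; any unit that is
listed can be inspected with systemctl status.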

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1804261

Title:
  Ceph OSD units requires reboot if they boot before vault (and if not
  unsealed with 150s)

Status in OpenStack ceph-osd charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Bionic:
  Fix Released
Status in ceph source package in Disco:
  Won't Fix
Status in ceph source package in Eoan:
  Fix Released
Status in ceph source package in Focal:
  Fix Released

Bug description:
  [Impact]
  Several configuration option values that are read from environment variables are incorrectly parsed as strings rather than ints. As a result, for certain deployment use-cases, the timeouts for starting the ceph-osd volume units cannot be increased to allow dependencies to start first.

  [Test Case]
  Deploy ceph with vault for key management.
  Set a systemd override for ceph-volume@:
    Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000
  Seal the vault units (by restarting the vault service).
  Reboot the ceph-osd machines. The Environment override is ignored because it is not parsed correctly.
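
  The override step could be sketched as a systemd drop-in for the
  ceph-volume@ template unit. The write_tries_override helper and the
  override.conf file name are illustrative assumptions, not something
  the bug report prescribes; the variable name and value come from
  the test case.

```shell
#!/bin/sh
# write_tries_override DROPIN_DIR TRIES
# Writes a drop-in that sets CEPH_VOLUME_SYSTEMD_TRIES for the unit.
# (Hypothetical helper, for illustration only.)
write_tries_override() {
    dropin_dir=$1
    tries=$2
    mkdir -p "$dropin_dir"
    printf '[Service]\nEnvironment=CEPH_VOLUME_SYSTEMD_TRIES=%s\n' "$tries" \
        > "$dropin_dir/override.conf"
}

# Usage on an OSD host (as root), then reload systemd so it sees the drop-in:
#   write_tries_override /etc/systemd/system/ceph-volume@.service.d 2000
#   systemctl daemon-reload
```

  After the reboot, systemctl show on one of the ceph-volume@ units
  with -p Environment should list the variable if the drop-in was
  picked up.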

  [Regression Potential]
  Low - this fix has been accepted upstream in later releases.

  
  [Original Bug Report]
  In a scenario where Ceph is encrypted and uses Vault as the key manager, and where vault and ceph are both stopped, any OSDs on the affected unit(s) will require a further reboot if they try to start before vault is unsealed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-ceph-osd/+bug/1804261/+subscriptions


