[Bug 1828617] Re: Hosts randomly 'losing' disks, breaking ceph-osd service enumeration

Corey Bryant corey.bryant at canonical.com
Fri May 31 14:38:07 UTC 2019


Thanks for testing. That should rule out udev as the cause of the race.

A couple of observations from the log:

* There is a loop for each OSD that calls 'ceph-volume lvm trigger' up to 30 times until the OSD is activated, for example for osd.4 (see the commands sketched after this excerpt):
[2019-05-31 01:27:29,235][ceph_volume.process][INFO  ] Running command: ceph-volume lvm trigger 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,435][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,530][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:35,531][systemd][WARNING] failed activating OSD, retries left: 30
[2019-05-31 01:27:44,122][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:44,174][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:44,175][systemd][WARNING] failed activating OSD, retries left: 29
...
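
For reference, something along these lines should let us inspect the
per-OSD activation units and re-run the trigger by hand; the systemctl
pattern below is based on my understanding of how ceph-volume names its
units, so treat it as a sketch:

  # list the ceph-volume systemd units that perform the activation retries
  systemctl list-units 'ceph-volume@*'

  # manually re-run activation for osd.4, using the id-fsid pair from the log above
  sudo ceph-volume lvm trigger 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6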

I wonder if we could add similar 'ceph-volume lvm trigger' calls that
wait for the WAL and DB devices of each OSD. Does that even make sense?
Or perhaps another call with a similar goal. We should be able to
determine from the lvm tags whether an OSD has a DB or WAL device.
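
For example, assuming the standard ceph-volume lvm tags
(ceph.db_device / ceph.wal_device, per my recollection, so worth
double-checking), something like this should show whether an OSD has a
DB or WAL device attached:

  # dump the ceph-volume tags on each LV and look for ceph.db_device / ceph.wal_device
  sudo lvs -o lv_name,vg_name,lv_tags --noheadings | grep ceph.osd_id

  # or let ceph-volume report the devices and tags for every OSD it knows about
  sudo ceph-volume lvm list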

* The first 3 OSDs to be activated are 18, 4, and 11, and they are the 3 that are missing the block.db/block.wal symlinks (see the check sketched after this excerpt). That's further confirmation that this is a race:
[2019-05-31 01:28:03,370][systemd][INFO  ] successfully trggered activation for: 18-eb5270dc-1110-420f-947e-aab7fae299c9
[2019-05-31 01:28:12,354][systemd][INFO  ] successfully trggered activation for: 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:28:12,530][systemd][INFO  ] successfully trggered activation for: 11-33de740d-bd8c-4b47-a601-3e6e634e489a
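
If it helps to confirm on an affected host, the symlinks in question
should live in the OSD data directories, e.g. (path assumes the default
cluster name 'ceph'):

  # osd.18 was one of the three activated early; check its db/wal symlinks
  ls -l /var/lib/ceph/osd/ceph-18/block.db /var/lib/ceph/osd/ceph-18/block.wal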

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1828617

Title:
  Hosts randomly 'losing' disks, breaking ceph-osd service enumeration

Status in ceph package in Ubuntu:
  New

Bug description:
  Ubuntu 18.04.2 Ceph deployment.

  Ceph OSD devices use LVM volumes that point to udev-based physical devices.
  LVM is supposed to create the PVs from devices using the links in the /dev/disk/by-dname/ folder, which are created by udev.
  However, on reboot it sometimes happens (not always; it looks like a race condition) that the Ceph services cannot start and pvdisplay does not show any volumes. The /dev/disk/by-dname/ folder, however, has all of the necessary device links created by the end of the boot process.

  The behaviour can be fixed manually by running "/sbin/lvm pvscan
  --cache --activate ay /dev/nvme0n1" to re-activate the LVM
  components, after which the services can be started.
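
  For reference, pvscan can also be run without naming a specific
  device, which rescans all devices and activates any complete VGs it
  finds (a sketch only, not verified on the affected hosts):

    # rescan all devices and activate any complete VGs found
    sudo /sbin/lvm pvscan --cache --activate ay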

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1828617/+subscriptions


