[Bug 1828617] Re: Hosts randomly 'losing' disks, breaking ceph-osd service enumeration
James Page
james.page at ubuntu.com
Fri Sep 20 07:36:43 UTC 2019
$ apt-cache policy ceph-osd
ceph-osd:
Installed: 13.2.6-0ubuntu0.19.04.4
Candidate: 13.2.6-0ubuntu0.19.04.4
Version table:
*** 13.2.6-0ubuntu0.19.04.4 500
500 http://archive.ubuntu.com/ubuntu disco-proposed/main amd64 Packages
100 /var/lib/dpkg/status
13.2.6-0ubuntu0.19.04.3 500
500 http://nova.clouds.archive.ubuntu.com/ubuntu disco-updates/main amd64 Packages
500 http://security.ubuntu.com/ubuntu disco-security/main amd64 Packages
13.2.4+dfsg1-0ubuntu2 500
500 http://nova.clouds.archive.ubuntu.com/ubuntu disco/main amd64 Packages
disco-proposed tested with a deployment using separate DB and WAL devices; OSDs restarted reliably over 10 reboot iterations across three machines.
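For anyone reproducing the verification, enabling the proposed pocket and pulling the fixed package looks roughly like this (a sketch only; pinning and charm-driven upgrade handling are left out, and the sources.list.d filename is arbitrary):

$ echo 'deb http://archive.ubuntu.com/ubuntu disco-proposed main' | \
      sudo tee /etc/apt/sources.list.d/disco-proposed.list
$ sudo apt update
$ sudo apt install ceph-osd=13.2.6-0ubuntu0.19.04.4

The lsblk output below shows the resulting DB/WAL layout on one of the test machines.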
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 88.7M 1 loop /snap/core/7396
loop1 7:1 0 54.5M 1 loop
loop2 7:2 0 89M 1 loop /snap/core/7713
loop3 7:3 0 54.6M 1 loop /snap/lxd/11964
loop4 7:4 0 54.6M 1 loop /snap/lxd/11985
vda 252:0 0 20G 0 disk
├─vda1 252:1 0 19.9G 0 part /
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 40G 0 disk /mnt
vdc 252:32 0 10G 0 disk
└─ceph--683a8389--9788--4fd5--b59e--bdd69936a768-osd--block--683a8389--9788--4fd5--b59e--bdd69936a768
253:0 0 10G 0 lvm
vdd 252:48 0 10G 0 disk
└─ceph--1fd8022f--e851--4cfa--82aa--64693510c705-osd--block--1fd8022f--e851--4cfa--82aa--64693510c705
253:6 0 10G 0 lvm
vde 252:64 0 10G 0 disk
└─ceph--302bafc8--9981--47a3--b66b--3d84ab550ba5-osd--block--302bafc8--9981--47a3--b66b--3d84ab550ba5
253:3 0 10G 0 lvm
vdf 252:80 0 5G 0 disk
├─ceph--db--28e3b53f--1468--4136--914d--6630343a2a67-osd--db--683a8389--9788--4fd5--b59e--bdd69936a768
│ 253:2 0 1G 0 lvm
├─ceph--db--28e3b53f--1468--4136--914d--6630343a2a67-osd--db--302bafc8--9981--47a3--b66b--3d84ab550ba5
│ 253:5 0 1G 0 lvm
└─ceph--db--28e3b53f--1468--4136--914d--6630343a2a67-osd--db--1fd8022f--e851--4cfa--82aa--64693510c705
253:8 0 1G 0 lvm
vdg 252:96 0 5G 0 disk
├─ceph--wal--40c6b471--3ba2--41d5--9215--eabf391499de-osd--wal--683a8389--9788--4fd5--b59e--bdd69936a768
│ 253:1 0 96M 0 lvm
├─ceph--wal--40c6b471--3ba2--41d5--9215--eabf391499de-osd--wal--302bafc8--9981--47a3--b66b--3d84ab550ba5
│ 253:4 0 96M 0 lvm
└─ceph--wal--40c6b471--3ba2--41d5--9215--eabf391499de-osd--wal--1fd8022f--e851--4cfa--82aa--64693510c705
253:7 0 96M 0 lvm
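The layout above (three data LVs, each with a 1G DB LV on vdf and a 96M WAL LV on vdg) is the topology you get when OSDs are created with separate block.db and block.wal volumes. A hedged sketch of creating one such OSD by hand (VG/LV names are illustrative; the charm-driven deployment here used its own tooling):

$ sudo vgcreate ceph-db /dev/vdf
$ sudo vgcreate ceph-wal /dev/vdg
$ sudo lvcreate -L 1G -n osd-db-0 ceph-db
$ sudo lvcreate -L 96M -n osd-wal-0 ceph-wal
$ sudo ceph-volume lvm create --bluestore --data /dev/vdc \
      --block.db ceph-db/osd-db-0 --block.wal ceph-wal/osd-wal-0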
** Tags removed: verification-needed verification-needed-disco
** Tags added: verification-done verification-done-disco
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1828617
Title:
Hosts randomly 'losing' disks, breaking ceph-osd service enumeration
Status in Ubuntu Cloud Archive:
Fix Committed
Status in Ubuntu Cloud Archive queens series:
In Progress
Status in Ubuntu Cloud Archive rocky series:
Fix Committed
Status in Ubuntu Cloud Archive stein series:
Fix Committed
Status in Ubuntu Cloud Archive train series:
In Progress
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Bionic:
Fix Committed
Status in ceph source package in Disco:
Fix Committed
Status in ceph source package in Eoan:
Fix Released
Bug description:
[Impact]
For deployments where the bluestore DB and WAL devices are on block devices separate from the OSD data device, it's possible on reboot that the LVs configured on these devices have not yet been scanned and detected; the OSD boot process ignores this and tries to start the OSD as soon as the primary LV supporting the OSD is detected, resulting in the OSD crashing because the required block device symlinks are not present.
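In effect the race is: the data LV for the OSD appears, the OSD is activated, and ceph-osd then crashes because block.db/block.wal point at LVs that have not shown up yet. A minimal sketch of the condition the fix waits for (the OSD id and path are illustrative; this is not the actual upstream patch, which lives in ceph-volume's activation path):

# succeed only once the data, DB and WAL symlinks all resolve
OSD_DIR=/var/lib/ceph/osd/ceph-1
for link in block block.db block.wal; do
    # -e follows the symlink, so it fails while the backing LV is missing
    [ -e "$OSD_DIR/$link" ] || { echo "$link not ready yet"; exit 1; }
done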
[Test Case]
Deploy ceph with bluestore + separate DB and WAL devices.
Reboot servers
OSD devices will fail to start after reboot (it's a race, so not always).
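One hedged way to spot the failure after the reboot (unit names follow the ceph-osd@<id> systemd template; the OSD ids on an affected host will differ):

$ systemctl --failed --type=service | grep ceph-osd
$ journalctl -b -u ceph-osd@1 | tail
$ ls -l /var/lib/ceph/osd/ceph-1/   # block.db / block.wal may be dangling symlinks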
[Regression Potential]
Low - the fix has landed upstream and simply ensures that if separate LVs are expected for the DB and WAL devices of an OSD, the OSD will not try to boot until they are present.
[Original Bug Report]
Ubuntu 18.04.2 Ceph deployment.
Ceph OSD devices utilizing LVM volumes pointing to udev-based physical devices.
The LVM module is supposed to create PVs from devices using the links in the /dev/disk/by-dname/ folder that are created by udev.
However, on reboot it sometimes happens (not always, more like a race condition) that the Ceph services cannot start and pvdisplay shows no volumes, even though the /dev/disk/by-dname/ folder has all the necessary device links created by the end of the boot process.
The behaviour can be fixed manually by running "/sbin/lvm pvscan --cache --activate ay /dev/nvme0n1" (as root) to re-activate the LVM components, after which the services can be started.
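Putting the workaround together, a hedged recovery sequence (the PV device and OSD id are examples; substitute the devices backing the missing VGs):

# re-scan and auto-activate the LVs behind the missing PV
$ sudo /sbin/lvm pvscan --cache --activate ay /dev/nvme0n1
# then start the affected OSD services again
$ sudo systemctl start ceph-osd@1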
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1828617/+subscriptions