[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal
James Page
1874939 at bugs.launchpad.net
Fri May 22 04:59:24 UTC 2020
As a side note - even if there is a bug here (and it sounds like there
might be) I would recommend placing the mon and mgr daemons in LXD
containers ontop of the machines hosting the osd's - this will allow you
to manage them independently from an upgrade process for both ceph
upgrades and ubuntu release upgrades.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1874939
Title:
ceph-osd can't connect after upgrade to focal
Status in ceph package in Ubuntu:
In Progress
Bug description:
Upon upgrading a Ceph node with do-release-upgrade from eoan to focal,
the OSD doesn't connect after the upgrade. I rolled back the change
(VBox snapshot) and tried again, same result. I also tried to hold
back the Ceph packages and upgrade after the fact, but again same
result.
Epected behavior: OSD connects to cluster after upgrade.
Actual behavior: OSD log shows endlessly repeated
'tick_without_osd_lock' messages. OSD will stay down from perspective
of the cluster.
Extract from debug log of OSD:
2020-04-24T16:25:35.811-0700 7fd70e83d700 5 osd.0 16499 heartbeat osd_stat(store_statfs(0x44990000/0x40000000/0x240000000, data 0x14bb97877/0x1bb660000, compress 0x0/0x0/0x0, omap 0x2bbf, meta 0x3fffd441), peers [] op hist [])
2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499 check_full_status cur ratio 0.769796, physical ratio 0.769796, new state none
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0 queries
2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
This repeats over and over again.
strace of the process yields lots of unfinished futex access attempts:
[pid 2130] futex(0x55b17b8e216c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79
[pid 2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772053, tv_nsec=969572004}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=20189832}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=70811223}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=121288477}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2200] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2200] futex(0x55b17b7108e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 2200] futex(0x7ffc4aa8b708, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2200] futex(0x7ffc4aa8b770, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772055, tv_nsec=102644954}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2100] <... futex resumed> ) = 0
[pid 2100] futex(0x55b17b710838, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2100] futex(0x55b17b7108e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=171906673}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2127] <... clock_nanosleep resumed> NULL) = 0
[pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=222271211}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=273226419}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=323615391}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=373946132}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=424283527}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2127] <... clock_nanosleep resumed> NULL) = 0
[pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=474599677}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=525368586}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=575839547}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, ^Cstrace: Process 2093 detached
Suspected cause: OSD can not connect to monitor.
Repeatability: On 5 attempts (3 separate nodes and 2 repetitions) the
result was the same.
Research done: I checked Launchpad and the Ceph bug tracker, couldn't
find something similar. Tried restart of process, reboot of node,
revert change and re-upgrade, hold Ceph packages and manually upgrade
after do-release-upgrade, strace of the process.
Impact: Right now upgrade of Ceph to 20.04 LTS appears to be broken
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: ceph-osd 15.2.1-0ubuntu1
ProcVersionSignature: Ubuntu 5.4.0-26.30-generic 5.4.30
Uname: Linux 5.4.0-26-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 24 16:30:53 2020
ProcEnviron:
SHELL=/bin/bash
LANG=en_US.UTF-8
TERM=xterm-256color
PATH=(custom, no user)
SourcePackage: ceph
UpgradeStatus: Upgraded to focal on 2020-04-24 (0 days ago)
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list