[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal
Christian Huebner
1874939 at bugs.launchpad.net
Fri May 1 00:49:45 UTC 2020
I tried to do the upgrade by hand (disable all the services that can not
be autostarted, do the upgrade (btw, a manpage has been moved from ceph-
deploy to ceph-base and thus the apt upgrade fails. do-release-upgrade
is using --force-overwrite for this, but that's not a clean solution).
Solution is to first uninstall ceph-deploy and then do the upgrade, but
this should be fixed.
I restarted all services manually in the correct order. mon and mgr work
fine, the OSDs do not.
The result is mostly the same. This time at least all OSDs came up, but
like before they hang in peering. I'll continue research on this. The
OSDs still log that they are waiting for a new monmap.
Although started from the 15.2.1 binary they show up in Ceph report as
14.2.8, probably because they have not been converted yet (which should
automatically happen when the OSDs connect to the monitors for the first
time). Next step is tracing the OSDs to see where they hang, but
probably still some futex deadlock.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1874939
Title:
ceph-osd can't connect after upgrade to focal
Status in ceph package in Ubuntu:
New
Bug description:
Upon upgrading a Ceph node with do-release-upgrade from eoan to focal,
the OSD doesn't connect after the upgrade. I rolled back the change
(VBox snapshot) and tried again, same result. I also tried to hold
back the Ceph packages and upgrade after the fact, but again same
result.
Epected behavior: OSD connects to cluster after upgrade.
Actual behavior: OSD log shows endlessly repeated
'tick_without_osd_lock' messages. OSD will stay down from perspective
of the cluster.
Extract from debug log of OSD:
2020-04-24T16:25:35.811-0700 7fd70e83d700 5 osd.0 16499 heartbeat osd_stat(store_statfs(0x44990000/0x40000000/0x240000000, data 0x14bb97877/0x1bb660000, compress 0x0/0x0/0x0, omap 0x2bbf, meta 0x3fffd441), peers [] op hist [])
2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499 check_full_status cur ratio 0.769796, physical ratio 0.769796, new state none
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0 queries
2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
This repeats over and over again.
strace of the process yields lots of unfinished futex access attempts:
[pid 2130] futex(0x55b17b8e216c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79
[pid 2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772053, tv_nsec=969572004}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=20189832}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=70811223}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=121288477}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2200] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2200] futex(0x55b17b7108e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 2200] futex(0x7ffc4aa8b708, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2200] futex(0x7ffc4aa8b770, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772055, tv_nsec=102644954}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2100] <... futex resumed> ) = 0
[pid 2100] futex(0x55b17b710838, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2100] futex(0x55b17b7108e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=171906673}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2127] <... clock_nanosleep resumed> NULL) = 0
[pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=222271211}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=273226419}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=323615391}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=373946132}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=424283527}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2127] <... clock_nanosleep resumed> NULL) = 0
[pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=474599677}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=525368586}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=575839547}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, ^Cstrace: Process 2093 detached
Suspected cause: OSD can not connect to monitor.
Repeatability: On 5 attempts (3 separate nodes and 2 repetitions) the
result was the same.
Research done: I checked Launchpad and the Ceph bug tracker, couldn't
find something similar. Tried restart of process, reboot of node,
revert change and re-upgrade, hold Ceph packages and manually upgrade
after do-release-upgrade, strace of the process.
Impact: Right now upgrade of Ceph to 20.04 LTS appears to be broken
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: ceph-osd 15.2.1-0ubuntu1
ProcVersionSignature: Ubuntu 5.4.0-26.30-generic 5.4.30
Uname: Linux 5.4.0-26-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 24 16:30:53 2020
ProcEnviron:
SHELL=/bin/bash
LANG=en_US.UTF-8
TERM=xterm-256color
PATH=(custom, no user)
SourcePackage: ceph
UpgradeStatus: Upgraded to focal on 2020-04-24 (0 days ago)
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list