[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

Matthias Hüther 1874939 at bugs.launchpad.net
Mon Jan 11 09:58:35 UTC 2021


I have the same issue.
That's why I've been testing a few things over the last few days:

Upgrade process:
Luminous -> Mimic -> Nautilus -> Octopus
(all versions running under Bionic)
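
The release actually running at each step can be confirmed on a monitor
node with the standard CLI:

ceph versions    # per-daemon-type summary of the running releases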


It doesn't matter whether I activate msgr2 or not. I always get the problem after upgrading to Octopus:
2021-01-11T09:46:33.674+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
2021-01-11T09:47:04.490+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
2021-01-11T09:47:34.514+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
2021-01-11T09:48:05.451+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
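
For anyone reproducing this: activating msgr2 is the standard monitor
command, run once all monitors are on Nautilus or newer:

ceph mon enable-msgr2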

With a fresh installation of Ceph Mimic, upgraded to Nautilus -> Octopus, I don't get this problem.
The problem apparently originates in the upgrade from Luminous to Mimic, and it shows up at the latest once the cluster reaches Octopus.

Workaround: execute this command on one of the Ceph monitors:
ceph osd require-osd-release mimic
After that, the Octopus OSDs can connect again.
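
The flag can be inspected before and after applying the workaround; a
minimal check on a monitor node (the output wording may differ slightly
between releases):

ceph osd dump | grep require_osd_release   # still shows the old release on an affected cluster
ceph osd require-osd-release mimic         # raise it to the release the OSDs actually run
ceph osd dump | grep require_osd_release   # verify the change took effect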

Perhaps it is a good idea to run the "ceph osd require-osd-release
[version]" command after every upgrade, e.g.:
After the upgrade Luminous -> Mimic:
-> command: ceph osd require-osd-release mimic

After the upgrade Mimic -> Nautilus:
-> command: ceph osd require-osd-release nautilus

After the upgrade Nautilus -> Octopus:
-> command: ceph osd require-osd-release octopus

Apparently this is not done by the charms yet. Maybe the charms should
do that or it should be mentioned in the charm documentation. What do
you think about that?
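
To illustrate, here is a rough sketch of what such a post-upgrade step
could look like (a hypothetical helper, not existing charm code; it
only bumps the flag once every OSD reports the target release):

#!/bin/sh
# Hypothetical post-upgrade helper (sketch), run on a monitor node.
# Usage: require-release.sh <target-release>, e.g. require-release.sh octopus
target="$1"

# "ceph osd versions" summarizes the releases the OSDs report; refuse
# to act while any OSD is still on a release other than the target.
if ceph osd versions | grep 'ceph version' | grep -qv "$target"; then
    echo "some OSDs still run a release other than $target; not changing the flag" >&2
    exit 1
fi

ceph osd require-osd-release "$target"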

-- 
https://bugs.launchpad.net/bugs/1874939

Title:
  ceph-osd can't connect after upgrade to focal

Status in ceph package in Ubuntu:
  Incomplete

Bug description:
  Upon upgrading a Ceph node with do-release-upgrade from eoan to focal,
  the OSD doesn't connect after the upgrade. I rolled back the change
  (VBox snapshot) and tried again, same result. I also tried to hold
  back the Ceph packages and upgrade after the fact, but again same
  result.

  Expected behavior: the OSD connects to the cluster after the upgrade.

  Actual behavior: the OSD log shows endlessly repeated
  'tick_without_osd_lock' messages, and the OSD stays down from the
  perspective of the cluster.

  Extract from debug log of OSD:

  2020-04-24T16:25:35.811-0700 7fd70e83d700  5 osd.0 16499 heartbeat osd_stat(store_statfs(0x44990000/0x40000000/0x240000000, data 0x14bb97877/0x1bb660000, compress 0x0/0x0/0x0, omap 0x2bbf, meta 0x3fffd441), peers [] op hist [])
  2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499 check_full_status cur ratio 0.769796, physical ratio 0.769796, new state none
  2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick
  2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
  2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
  2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
  2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
  2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick
  2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
  2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
  2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
  2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
  2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick
  2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
  2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
  2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
  2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0 queries
  2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
  2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick
  2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
  2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
  2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700

  This repeats over and over again.
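
  (For reference, output at this verbosity can be enabled at runtime
  through the OSD's admin socket; debug_osd is a standard Ceph option,
  shown here as a hint for anyone reproducing the issue.)

  ceph daemon osd.0 config set debug_osd 20/20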

  strace of the process yields lots of unfinished futex access attempts:

  [pid  2130] futex(0x55b17b8e216c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79
  [pid  2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
  [pid  2190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772053, tv_nsec=969572004}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=20189832}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=70811223}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
  [pid  2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
  [pid  2163] io_getevents(0x7fd7272eb000, 1, 16,  <unfinished ...>
  [pid  2134] io_getevents(0x7fd7272fc000, 1, 16,  <unfinished ...>
  [pid  2190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=121288477}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2200] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2200] futex(0x55b17b7108e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
  [pid  2200] futex(0x7ffc4aa8b708, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2200] futex(0x7ffc4aa8b770, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772055, tv_nsec=102644954}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2100] <... futex resumed> )       = 0
  [pid  2100] futex(0x55b17b710838, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2100] futex(0x55b17b7108e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
  [pid  2190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=171906673}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2127] <... clock_nanosleep resumed> NULL) = 0
  [pid  2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000},  <unfinished ...>
  [pid  2190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=222271211}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=273226419}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=323615391}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
  [pid  2163] io_getevents(0x7fd7272eb000, 1, 16,  <unfinished ...>
  [pid  2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
  [pid  2134] io_getevents(0x7fd7272fc000, 1, 16,  <unfinished ...>
  [pid  2190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=373946132}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=424283527}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2127] <... clock_nanosleep resumed> NULL) = 0
  [pid  2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000},  <unfinished ...>
  [pid  2190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=474599677}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=525368586}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
  [pid  2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
  [pid  2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=575839547}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid  2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
  [pid  2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
  [pid  2163] io_getevents(0x7fd7272eb000, 1, 16,  <unfinished ...>
  [pid  2134] io_getevents(0x7fd7272fc000, 1, 16, ^Cstrace: Process 2093 detached


  
  Suspected cause: the OSD cannot connect to the monitor.
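
  One way to check that suspicion from the affected node (mon.a below
  is a placeholder for a real monitor name):

  ceph -s --connect-timeout 10   # does basic client traffic reach the monitors?
  ceph ping mon.a                # ping one specific monitor
  ceph daemon osd.0 status       # the OSD's own view of its state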

  Repeatability: On 5 attempts (3 separate nodes and 2 repetitions) the
  result was the same.

  Research done: I checked Launchpad and the Ceph bug tracker and
  couldn't find anything similar. Tried restarting the process,
  rebooting the node, reverting the change and re-upgrading, holding
  the Ceph packages back and upgrading them manually after
  do-release-upgrade, and stracing the process.

  Impact: right now, upgrading Ceph to 20.04 LTS appears to be broken.

  ProblemType: Bug
  DistroRelease: Ubuntu 20.04
  Package: ceph-osd 15.2.1-0ubuntu1
  ProcVersionSignature: Ubuntu 5.4.0-26.30-generic 5.4.30
  Uname: Linux 5.4.0-26-generic x86_64
  ApportVersion: 2.20.11-0ubuntu27
  Architecture: amd64
  CasperMD5CheckResult: skip
  Date: Fri Apr 24 16:30:53 2020
  ProcEnviron:
   SHELL=/bin/bash
   LANG=en_US.UTF-8
   TERM=xterm-256color
   PATH=(custom, no user)
  SourcePackage: ceph
  UpgradeStatus: Upgraded to focal on 2020-04-24 (0 days ago)
