[Bug 1838400] Re: OSD crashes when loading pgs with "FAILED assert(interval.last > last)"

Thu Apr 9 09:38:20 UTC 2020

Upstream activity seems to have dropped off on reproduction and
resolution of this issue in Luminous - is this still an issue?

I'd also note that it's usual to upgrade a xenial deployment to the most
recent updates in the UCA for Queens before upgrading the underlying
Ubuntu series, to avoid any version changes in deployed components during
the series upgrade process.

** Changed in: ceph (Ubuntu)
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1838400

Title:
  OSD crashes when loading pgs with "FAILED assert(interval.last >
  last)"

Status in OpenStack ceph-osd charm:
  Invalid
Status in ceph package in Ubuntu:
  Incomplete

Bug description:
  This issue is tracked at https://tracker.ceph.com/issues/21142

  Today, I hit it when running the following procedure:
  0) nova-compute openstack-origin="cloud:xenial-queens"
  1) juju upgrade-series 19 prepare bionic
  2) apt-get update, dist-upgrade, do-release-upgrade, reboot

  Machine 19 is hyperconverged, and runs nova-compute and ceph-osd.

  Ceph was initially 12.2.8, and was updated to the latest in the
  xenial-queens repo (12.2.12).

  3) "ceph -s" showed half of the cluster with OSDs down due to "FAILED assert(interval.last > last)"
  3.1) I tried to upgrade all ceph packages from 12.2.8 to 12.2.12 but the issue remained.

  In the end, the suggestion from the linked bug was:
  """
  # set [DEFAULT] ceph.conf section to
  debug osd = 10/5

  # from the CLI (for osd.N)
  service ceph-osd@N restart
  # wait until coredump is seen in the logs
  service ceph-osd@N stop
  perl -nle '/pg (\S+) first map.*,\ssame_interval_since/ && print$1' /var/log/ceph/ceph-osd.N.log | sort -u | xargs -I@ bash -xc 'ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N/ --op rm-past-intervals --pgid @'
  """
  Note: running the above on all the OSDs that were down fixed them.
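
  A dry-run variant of the one-liner above can make it easier to review
  which PGs would be touched before anything is modified. The following is
  only a minimal sketch: the function names list_affected_pgs and
  print_repair_commands are hypothetical, the sed expression is an assumed
  equivalent of the perl pattern, and it prints the ceph-objectstore-tool
  commands instead of running them.

```shell
#!/usr/bin/env bash
# Dry-run sketch (function names are illustrative, not part of any ceph tool).

# Print the unique PG IDs from an OSD log, using a sed pattern assumed to be
# equivalent to the perl one-liner quoted in the bug description.
list_affected_pgs() {
    local log_file=$1
    sed -nE 's/.*pg ([^[:space:]]+) first map.*,[[:space:]]same_interval_since.*/\1/p' \
        "$log_file" | sort -u
}

# Print (do not execute) one rm-past-intervals command per affected PG.
print_repair_commands() {
    local osd_id=$1 log_file=$2 pgid
    list_affected_pgs "$log_file" | while read -r pgid; do
        printf 'ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-%s/ --op rm-past-intervals --pgid %s\n' \
            "$osd_id" "$pgid"
    done
}

# Example for osd.N (replace N with the OSD id; the OSD must be stopped
# before any of the printed commands are actually run):
#   print_repair_commands N /var/log/ceph/ceph-osd.N.log
```

  Printing first keeps the destructive step separate from the log parsing,
  so the PG list can be checked against "ceph -s" before running anything.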

  It seems the fix is still pending backport, and has yet to be included
  in the Ubuntu ceph packaging.

  4) once the upgrade of that single compute-storage node was over,
  4.0) juju config nova-compute openstack-origin=distro
  4.1) juju config ceph-osd source=distro
  4.2) juju upgrade-series 19 complete
