[Bug 1669227] [NEW] Marking unfound objects lost causes OSD to crash

Thu Mar 2 02:56:36 UTC 2017

Public bug reported:

On Firefly and Hammer, marking an unfound object as lost causes OSDs to
crash due to a failed assertion in the ordering of the journal entries
(info.last_complete == info.last_update, raised in
ReplicatedPG::recover_got). The problem seems to be that the unfound
objects are removed from the set of objects that need to be recovered
locally, but the pg log pointer is not advanced.
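
To illustrate the suspected failure mode, here is a minimal toy model in
Python. It is not Ceph code: the ToyPG class, its methods and the sample
log are all hypothetical, and mark_lost_fixed only shows one plausible
way to keep the bookkeeping consistent, not what the upstream patch
does. It only shows how dropping an object from the missing set without
advancing the pointer leaves last_complete behind last_update.

```python
class ToyPG:
    """Toy stand-in for a PG's recovery bookkeeping (hypothetical, not Ceph code)."""

    def __init__(self, log, missing):
        self.log = list(log)            # ordered (version, object_name) entries
        self.missing = set(missing)     # objects still to be recovered locally
        self.last_update = self.log[-1][0]
        self.last_complete = 0
        self.advance()

    def advance(self):
        """Move last_complete forward until the first entry whose object is missing."""
        for version, name in self.log:
            if name in self.missing:
                break
            self.last_complete = max(self.last_complete, version)

    def invariant_holds(self):
        """With nothing left to recover, last_complete must equal last_update."""
        return bool(self.missing) or self.last_complete == self.last_update

    def mark_lost_buggy(self, name):
        self.missing.discard(name)      # pointer is left where it was

    def mark_lost_fixed(self, name):
        self.missing.discard(name)
        self.advance()                  # pointer catches up with the log


log = [(1, "a"), (2, "b"), (3, "c")]

buggy = ToyPG(log, missing={"b"})
buggy.mark_lost_buggy("b")
print(buggy.last_complete, buggy.last_update, buggy.invariant_holds())

fixed = ToyPG(log, missing={"b"})
fixed.mark_lost_fixed("b")
print(fixed.last_complete, fixed.last_update, fixed.invariant_holds())
```

In the buggy path the missing set is empty while last_complete still
points before the lost object, which is the state the assertion in the
backtrace rejects.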

This has been raised as an upstream bug in:
http://tracker.ceph.com/issues/13468

This issue was fixed upstream in https://github.com/ceph/ceph/pull/6841

Backtrace for crash is:

#0  0x00007fd7b59781fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00005593051cf9da in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x00007fd7b401dc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007fd7b4021028 in __GI_abort () at abort.c:89
#6  0x00007fd7b4928535 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#7  0x00007fd7b49266d6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:38
#8  0x00007fd7b4926703 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007fd7b4926922 in __cxxabiv1::__cxa_throw (obj=0x559310e37820, tinfo=0x55930574ea50 <typeinfo for ceph::FailedAssertion>, dest=0x0) at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:87
#10 0x00005593052b3932 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x55930539d520 "info.last_complete == info.last_update", file=file@entry=0x5593053cca40 "osd/ReplicatedPG.cc",
    line=line@entry=9015, func=func@entry=0x5593053d7160 <ReplicatedPG::recover_got(hobject_t, eversion_t)::__PRETTY_FUNCTION__> "void ReplicatedPG::recover_got(hobject_t, eversion_t)")
    at common/assert.cc:77
#11 0x0000559304ffd009 in ReplicatedPG::recover_got (this=this@entry=0x559330da2000, oid=..., v=...) at osd/ReplicatedPG.cc:9015
#12 0x0000559305004c48 in ReplicatedPG::on_local_recover (this=0x559330da2000, hoid=..., stat_diff=..., _recovery_info=..., obc=..., t=0x55931ac71cc0) at osd/ReplicatedPG.cc:243
#13 0x0000559305169eb2 in ECBackend::handle_recovery_push (this=this@entry=0x55932d072f40, op=..., m=m@entry=0x7fd78a8fc440) at osd/ECBackend.cc:313
#14 0x000055930516c656 in ECBackend::handle_message (this=0x55932d072f40, _op=...) at osd/ECBackend.cc:690
#15 0x0000559304fefbab in ReplicatedPG::do_request (this=0x559330da2000, op=..., handle=...) at osd/ReplicatedPG.cc:1114
#16 0x0000559304e4a381 in OSD::dequeue_op (this=0x5593075cf400, pg=..., op=..., handle=...) at osd/OSD.cc:7872
#17 0x0000559304e65724 in OSD::OpWQ::_process (this=0x5593075d0258, pg=..., handle=...) at osd/OSD.cc:7842
#18 0x0000559304ea7ecc in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x5593075d0258, handle=...)
    at ./common/WorkQueue.h:191
#19 0x00005593052a42c1 in ThreadPool::worker (this=0x5593075cf870, wt=0x5593076b9dd0) at common/WorkQueue.cc:128
#20 0x00005593052a51b0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#21 0x00007fd7b5970184 in start_thread (arg=0x7fd78a8fd700) at pthread_create.c:312
#22 0x00007fd7b40e137d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: sts

** Tags added: sts

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1669227
