[Bug 1599681] [NEW] ceph-osd process hung and blocked ps listings

Brad Marshall 1599681 at bugs.launchpad.net
Thu Jul 7 01:08:07 UTC 2016


Public bug reported:

We ran into a situation over the past couple of days where two different
ceph-osd nodes crashed in such a way that any ps listing hung when it
tried to enumerate the affected process.  Both nodes had a kernel call
trace associated with the hang:

Node 1:
Jul  4 07:46:15 provider-cs-03 kernel: [4188396.493011] ceph-osd        D ffff882029a67b90     0  5312      1 0x00000004
Jul  4 07:46:15 provider-cs-03 kernel: [4188396.590564]  ffff882029a67b90 ffff881037cb8000 ffff8820284f3700 ffff882029a68000
Jul  4 07:46:16 provider-cs-03 kernel: [4188396.688603]  ffff88203296e5a8 ffff88203296e5c0 0000000000000015 ffff8820284f3700
Jul  4 07:46:16 provider-cs-03 kernel: [4188396.789329]  ffff882029a67ba8 ffffffff817ec495 ffff8820284f3700 ffff882029a67bf8
Jul  4 07:46:16 provider-cs-03 kernel: [4188396.891376] Call Trace:
Jul  4 07:46:16 provider-cs-03 kernel: [4188396.939271]  [<ffffffff817ec495>] schedule+0x35/0x80
Jul  4 07:46:16 provider-cs-03 kernel: [4188396.989957]  [<ffffffff817eeb6a>] rwsem_down_read_failed+0xea/0x120
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.041502]  [<ffffffff813dbd84>] call_rwsem_down_read_failed+0x14/0x30
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.092616]  [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.141510]  [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.189877]  [<ffffffff817ee1f0>] ? down_read+0x20/0x30
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.237513]  [<ffffffff81067f18>] __do_page_fault+0x398/0x430
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.285588]  [<ffffffff81067fd2>] do_page_fault+0x22/0x30
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.332936]  [<ffffffff817f1e78>] page_fault+0x28/0x30
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.379495]  [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.426400]  [<ffffffff81039c58>] copy_fpstate_to_sigframe+0x118/0x1d0
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.474904]  [<ffffffff8102d1fd>] get_sigframe.isra.7.constprop.9+0x12d/0x150
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.563204]  [<ffffffff8102d698>] do_signal+0x1e8/0x6d0
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.609783]  [<ffffffff816d19f2>] ? __sys_sendmsg+0x42/0x80
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.656633]  [<ffffffff811b2ed0>] ? handle_mm_fault+0x250/0x540
Jul  4 07:46:16 provider-cs-03 kernel: [4188397.703785]  [<ffffffff8107884c>] exit_to_usermode_loop+0x59/0xa2
Jul  4 07:46:17 provider-cs-03 kernel: [4188397.751367]  [<ffffffff81003a6e>] syscall_return_slowpath+0x4e/0x60
Jul  4 07:46:17 provider-cs-03 kernel: [4188397.799369]  [<ffffffff817efe58>] int_ret_from_sys_call+0x25/0x8f

Node 2:
[733869.727139] CPU: 17 PID: 1735127 Comm: ceph-osd Not tainted 4.4.0-15-generic #31-Ubuntu
[733869.796954] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.3.6 06/03/2015
[733869.927182] task: ffff881841dc6e00 ti: ffff8810cc0a0000 task.ti: ffff8810cc0a0000
[733870.059139] RIP: 0010:[<ffffffff810b479d>]  [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
[733870.192753] RSP: 0000:ffff8810cc0a3bd8  EFLAGS: 00010257
[733870.260298] RAX: 0000000000000000 RBX: ffff8810cc0a3c78 RCX: 0000000000000012
[733870.389322] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8810210a0e00
[733870.517883] RBP: ffff8810cc0a3c40 R08: 0000000000000006 R09: 000000000000013e
[733870.646335] R10: 00000000000003b4 R11: 000000000000001f R12: ffff881018118000
[733870.774514] R13: 0000000000000006 R14: ffff8810210a0e00 R15: 0000000000000379
[733870.902262] FS:  00007fdcfab03700(0000) GS:ffff88203e600000(0000) knlGS:0000000000000000
[733871.031347] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[733871.097820] CR2: 00007fdcfab02c20 CR3: 0000001029204000 CR4: 00000000001406e0
[733871.223381] Stack:
[733871.282453]  ffff8810cc0a3c40 ffffffff811f04ce ffff88102f6e9680 0000000000000012
[733871.404947]  0000000000000077 000000000000008f 0000000000016d40 0000000000000006
[733871.527250]  ffff881841dc6e00 ffff8810cc0a3c78 00000000000001ac 00000000000001b8
[733871.649648] Call Trace:
[733871.707884]  [<ffffffff811f04ce>] ? migrate_page_copy+0x21e/0x530
[733871.770946]  [<ffffffff810b501e>] task_numa_migrate+0x43e/0x9b0
[733871.832808]  [<ffffffff811c9700>] ? page_add_anon_rmap+0x10/0x20
[733871.893897]  [<ffffffff810b5609>] numa_migrate_preferred+0x79/0x80
[733871.954283]  [<ffffffff810b9c24>] task_numa_fault+0x7f4/0xd40
[733872.013128]  [<ffffffff811bdf90>] handle_mm_fault+0xbc0/0x1820
[733872.071309]  [<ffffffff81101420>] ? do_futex+0x120/0x500
[733872.128149]  [<ffffffff812288c5>] ? __fget_light+0x25/0x60
[733872.184044]  [<ffffffff8106a537>] __do_page_fault+0x197/0x400
[733872.239300]  [<ffffffff8106a7c2>] do_page_fault+0x22/0x30
[733872.293001]  [<ffffffff81824178>] page_fault+0x28/0x30
[733872.345187] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49 8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff 
[733872.507088] RIP  [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
[733872.559965]  RSP <ffff8810cc0a3bd8>
[733872.673773] ---[ end trace aec37273a19e57dc ]---

In the ceph logs for node 1 there is:

./include/interval_set.h: 340: FAILED assert(0)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x56042ebdeceb]
 2: (()+0x4892b8) [0x56042e9512b8]
 3: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb2) [0x56042e97a8d2]
 4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x127) [0x56042e9646e7]
 5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x56042e9648b4]
 6: (ReplicatedPG::snap_trimmer()+0x52c) [0x56042e8eb5dc]
 7: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x56042e7807da]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x56042ebcf8d6]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x56042ebd0980]
 10: (()+0x8184) [0x7f27ecc66184]
 11: (clone()+0x6d) [0x7f27eb1d137d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
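
To follow that NOTE, a rough sketch of what we expect to run (this assumes
debug symbols for this exact ceph-osd build are available; the path and
<offset> below are placeholders, and the bracketed addresses in the trace
are runtime addresses, so the binary's load base would need to be
subtracted to get a file-relative offset):

$ objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis   # full disassembly with source interleaved
$ addr2line -Cfie /usr/bin/ceph-osd <offset>      # map one frame offset to a function and source line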

Unfortunately the only way we could get the processes to respond again
was to reboot the systems.

Is there any way of figuring out what went wrong here?
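
If it happens again we'll try to capture more state before rebooting. A
minimal sketch of what we have in mind, assuming we can still get the PID
of the hung ceph-osd from an earlier ps snapshot (which reads avoid the
hang is our assumption, not something we've verified):

$ cat /proc/<pid>/status                   # task state; we'd expect "D (disk sleep)"
$ sudo cat /proc/<pid>/stack               # kernel stack of the hung task
$ echo w | sudo tee /proc/sysrq-trigger    # dump all D-state tasks to the kernel log (sysrq must be enabled)
$ dmesg | tail -n 300                      # collect the resulting traces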

$ lsb_release -rd
Description:	Ubuntu 14.04.4 LTS
Release:	14.04

$ dpkg-query -W ceph
ceph	0.94.7-0ubuntu0.15.04.1~cloud0

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: canonical-bootstack

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1599681
