[Bug 1599681] [NEW] ceph-osd process hung and blocked ps listings
Brad Marshall
1599681 at bugs.launchpad.net
Thu Jul 7 01:08:07 UTC 2016
Public bug reported:
We ran into a situation over the past couple of days where two
different ceph-osd nodes crashed in such a way that they caused ps
listings to hang when enumerating the process. Both had a kernel call
trace associated with them:
Node 1:
Jul 4 07:46:15 provider-cs-03 kernel: [4188396.493011] ceph-osd D ffff882029a67b90 0 5312 1 0x00000004
Jul 4 07:46:15 provider-cs-03 kernel: [4188396.590564] ffff882029a67b90 ffff881037cb8000 ffff8820284f3700 ffff882029a68000
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.688603] ffff88203296e5a8 ffff88203296e5c0 0000000000000015 ffff8820284f3700
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.789329] ffff882029a67ba8 ffffffff817ec495 ffff8820284f3700 ffff882029a67bf8
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.891376] Call Trace:
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.939271] [<ffffffff817ec495>] schedule+0x35/0x80
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.989957] [<ffffffff817eeb6a>] rwsem_down_read_failed+0xea/0x120
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.041502] [<ffffffff813dbd84>] call_rwsem_down_read_failed+0x14/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.092616] [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.141510] [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.189877] [<ffffffff817ee1f0>] ? down_read+0x20/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.237513] [<ffffffff81067f18>] __do_page_fault+0x398/0x430
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.285588] [<ffffffff81067fd2>] do_page_fault+0x22/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.332936] [<ffffffff817f1e78>] page_fault+0x28/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.379495] [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.426400] [<ffffffff81039c58>] copy_fpstate_to_sigframe+0x118/0x1d0
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.474904] [<ffffffff8102d1fd>] get_sigframe.isra.7.constprop.9+0x12d/0x150
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.563204] [<ffffffff8102d698>] do_signal+0x1e8/0x6d0
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.609783] [<ffffffff816d19f2>] ? __sys_sendmsg+0x42/0x80
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.656633] [<ffffffff811b2ed0>] ? handle_mm_fault+0x250/0x540
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.703785] [<ffffffff8107884c>] exit_to_usermode_loop+0x59/0xa2
Jul 4 07:46:17 provider-cs-03 kernel: [4188397.751367] [<ffffffff81003a6e>] syscall_return_slowpath+0x4e/0x60
Jul 4 07:46:17 provider-cs-03 kernel: [4188397.799369] [<ffffffff817efe58>] int_ret_from_sys_call+0x25/0x8f
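For reference, the Node 1 trace shows the task blocked in rwsem_down_read_failed, which we assume is its own mmap_sem; ps likely hangs because reading /proc/<pid>/cmdline needs that same semaphore. Reading /proc/<pid>/stat does not, so as a rough workaround D-state tasks can still be enumerated with something like:

```shell
# Sketch (assumes the hang is mmap_sem-related): list tasks in
# uninterruptible sleep (state D) via /proc/<pid>/stat, which unlike
# /proc/<pid>/cmdline does not need the task's mmap_sem.
for stat in /proc/[0-9]*/stat; do
    # fields of /proc/<pid>/stat: pid (comm) state ...
    read -r pid comm state _ < "$stat" 2>/dev/null || continue
    case "$state" in
        D) echo "$pid $comm" ;;   # uninterruptible sleep
    esac
done
```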
Node 2:
[733869.727139] CPU: 17 PID: 1735127 Comm: ceph-osd Not tainted 4.4.0-15-generic #31-Ubuntu
[733869.796954] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.3.6 06/03/2015
[733869.927182] task: ffff881841dc6e00 ti: ffff8810cc0a0000 task.ti: ffff8810cc0a0000
[733870.059139] RIP: 0010:[<ffffffff810b479d>] [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
[733870.192753] RSP: 0000:ffff8810cc0a3bd8 EFLAGS: 00010257
[733870.260298] RAX: 0000000000000000 RBX: ffff8810cc0a3c78 RCX: 0000000000000012
[733870.389322] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8810210a0e00
[733870.517883] RBP: ffff8810cc0a3c40 R08: 0000000000000006 R09: 000000000000013e
[733870.646335] R10: 00000000000003b4 R11: 000000000000001f R12: ffff881018118000
[733870.774514] R13: 0000000000000006 R14: ffff8810210a0e00 R15: 0000000000000379
[733870.902262] FS: 00007fdcfab03700(0000) GS:ffff88203e600000(0000) knlGS:0000000000000000
[733871.031347] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[733871.097820] CR2: 00007fdcfab02c20 CR3: 0000001029204000 CR4: 00000000001406e0
[733871.223381] Stack:
[733871.282453] ffff8810cc0a3c40 ffffffff811f04ce ffff88102f6e9680 0000000000000012
[733871.404947] 0000000000000077 000000000000008f 0000000000016d40 0000000000000006
[733871.527250] ffff881841dc6e00 ffff8810cc0a3c78 00000000000001ac 00000000000001b8
[733871.649648] Call Trace:
[733871.707884] [<ffffffff811f04ce>] ? migrate_page_copy+0x21e/0x530
[733871.770946] [<ffffffff810b501e>] task_numa_migrate+0x43e/0x9b0
[733871.832808] [<ffffffff811c9700>] ? page_add_anon_rmap+0x10/0x20
[733871.893897] [<ffffffff810b5609>] numa_migrate_preferred+0x79/0x80
[733871.954283] [<ffffffff810b9c24>] task_numa_fault+0x7f4/0xd40
[733872.013128] [<ffffffff811bdf90>] handle_mm_fault+0xbc0/0x1820
[733872.071309] [<ffffffff81101420>] ? do_futex+0x120/0x500
[733872.128149] [<ffffffff812288c5>] ? __fget_light+0x25/0x60
[733872.184044] [<ffffffff8106a537>] __do_page_fault+0x197/0x400
[733872.239300] [<ffffffff8106a7c2>] do_page_fault+0x22/0x30
[733872.293001] [<ffffffff81824178>] page_fault+0x28/0x30
[733872.345187] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49 8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff
[733872.507088] RIP [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
[733872.559965] RSP <ffff8810cc0a3bd8>
[733872.673773] ---[ end trace aec37273a19e57dc ]---
In the ceph logs for node 1 there is:
./include/interval_set.h: 340: FAILED assert(0)
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x56042ebdeceb]
2: (()+0x4892b8) [0x56042e9512b8]
3: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb2) [0x56042e97a8d2]
4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x127) [0x56042e9646e7]
5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x56042e9648b4]
6: (ReplicatedPG::snap_trimmer()+0x52c) [0x56042e8eb5dc]
7: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x56042e7807da]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x56042ebcf8d6]
9: (ThreadPool::WorkThread::entry()+0x10) [0x56042ebd0980]
10: (()+0x8184) [0x7f27ecc66184]
11: (clone()+0x6d) [0x7f27eb1d137d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Unfortunately the only way we could get the processes to respond again
was to reboot the systems.
Is there any way of figuring out what went wrong here?
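For what it's worth, if this happens again, a few things could be collected before rebooting (pid 5312 from the Node 1 trace used as a placeholder; all of this needs root):

```shell
# Kernel-side stack of the stuck task (often enough to see which
# lock it is waiting on):
cat /proc/5312/stack

# Ask the kernel to log the stacks of all blocked (D-state) tasks
# to the kernel log, then read them back:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100

# The hung-task detector will also warn automatically once a task
# has been blocked for this many seconds (0 disables the check):
sysctl kernel.hung_task_timeout_secs
```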
$ lsb_release -rd
Description: Ubuntu 14.04.4 LTS
Release: 14.04
$ dpkg-query -W ceph
ceph 0.94.7-0ubuntu0.15.04.1~cloud0
** Affects: ceph (Ubuntu)
Importance: Undecided
Status: New
** Tags: canonical-bootstack
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1599681
Title:
ceph-osd process hung and blocked ps listings
Status in ceph package in Ubuntu:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1599681/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list