[Bug 1890334] Re: ceph: nautilus: backport fixes for msgr/eventcenter
Mauricio Faria de Oliveira
1890334 at bugs.launchpad.net
Mon Aug 24 16:56:42 UTC 2020
Steps to build the test case:
$ sudo add-apt-repository -s cloud-archive:train-proposed
$ sudo apt update
$ apt source ceph
$ cd ceph-14.2.11
$ sed 's/-DWITH_TESTS=OFF/-DWITH_TESTS=ON/' -i debian/rules
$ dch -l '+withtestson' 'Enable building tests.'
$ sudo apt build-dep -y ceph
$ DEB_BUILD_OPTIONS=parallel=20 dpkg-buildpackage -us -uc
$ ls -lh obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc
-rwxrwxr-x 1 ubuntu ubuntu 13M Aug 24 15:36 obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc
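Since the package build is long, it may be worth confirming that the sed step above really flipped the flag before running dpkg-buildpackage. A minimal sketch follows; the helper name check_tests_enabled is hypothetical, and it assumes the flag appears literally as -DWITH_TESTS=ON in the rules file.

```shell
# Hypothetical helper: succeed only if the given rules file enables tests
# and no longer contains the disabled flag.
check_tests_enabled() {
    rules_file=$1
    grep -q -- '-DWITH_TESTS=ON' "$rules_file" &&
        ! grep -q -- '-DWITH_TESTS=OFF' "$rules_file"
}
```

For example, 'check_tests_enabled debian/rules && echo "tests enabled"' could be run from the ceph-14.2.11 directory right after the sed command.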
** Attachment added: "ceph_test_rados_api_misc-v14.2.11.xz"
https://bugs.launchpad.net/cloud-archive/+bug/1890334/+attachment/5404188/+files/ceph_test_rados_api_misc-v14.2.11.xz
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1890334
Title:
ceph: nautilus: backport fixes for msgr/eventcenter
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
Fix Released
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Eoan:
Won't Fix
Status in ceph source package in Focal:
Fix Released
Status in ceph source package in Groovy:
Fix Released
Bug description:
[Impact]
* Ceph Nautilus/14 may hit daemon crashes in msgr/eventcenter
because it lacks backported fixes that properly protect the
connection close/reset/reuse paths when accessed by many threads.
* Once a daemon crashes, the cluster health becomes HEALTH_WARN,
and the status reports: "N daemons have recently crashed"
* Example:
$ juju run --unit ceph-mon/0 "sudo ceph -s"
cluster:
id: ...
health: HEALTH_WARN
1 daemons have recently crashed
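To check for this condition from a script, the count can be pulled out of the status text. This is only a sketch: the helper name recent_crash_count is hypothetical, and it assumes the warning line has exactly the "N daemons have recently crashed" wording shown in the example above.

```shell
# Hypothetical helper: read `ceph -s` output on stdin and print N from the
# "N daemons have recently crashed" line; prints 0 if no such line exists.
recent_crash_count() {
    awk '/daemons have recently crashed/ { n = $1 } END { print n + 0 }'
}
```

It could be fed from the command in the example, e.g. 'juju run --unit ceph-mon/0 "sudo ceph -s" | recent_crash_count'.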
[Fix]
* The backport patches in Ceph PR #33820 [1] fix this problem.
* The PR has 8 patches, but only 5 are strictly required
(3 relate to test cases/sanitizers, which the package does not use),
and 1 of those is already applied; so only 4 patches are actually
needed (the 'msg/async:' patches).
[1] https://github.com/ceph/ceph/pull/33820
[Test Case]
* The test-case patch in the PR is a reliable reproducer; it
can be applied and then built with -DWITH_TESTS=ON in d/rules;
the resulting binary is found in
'obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc'.
* On a test ceph cluster (e.g., 1 MON, 3 OSDs), on the mon node:
$ sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace
* With the original package, this hits segfaults with the stack
traces seen by the reporter, and other traces as well; with the
patched package, it reports no errors.
* Attached the test-case binary 'ceph_test_rados_api_misc' and
the juju bundle for the test ceph cluster 'ceph-lp1890334.yaml'.
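When running the reproducer from a script, the exit status can help tell a segfault from an ordinary test failure. The sketch below is an assumption-laden illustration, not part of the test case: the helper name describe_exit is hypothetical, and it relies on the common shell convention that a status above 128 means the process was killed by signal (status - 128), so 139 corresponds to SIGSEGV (11). It also assumes the status seen by the shell reflects how the test binary terminated.

```shell
# Hypothetical helper: describe a shell exit status.
# Statuses above 128 conventionally mean "killed by signal (status - 128)".
describe_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "pass"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "failed with status $status"
    fi
}
```

It would be called as 'describe_exit $?' immediately after the ceph_test_rados_api_misc command above.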
[Regression Potential]
* These patches change the connection close/reset/reuse logic,
so regressions would likely originate in those functions but
surface as errors in daemon communication.
* There are no further related fixes upstream.
[Other Info]
* Patches are already available in Ceph Octopus/15 on Focal.
* Not reporting against Eoan (Train) as it is EOL.
[Original Description]
Ceph Nautilus in bionic-train may hit daemon crashes (e.g., ceph-mgr)
in msgr/eventcenter as it lacks the following set of backported fixes:
https://github.com/ceph/ceph/pull/33820
Reporting the bug against UCA since Ubuntu Eoan (Train) is EOL.
Working on the debdiffs and tests.
Example stack trace as reported by 'ceph crash info' and GDB:
$ sudo ceph crash info <crash ID>
...
"process_name": "ceph-mgr",
...
"backtrace": [
"(()+0x128a0) [0x7f8e4ae928a0]",
"(bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x48a) [0x7f8e4bf4219a]",
"(ProtocolV2::write_message(Message*, bool)+0x4dd) [0x7f8e4bf249dd]",
"(ProtocolV2::write_event()+0x2c5) [0x7f8e4bf39d55]",
"(AsyncConnection::handle_write()+0x43) [0x7f8e4bef89e3]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xd57) [0x7f8e4bf51157]",
"(()+0x59b848) [0x7f8e4bf55848]",
"(()+0xbd6df) [0x7f8e4a9b06df]",
"(()+0x76db) [0x7f8e4ae876db]",
"(clone()+0x3f) [0x7f8e4a06da3f]"
]
...
(gdb) bt
#0 raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x000055b9deda9140 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:81
#2 handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326
#3 <signal handler called>
#4 ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer (session_stream_handlers=..., this=<optimized out>) at ./src/msg/async/frames_v2.h:273
#5 ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame> (this=this@entry=0x55b9e4830680, frame=...) at ./src/msg/async/ProtocolV2.cc:552
#6 0x00007f8e4bf249dd in ProtocolV2::write_message (this=this@entry=0x55b9e4830680, m=m@entry=0x55b9e596da40, more=more@entry=false)
at ./src/msg/async/ProtocolV2.cc:515
#7 0x00007f8e4bf39d55 in ProtocolV2::write_event (this=0x55b9e4830680) at ./src/msg/async/ProtocolV2.cc:627
#8 0x00007f8e4bef89e3 in AsyncConnection::handle_write (this=0x55b9e73ec480) at ./src/msg/async/AsyncConnection.cc:692
#9 0x00007f8e4bf51157 in EventCenter::process_events (this=this@entry=0x55b9e05502c0, timeout_microseconds=<optimized out>,
timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f8e466d5828) at ./src/msg/async/Event.cc:441
#10 0x00007f8e4bf55848 in NetworkStack::<lambda()>::operator() (__closure=0x55b9e05feff8) at ./src/msg/async/Stack.cc:53
#11 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/7/bits/std_function.h:316
#12 0x00007f8e4a9b06df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f8e4ae876db in start_thread (arg=0x7f8e466d8700) at pthread_create.c:463
#14 0x00007f8e4a06da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
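To compare crashes across daemons against the trace above, it can help to reduce the "backtrace" array from 'ceph crash info' to just the named frames. The following is a sketch only: the helper name backtrace_symbols is hypothetical, and it assumes each frame has the '"(symbol+0xOFFSET) [0xADDRESS]"' shape shown in the example, with anonymous frames written as '(()+0x...)'.

```shell
# Hypothetical helper: read `ceph crash info` output on stdin and print the
# symbol of each named backtrace frame, skipping anonymous "(()+0x...)" frames.
backtrace_symbols() {
    sed -n \
        -e '/^ *"(()+0x/d' \
        -e 's/^ *"(\(.*\)+0x[0-9a-f]*) \[0x[0-9a-f]*\]",\{0,1\}$/\1/p'
}
```

For example: 'sudo ceph crash info <crash ID> | backtrace_symbols'.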
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1890334/+subscriptions