[Bug 1890334] Re: ceph: nautilus: backport fixes for msgr/eventcenter

Mauricio Faria de Oliveira 1890334 at bugs.launchpad.net
Mon Aug 24 16:56:42 UTC 2020


Steps to build the test case:

$ sudo add-apt-repository -s cloud-archive:train-proposed
$ sudo apt update

$ apt source ceph
$ cd ceph-14.2.11

$ sed 's/-DWITH_TESTS=OFF/-DWITH_TESTS=ON/' -i debian/rules
$ dch -l '+withtestson' 'Enable building tests.'

$ sudo apt build-dep -y ceph
$ DEB_BUILD_OPTIONS=parallel=20 dpkg-buildpackage -us -uc

$ ls -lh obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc
-rwxrwxr-x 1 ubuntu ubuntu 13M Aug 24 15:36 obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc
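
One caveat with the sed step above: sed exits successfully even when the
pattern does not match, so a changed debian/rules could leave the tests
disabled without any error. A minimal sketch of a guard follows; the
function name enable_tests is hypothetical and not part of the packaging.

```shell
# Hypothetical helper (not part of the ceph packaging): flip WITH_TESTS in
# the given rules file, then confirm the flag is really present afterwards,
# because sed returns 0 even when nothing was substituted.
enable_tests() {
    rules_file="$1"
    sed 's/-DWITH_TESTS=OFF/-DWITH_TESTS=ON/' -i "$rules_file" || return 1
    if grep -q -- '-DWITH_TESTS=ON' "$rules_file"; then
        echo "tests enabled in $rules_file"
    else
        echo "-DWITH_TESTS=ON not found in $rules_file" >&2
        return 1
    fi
}

# Usage, from the unpacked source tree:
#   enable_tests debian/rules
```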


** Attachment added: "ceph_test_rados_api_misc-v14.2.11.xz"
   https://bugs.launchpad.net/cloud-archive/+bug/1890334/+attachment/5404188/+files/ceph_test_rados_api_misc-v14.2.11.xz

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1890334

Title:
  ceph: nautilus: backport fixes for msgr/eventcenter

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Eoan:
  Won't Fix
Status in ceph source package in Focal:
  Fix Released
Status in ceph source package in Groovy:
  Fix Released

Bug description:
  [Impact]

   * Ceph Nautilus/14 may hit daemon crashes in msgr/eventcenter,
     as it lacks the backported fixes that properly protect the
     connection close/reset/reuse paths when run by many threads.

   * Once a daemon crash occurs, the cluster health becomes HEALTH_WARN,
     and the status reports: "N daemons have recently crashed"

   * Example:
  	
      $ juju run --unit ceph-mon/0 "sudo ceph -s"
      cluster:
      id: ...
      health: HEALTH_WARN
      1 daemons have recently crashed
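
A check for this warning can be scripted. The sketch below assumes the
status line has exactly the wording shown in the example above; the
function name recent_crash_count is hypothetical.

```shell
# Hypothetical helper: read `ceph -s` output on stdin and print the number
# from the "N daemons have recently crashed" line, or 0 if no such line
# is present. The wording is taken from the example output above.
recent_crash_count() {
    count=$(sed -n 's/^[[:space:]]*\([0-9][0-9]*\) daemons have recently crashed.*/\1/p' | head -n 1)
    echo "${count:-0}"
}

# Usage:
#   juju run --unit ceph-mon/0 "sudo ceph -s" | recent_crash_count
```

A non-zero count can then be followed up with 'ceph crash info <crash ID>'.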

  [Fix]

   * The backport patches in Ceph PR #33820 [1] fix this problem.

   * The PR has 8 patches, but only 5 are strictly required
     (3 relate to test cases/sanitizers, which the package does not
     use), and 1 of those 5 is already applied; so only 4 patches
     are actually needed (the 'msg/async:' patches).

    [1] https://github.com/ceph/ceph/pull/33820

  [Test Case]

   * The test-case patch in the PR is a reliable reproducer; it
     can be applied and then built with -DWITH_TESTS=ON in d/rules.
     The resulting binary is found in
     'obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc'.

   * On a test Ceph cluster (e.g., 1 MON, 3 OSDs), run on the MON node:

     $ sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
      ./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace

   * With the original package, this hits segfaults with the stack
     traces seen by the reporter, and other traces as well; with the
     patched package, it reports no errors.

   * Attached the test-case binary 'ceph_test_rados_api_misc' and
     the juju bundle for the test ceph cluster 'ceph-lp1890334.yaml'.
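
The test name (ShutdownRace) suggests a timing-dependent failure, so when
verifying the patched package it may be worth repeating the run rather
than trusting one clean pass. A minimal sketch follows; the function name
count_failures and the run count of 10 are illustrative assumptions.

```shell
# Hypothetical helper: run a command N times and report how many runs
# exited non-zero (a segfault also yields a non-zero exit status).
count_failures() {
    runs="$1"; shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}

# Usage, on the MON node (10 is an arbitrary choice):
#   count_failures 10 sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
#     ./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace
```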
   
  [Regression Potential]

   * These patches change the connection close/reset/reuse logic,
     so regressions would likely originate in those functions but
     actually surface as errors in daemon communication.

   * There are no further related fixes upstream.

  [Other Info]

   * The patches are already available in Ceph Octopus/15 on Focal.
   * Not reporting against Eoan (Train) as it is EOL.

  [Original Description]

  Ceph Nautilus in bionic-train may hit daemon crashes (e.g., ceph-mgr)
  in msgr/eventcenter as it lacks the following set of backported fixes:

    https://github.com/ceph/ceph/pull/33820

  Reporting the bug against UCA since Ubuntu Eoan (Train) is EOL.
  Working on the debdiffs and tests.

  Example stack trace as reported by 'ceph crash info' and GDB:

  $ sudo ceph crash info <crash ID>
  ...
      "process_name": "ceph-mgr",
  ...
      "backtrace": [
          "(()+0x128a0) [0x7f8e4ae928a0]",
          "(bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x48a) [0x7f8e4bf4219a]",
          "(ProtocolV2::write_message(Message*, bool)+0x4dd) [0x7f8e4bf249dd]",
          "(ProtocolV2::write_event()+0x2c5) [0x7f8e4bf39d55]",
          "(AsyncConnection::handle_write()+0x43) [0x7f8e4bef89e3]",
          "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xd57) [0x7f8e4bf51157]",
          "(()+0x59b848) [0x7f8e4bf55848]",
          "(()+0xbd6df) [0x7f8e4a9b06df]",
          "(()+0x76db) [0x7f8e4ae876db]",
          "(clone()+0x3f) [0x7f8e4a06da3f]"
      ]
  ...

  (gdb) bt
  #0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
  #1  0x000055b9deda9140 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:81
  #2  handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326
  #3  <signal handler called>
  #4  ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer (session_stream_handlers=..., this=<optimized out>) at ./src/msg/async/frames_v2.h:273
  #5  ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame> (this=this@entry=0x55b9e4830680, frame=...) at ./src/msg/async/ProtocolV2.cc:552
  #6  0x00007f8e4bf249dd in ProtocolV2::write_message (this=this@entry=0x55b9e4830680, m=m@entry=0x55b9e596da40, more=more@entry=false)
      at ./src/msg/async/ProtocolV2.cc:515
  #7  0x00007f8e4bf39d55 in ProtocolV2::write_event (this=0x55b9e4830680) at ./src/msg/async/ProtocolV2.cc:627
  #8  0x00007f8e4bef89e3 in AsyncConnection::handle_write (this=0x55b9e73ec480) at ./src/msg/async/AsyncConnection.cc:692
  #9  0x00007f8e4bf51157 in EventCenter::process_events (this=this@entry=0x55b9e05502c0, timeout_microseconds=<optimized out>,
      timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f8e466d5828) at ./src/msg/async/Event.cc:441
  #10 0x00007f8e4bf55848 in NetworkStack::<lambda()>::operator() (__closure=0x55b9e05feff8) at ./src/msg/async/Stack.cc:53
  #11 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
      at /usr/include/c++/7/bits/std_function.h:316
  #12 0x00007f8e4a9b06df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
  #13 0x00007f8e4ae876db in start_thread (arg=0x7f8e466d8700) at pthread_create.c:463
  #14 0x00007f8e4a06da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1890334/+subscriptions


