[Bug 2089565] Re: MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS

Maksym Medvied 2089565 at bugs.launchpad.net
Sat Dec 21 20:14:04 UTC 2024


git clone https://git.launchpad.net/ubuntu/+source/ceph
cd ceph

> git grep -n MDSMap::decode
src/mds/FSMap.cc:1086:   * Insert INLINE; see comment in MDSMap::decode.
src/mds/MDSMap.cc:836:void MDSMap::decode(bufferlist::const_iterator& p)

So we're interested in src/mds/MDSMap.cc (if the file was not renamed
and the function was not moved).

Let's get the file for 2 different revisions, extract MDSMap::decode()
function from both and then compare to see the difference.

> git tag | grep 19.2.0-0ubuntu0.24.04.1
applied/19.2.0-0ubuntu0.24.04.1
import/19.2.0-0ubuntu0.24.04.1
> git show applied/19.2.0-0ubuntu0.24.04.1:src/mds/MDSMap.cc > /tmp/MDSMap.cc.new

The old version is 19.2.0~git20240301.4c76c50-0ubuntu6, the closest tag
(by name) in the repo is applied/19.2.0_git20240301.4c76c50-0ubuntu6:

> git show applied/19.2.0_git20240301.4c76c50-0ubuntu6:src/mds/MDSMap.cc > /tmp/MDSMap.cc.old
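
To compare only MDSMap::decode() rather than the whole file, the function
can be cut out of each revision first. A minimal sketch of that step,
assuming the function's closing brace sits in column 0; extract_func is a
hypothetical helper name, and the small sample file stands in for
MDSMap.cc so that the sketch runs without the ceph tree:

```shell
#!/bin/sh
# Print one top-level C++ function: from the line that starts with the
# given signature up to the first line that starts with "}".
# Assumption: the function's closing brace is in column 0.
extract_func() {
  awk -v sig="$1" '
    index($0, sig) == 1 { printing = 1 }
    printing            { print }
    printing && /^}/    { exit }
  ' "$2"
}

# Hypothetical stand-in for MDSMap.cc.
sample=$(mktemp)
cat > "$sample" <<'EOF'
void MDSMap::encode(bufferlist& bl)
{
  encode(ev, bl);
}
void MDSMap::decode(bufferlist::const_iterator& p)
{
  decode(ev, p);
}
void MDSMap::dump()
{
}
EOF

extracted=$(extract_func 'void MDSMap::decode(' "$sample")
printf '%s\n' "$extracted"
rm -f "$sample"
```

With the real files, the same helper could be run on /tmp/MDSMap.cc.old
and /tmp/MDSMap.cc.new, with each output saved to its own file and the two
outputs passed to diff -u.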

After running diff for the files we see that both encode and decode
functions were changed. This is the relevant part for the decode
function:

> diff -u /tmp/MDSMap.cc.old /tmp/MDSMap.cc.new
...
@@ -852,7 +863,8 @@
     decode(cas_pool, p);
   }
 
-  // kclient ignores everything from here
+  // kclient skips most of what's below
+  // see fs/ceph/mdsmap.c for current decoding
   __u16 ev = 1;
   if (struct_v >= 2)
     decode(ev, p);
@@ -949,11 +961,16 @@
   }
 
   if (ev >= 17) {
-    decode(max_xattr_size, p);
+    decode(bal_rank_mask, p);
   }
 
   if (ev >= 18) {
-    decode(bal_rank_mask, p);
+    decode(max_xattr_size, p);
+  }
+
+  if (ev >= 19) {
+    decode(qdb_cluster_leader, p);
+    decode(qdb_cluster_members, p);
   }
 
   /* All MDS since at least v14.0.0 understand INLINE */

We see that both the order and the number of fields changed in the
decode() function, and there doesn't seem to be any error handling for
the case where the stored data doesn't match the expected format.
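
To see why swapping two fields can end in the string copy() failure from
the stack trace (frames 9 and 10), here is a small sketch of the byte-level
effect. It rests on assumptions the diff itself does not confirm: that
max_xattr_size is encoded as a little-endian 64-bit integer, that
bal_rank_mask is a string encoded as a 32-bit length followed by its bytes,
and that the stored value of max_xattr_size is 65536 (a hypothetical value):

```shell
#!/bin/sh
# Assumed layout: the older encoder wrote max_xattr_size (64-bit,
# little-endian) where the newer decoder expects bal_rank_mask
# (32-bit length, then that many bytes). 65536 is a hypothetical value.
bytes=$(printf '\000\000\001\000\000\000\000\000' | od -An -tu1)

# Reinterpret the first four bytes as the little-endian string length.
set -- $bytes
claimed_len=$(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
available=$(( $# - 4 ))

echo "string length claimed by the misread field: $claimed_len"
echo "bytes left in this field after the length:  $available"
if [ "$claimed_len" -gt "$available" ]; then
  echo "decoder would try to read past the end of this field"
fi
```

In a real map more data follows this field, but the copy still fails
whenever fewer bytes remain in the buffer than the misread length claims.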

Now let's explore the binary to see where exactly the crash happens in
MDSMap::decode().

We have the ceph-mon binary extracted earlier. We could load it in gdb,
which should provide disassembled versions of the functions. We could
also try to load debuginfo and put the source tree in the right place to
get even better symbols and source references.

> gdb ./usr/bin/ceph-mon
...
This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) y
...
(gdb) start
Downloading source file /usr/src/ceph-19.2.0-0ubuntu0.24.04.1/src/ceph_mon.cc
Temporary breakpoint 1 at 0x32c670: file /usr/src/ceph-19.2.0-0ubuntu0.24.04.1/src/ceph_mon.cc, line 250.
...
Temporary breakpoint 1, main (argc=1, argv=0x7fffffffdf98)
    at /usr/src/ceph-19.2.0-0ubuntu0.24.04.1/src/ceph_mon.cc:250
warning: 250    /usr/src/ceph-19.2.0-0ubuntu0.24.04.1/src/ceph_mon.cc: No such file or directory
(gdb)

Now we know that it's looking for the source tree in
/usr/src/ceph-19.2.0-0ubuntu0.24.04.1/. Let's put the tree there. If apt
cannot find the source package, first add "deb-src" after "deb" (so it
becomes "deb deb-src") in /etc/apt/sources.list.d/ubuntu.sources:

> cd /usr/src/
> sudo apt source ceph
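
The "deb deb-src" change mentioned above can also be made with sed. A
minimal sketch, run on a stand-in copy so that nothing under /etc is
touched; the URIs, Suites and Components values in the sample are
hypothetical and only the "Types:" line matters here:

```shell
#!/bin/sh
# Stand-in for /etc/apt/sources.list.d/ubuntu.sources (editing the real
# file needs root). The values below are hypothetical sample content.
src=$(mktemp)
cat > "$src" <<'EOF'
Types: deb
URIs: http://archive.ubuntu.com/ubuntu/
Suites: noble noble-updates
Components: main universe
EOF

# Turn "Types: deb" into "Types: deb deb-src". Lines that already list
# deb-src do not match the pattern and are left alone.
sed 's/^Types: deb$/Types: deb deb-src/' "$src" > "$src.new"
types_line=$(grep '^Types:' "$src.new")
echo "$types_line"
rm -f "$src" "$src.new"
```

On the real file the same substitution would have to be run as root, and
"sudo apt update" is needed afterwards before "apt source" can see the
source package.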

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/2089565

Title:
  MON and MDS crash upgrading  CEPH  on ubuntu 24.04 LTS

Status in ceph package in Ubuntu:
  Confirmed

Bug description:
  This issue is a continuation of
  https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515

  
  On Ubuntu 24.04 lts we did upgrade Ceph to  19.2.0-0ubuntu0.24.04.1

  Previous release is : 19.2.0~git20240301.4c76c50-0ubuntu6

  whenever  upgrading (tested on 2 different clusters)  the ceph-mon
  ends up crashing repeatedly with the below stack error

  ```
   ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
   1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
   2: pthread_kill()
   3: gsignal()
   4: abort()
   5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
   6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
   7: (std::unexpected()+0) [0x7884096a5a55]
   8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
   9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
   10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
   11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
   12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
   13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
   14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
   15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
   16: main()
   17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
   18: __libc_start_main()
   19: _start()
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  ```

  
  mitigation:
  a rollback to the previous release 19.2.0~git20240301.4c76c50-0ubuntu6 is still possible to restore service

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2089565/+subscriptions



