[Bug 1921749] Re: nautilus: ceph radosgw beast frontend coroutine stack corruption

Mauricio Faria de Oliveira 1921749 at bugs.launchpad.net
Mon Mar 29 13:59:42 UTC 2021


coredump #4

Shorter stack trace reported in ceph logs than in GDB.

	Oct 23 16:41:28 HOSTNAME radosgw[4319]: *** Caught signal (Segmentation fault) **
	Oct 23 16:41:28 HOSTNAME radosgw[4319]:  in thread 7fb79e999700 thread_name:msgr-worker-2
	Oct 23 16:41:28 HOSTNAME radosgw[4319]:  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
	Oct 23 16:41:28 HOSTNAME radosgw[4319]:  1: (()+0x128a0) [0x7fb7a747d8a0]
	Oct 23 16:41:28 HOSTNAME radosgw[4319]:  2: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0xdb) [0x7fb7b223dbcb]
	Oct 23 16:41:28 HOSTNAME radosgw[4319]:  3: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x1b) [0x7fb7b223dc9b]
	Oct 23 16:41:28 HOSTNAME radosgw[4319]:  4: (cfree()+0x2d5) [0x7fb7b224c6f5]


	#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
	#1  0x00005565a2ff16b0 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:81
	#2  handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326
	#3  <signal handler called>
	#4  tcmalloc::SLL_Next (t=0x0) at src/linked_list.h:45
	#5  tcmalloc::SLL_PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=158, head=0x5565a3cd8bf0) at src/linked_list.h:76
	#6  tcmalloc::ThreadCache::FreeList::PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=158, this=0x5565a3cd8bf0) at src/thread_cache.h:225
	#7  tcmalloc::ThreadCache::ReleaseToCentralCache (this=this@entry=0x5565a3cd8a40, src=src@entry=0x5565a3cd8bf0, cl=<optimized out>, N=158, N@entry=273) at src/thread_cache.cc:195
	#8  0x00007fb7b223dc9b in tcmalloc::ThreadCache::ListTooLong (this=this@entry=0x5565a3cd8a40, list=0x5565a3cd8bf0, cl=<optimized out>) at src/thread_cache.cc:157
	#9  0x00007fb7b224c6f5 in tcmalloc::ThreadCache::Deallocate (cl=<optimized out>, ptr=0x5565a57f5c00, this=0x5565a3cd8a40) at src/thread_cache.h:387
	#10 (anonymous namespace)::do_free_helper (invalid_free_fn=0x7fb7b222cce0 <(anonymous namespace)::InvalidFree(void*)>, size_hint=0, use_hint=false, heap_must_be_valid=true, heap=0x5565a3cd8a40, ptr=0x5565a57f5c00) at src/tcmalloc.cc:1305
	#11 (anonymous namespace)::do_free_with_callback (invalid_free_fn=0x7fb7b222cce0 <(anonymous namespace)::InvalidFree(void*)>, size_hint=0, use_hint=false, ptr=0x5565a57f5c00) at src/tcmalloc.cc:1337
	#12 (anonymous namespace)::do_free (ptr=0x5565a57f5c00) at src/tcmalloc.cc:1345
	#13 tc_free (ptr=0x5565a57f5c00) at src/tcmalloc.cc:1610
	#14 0x00007fb7b1fca164 in __gnu_cxx::new_allocator<OSDOp>::deallocate (this=0x5565a5bf0880, __p=<optimized out>) at /usr/include/c++/7/ext/new_allocator.h:125
	#15 std::allocator_traits<std::allocator<OSDOp> >::deallocate (__a=..., __n=<optimized out>, __p=<optimized out>) at /usr/include/c++/7/bits/alloc_traits.h:462
	#16 std::_Vector_base<OSDOp, std::allocator<OSDOp> >::_M_deallocate (this=0x5565a5bf0880, __n=<optimized out>, __p=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:180
	#17 std::_Vector_base<OSDOp, std::allocator<OSDOp> >::~_Vector_base (this=0x5565a5bf0880, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:162
	#18 std::vector<OSDOp, std::allocator<OSDOp> >::~vector (this=0x5565a5bf0880, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435
	#19 MOSDOp::~MOSDOp (this=0x5565a5bf0600, __in_chrg=<optimized out>) at ./src/messages/MOSDOp.h:195
	#20 MOSDOp::~MOSDOp (this=0x5565a5bf0600, __in_chrg=<optimized out>) at ./src/messages/MOSDOp.h:195
	#21 0x00007fb7a8ca6db7 in RefCountedObject::put (this=0x5565a5bf0600) at ./src/common/RefCountedObj.h:64
	#22 0x00007fb7a8f42d30 in ProtocolV2::write_message (this=this@entry=0x5565a5776000, m=m@entry=0x5565a5bf0600, more=more@entry=false) at ./src/msg/async/ProtocolV2.cc:571
	#23 0x00007fb7a8f56f0b in ProtocolV2::write_event (this=0x5565a5776000) at ./src/msg/async/ProtocolV2.cc:658
	#24 0x00007fb7a8f16263 in AsyncConnection::handle_write (this=0x5565a5763b00) at ./src/msg/async/AsyncConnection.cc:692
	#25 0x00007fb7a8f6a757 in EventCenter::process_events (this=this@entry=0x5565a43f2e00, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7fb79e996be8) at ./src/msg/async/Event.cc:441
	#26 0x00007fb7a8f6ee48 in NetworkStack::<lambda()>::operator() (__closure=0x5565a44c3958) at ./src/msg/async/Stack.cc:53
	#27 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
	#28 0x00007fb7a719f6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
	#29 0x00007fb7a74726db in start_thread (arg=0x7fb79e999700) at pthread_create.c:463
	#30 0x00007fb7a685ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Again, interaction between SLL_Pop(Range) and SLL_Next.

	#4  tcmalloc::SLL_Next (t=0x0) at src/linked_list.h:45
	#5  tcmalloc::SLL_PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=158, head=0x5565a3cd8bf0) at src/linked_list.h:76

Same as previous 2 cases, same function/instruction/register/pointer:

	(gdb) f 4

	(gdb) x/i $rip
	=> 0x7fb7b223dbcb <tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+219>:     mov    (%rdx),%rdx

	(gdb) x $rdx
	   0x0: Cannot access memory at address 0x0
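
The faulting `mov (%rdx),%rdx` with `rdx` equal to 0 is consistent with a free-list walk that reaches a null "next" link before N nodes have been visited, which suggests that the list's recorded length no longer matches the actual chain. The sketch below illustrates that failure mode with a simplified intrusive singly linked list. It is modeled only loosely on the `SLL_*` helpers named in the trace and is not the gperftools source; `sll_reachable` and `sll_pop_range_checked` are hypothetical helpers added for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Simplified free-list helpers (illustration only, not the gperftools code).
// Each free object stores the pointer to the next free object in its first word.
inline void* sll_next(void* t) {
    return *reinterpret_cast<void**>(t);  // faults if t is null, as in frame #4
}

inline void sll_push(void** head, void* element) {
    assert(element != nullptr);
    *reinterpret_cast<void**>(element) = *head;
    *head = element;
}

// Hypothetical diagnostic: count the nodes that are really reachable,
// stopping at the first null link or after `limit` nodes.
inline std::size_t sll_reachable(void* head, std::size_t limit) {
    std::size_t n = 0;
    for (void* p = head; p != nullptr && n < limit; p = sll_next(p)) {
        ++n;
    }
    return n;
}

// Hypothetical checked pop-range: detach the first n nodes, but refuse
// (return false, list untouched) when the chain is shorter than n instead
// of dereferencing a null link.
inline bool sll_pop_range_checked(void** head, std::size_t n,
                                  void** start, void** end) {
    *start = nullptr;
    *end = nullptr;
    if (n == 0) return true;
    if (sll_reachable(*head, n) < n) return false;

    void* last = *head;
    for (std::size_t i = 1; i < n; ++i) {
        last = sll_next(last);
    }
    *start = *head;
    *end = last;
    *head = sll_next(last);                      // remainder of the list
    *reinterpret_cast<void**>(last) = nullptr;   // terminate detached range
    return true;
}
```

With a three-node list, asking for 158 nodes makes the checked variant return false, whereas an unchecked walk would dereference a null pointer on the fourth step, which matches the shape of frames #4 and #5.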

https://bugs.launchpad.net/bugs/1921749

Title:
  nautilus: ceph radosgw beast frontend coroutine stack corruption

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Confirmed
Status in ceph package in Ubuntu:
  Fix Released

Bug description:
  [Impact]

  The radosgw beast frontend in ceph nautilus might hit coroutine stack
  corruption at startup and while handling requests.

  This is usually observed right at the startup of the ceph-radosgw systemd unit; sometimes 1 minute later.
  But it might occur at any time while handling requests, depending on the coroutine/request's function path and stack size.

  The symptoms are usually a crash with a stack trace listing TCMalloc (de)allocate/release to central cache,
  but less common signs are large allocs in the _terabytes_ range (pointer to stack used as allocation size)
  and stack traces showing function return addresses (RIP) that are actually pointers to a stack address.
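
  To see why a stack pointer misread as an allocation size lands in the terabytes range, the
  arithmetic can be sketched as follows. The example value is the stack address shown as
  working_dur in frame #25 of the trace in this message; using it here is an assumption made
  for illustration only, not a size actually observed in a failed allocation.

```cpp
#include <cstdint>

// Interpret a 64-bit value as a byte count and express it in TiB (2^40 bytes).
inline double bytes_to_tib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(1ULL << 40);
}

// Illustrative value only: the stack address from frame #25
// (working_dur=0x7fb79e996be8). It is an assumption that the bogus
// sizes seen in practice have a similar magnitude.
constexpr std::uint64_t kExampleStackAddress = 0x7fb79e996be8ULL;
```

  Calling bytes_to_tib(kExampleStackAddress) gives roughly 127.7 TiB, since user-space stack
  addresses on x86_64 Linux typically start with 0x7f in the top byte of the 48-bit address.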

  This is not widely hit in Ubuntu as most deployments use the ceph-radosgw charm that hardcodes 'civetweb'
  as rgw frontend, which is _not_ affected; custom/cephadm deployments that choose 'beast' might hit this.

    @ charm-ceph-radosgw/templates/ceph.conf
          rgw frontends = civetweb port={{ port }}

  Let's report this LP bug for documentation and tracking purposes until
  UCA gets the fixes.

  [Fix]

  This has been reported by an Ubuntu Advantage user, and by another user in ceph tracker #47910 [1].
  It had already been reported and fixed in Octopus [2] (the UA user confirmed that Octopus is no longer affected).

  The Nautilus backport has recently been merged [3, 4] and should be
  available in v14.2.19.

  [Test Case]

  The conditions to trigger the bug aren't clear, but appear to be related to EC pools with very large buckets,
  and of course to the radosgw beast frontend being enabled (civetweb is not affected).

  [Where problems could occur]

  The fixes are restricted to the beast frontend, specifically to the coroutines used to handle requests.
  So problems would probably be seen in request handling only with the beast frontend.
  Workarounds thus include switching back to the civetweb frontend.
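
  The workaround amounts to a one-line change in ceph.conf. A minimal sketch, assuming a
  hypothetical gateway section name (client.rgw.gateway1) and the default RGW port 7480;
  both should be adjusted to the actual deployment:

```ini
[client.rgw.gateway1]
# beast frontend (affected until the fixed packages are installed):
# rgw frontends = beast port=7480
# civetweb frontend (not affected):
rgw frontends = civetweb port=7480
```

  The radosgw service has to be restarted for the frontend change to take effect.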

  The fixes change core/base parts of the RGW beast frontend code, but they have been in place since Octopus was released.
  The other user/reporter in the ceph tracker has been using the patches for weeks with no regression;
  the ceph tests have passed, and serious issues would likely be caught by ceph CI upstream.

  [1] https://tracker.ceph.com/issues/47910 report tracker (nautilus)
  [2] https://tracker.ceph.com/issues/43739 master tracker (octopus)
  [3] https://tracker.ceph.com/issues/43921 backport tracker (nautilus)
  [4] https://github.com/ceph/ceph/pull/39947 github PR
