[Bug 1783203] Re: Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment

Dr. Jens Harbott j.harbott at x-ion.de
Mon Feb 25 11:54:42 UTC 2019


We are also seeing this issue after upgrading OpenStack from Pike to
Queens. It only seems to affect our larger setups; it was not seen
during testing on our staging setup. The good news is that I can
confirm that disabling the management plugin seems to avoid the issue.
We also have a core dump, but since it is from a production
environment, I cannot share its contents. However, the trace from
thread 1 looks like

#0  0x00007fb09cd565d3 in select () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000000563c00 in erts_sys_main_thread ()
#2  0x0000000000469860 in erl_start ()
#3  0x000000000042f389 in main ()

and the other threads all seem to be at

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00000000005c15ed in ethr_event_wait ()
#2  0x000000000051e0a5 in ?? ()
#3  0x00000000005c0da5 in ?? ()
#4  0x00007fb09d2326ba in start_thread (arg=0x7fb098ce6700) at pthread_create.c:333
#5  0x00007fb09cd6041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Binary is /usr/lib/erlang/erts-7.3/bin/beam.smp from erlang-base
1:18.3-dfsg-1ubuntu3.1.
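For anyone wanting to try the same workaround, a sketch of disabling the
management plugin per node with the standard rabbitmq-plugins tool (run
on each cluster node; exact procedure may differ in charm-managed
deployments, where the charm can re-enable plugins):

```shell
# Disable the management plugin on this node; the broker keeps running,
# only the plugin (and its HTTP API, by default on port 15672) stops.
rabbitmq-plugins disable rabbitmq_management

# Verify it no longer appears among the explicitly enabled plugins
rabbitmq-plugins list -e
```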

Please let me know if you need further information, you can also reach
me as "frickler" in #ubuntu-server.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to rabbitmq-server in Ubuntu.
https://bugs.launchpad.net/bugs/1783203

Title:
  Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment

Status in OpenStack rabbitmq-server charm:
  New
Status in rabbitmq-server package in Ubuntu:
  Confirmed

Bug description:
  While performing an OpenStack release upgrade from Pike to Queens
  following the charmers guide, we had upgraded Ceph-* and MySQL.
  After setting source=cloud:xenial-queens on the rabbitmq-server charm
  and letting the cluster re-stabilize, the rabbitmq beam processes
  lock up on one cluster node, causing a complete denial of service on
  the openstack vhost across all 3 members of the cluster. Killing the
  beam process on that node causes another node to lock up within a
  short timeframe.

  We have reproduced this twice in the same environment by re-deploying
  a fresh Pike rabbitmq cluster and upgrading to Queens. The issue is
  not reproducible with generic workloads such as creating/deleting
  nova instances and creating/attaching/detaching cinder volumes;
  however, when running a full heat stack, we can reproduce it.

  This is happening on two of the three clouds at this site when RMQ is
  upgraded to Queens. The third cloud was able to upgrade to Queens
  without issue, but it was upgraded on 18.02 charms. Heat templates
  forthcoming.

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1783203/+subscriptions
