[Bug 1993149] Re: VMs stay stuck in scheduling when rabbitmq leader unit is down

Mon Jun 12 02:22:06 UTC 2023

Reviewed:  https://review.opendev.org/c/openstack/oslo.messaging/+/883538
Committed: https://opendev.org/openstack/oslo.messaging/commit/f20a905ea6f41399c1723f8f1cbd0bc1097b8672
Submitter: "Zuul (22348)"
Branch:    stable/yoga

commit f20a905ea6f41399c1723f8f1cbd0bc1097b8672
Author: Andrew Bogott <abogott at wikimedia.org>
Date:   Mon Dec 5 09:25:00 2022 -0600

    Increase ACK_REQUEUE_EVERY_SECONDS_MAX to exceed default kombu_reconnect_delay

    Previously the two values were the same; this caused us
    to always exceed the timeout limit ACK_REQUEUE_EVERY_SECONDS_MAX
    which results in various code paths never being traversed
    due to premature timeout exceptions.

    Also apply min/max values to kombu_reconnect_delay so it doesn't
    exceed ACK_REQUEUE_EVERY_SECONDS_MAX and break things again.

    Closes-Bug: #1993149
    Change-Id: I103d2aa79b4bd2c331810583aeca53e22ee27a49
    (cherry picked from commit 0602d1a10ac20c48fa35ad711355c79ee5b0ec77)
    (cherry picked from commit b4b49248bcfcb169f96ab2d47b5d207b1354ffa8)
    (cherry picked from commit fa3195a3459cae3f4e9be43f114ee2d5eb7a60f1)
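
The timing relationship described in the commit message above can be illustrated with a minimal sketch. This is a simplified model, not the oslo.messaging implementation: the function names are hypothetical, and the numeric values (1.0, 5.0, 4.5) are assumptions chosen only to show the "equal values" case against the "timeout exceeds delay" case.

```python
def reconnect_fits_in_timeout(reconnect_delay: float, poll_timeout: float) -> bool:
    """Return True if a reconnect can start before the poll timeout expires.

    Simplified model: after a connection failure the client waits
    `reconnect_delay` seconds before retrying, while the caller gives up
    after `poll_timeout` seconds. If the delay is not strictly shorter
    than the timeout, the timeout always fires first.
    """
    return reconnect_delay < poll_timeout


def clamp_reconnect_delay(value: float, minimum: float, maximum: float) -> float:
    """Bound a configured delay so it cannot reach the poll timeout."""
    return max(minimum, min(value, maximum))


# Assumed values, for illustration only.
old_timeout = 1.0   # timeout equal to the reconnect delay
new_timeout = 5.0   # timeout raised above the reconnect delay
delay = 1.0         # configured reconnect delay

print(reconnect_fits_in_timeout(delay, old_timeout))  # equal values: timeout wins
print(reconnect_fits_in_timeout(delay, new_timeout))  # larger timeout: reconnect can run

# An operator-supplied delay larger than the timeout is bounded below it.
print(clamp_reconnect_delay(10.0, 0.0, 4.5))
```

In this model, making the two values equal means the reconnect path is never reached, which matches the "code paths never being traversed" symptom in the commit message; bounding the configured delay keeps a later configuration change from reintroducing it.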

** Tags added: in-stable-yoga

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1993149

Title:
  VMs stay stuck in scheduling when rabbitmq leader unit is down

Status in OpenStack RabbitMQ Server Charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in oslo.messaging:
  Fix Released
Status in python-oslo.messaging package in Ubuntu:
  Fix Released
Status in python-oslo.messaging source package in Jammy:
  Fix Released
Status in python-oslo.messaging source package in Kinetic:
  Fix Released

Bug description:
  When testing rabbitmq-server HA in our OpenStack Yoga cloud
  environment (Rabbitmq Server release 3.9/stable) we faced the
  following issues:

  - When the leader unit is down, we are unable to launch any VMs;
  instances we try to launch stay stuck in the 'BUILD' state.

  - While checking the logs, we see that several OpenStack services have
  issues communicating with the rabbitmq-server

  - After restarting all the services using rabbitmq (like Nova, Cinder,
  Neutron etc) the issue gets resolved and the VMs can be launched
  successfully

  The corresponding logs are available at:
  https://pastebin.ubuntu.com/p/Bk3yktR8tp/

  We also observed the same for the rabbitmq-server unit that is listed
  first in the 'nova.conf' file; after restarting that rabbitmq unit,
  scheduling of VMs works fine again.

  This can also be seen in this part of the log:
  "Reconnected to AMQP server on 192.168.34.251:5672 via [amqp] client with port 41922."

  ====== Ubuntu SRU Details =======

  [Impact]
  Active/active HA for rabbitmq is broken when a node goes down. 

  [Test Case]
  Deploy openstack with 3 units of rabbitmq in active/active HA.

  [Regression Potential]
  Due to the criticality of this issue, I've decided to revert the
  upstream change that is causing the problem as a stop-gap until a
  proper fix is in place. The change being reverted came in via
  https://bugs.launchpad.net/oslo.messaging/+bug/1935864. As a result of
  the revert, we may see the performance degradation in polling
  described in that bug.
