[Bug 1993149] Re: VMs stay stuck in scheduling when rabbitmq leader unit is down
Andrew Bogott
1993149 at bugs.launchpad.net
Mon Dec 5 02:20:03 UTC 2022
tl;dr: Adding this config seems to resolve the issue for me:
[oslo_messaging_rabbit]
kombu_reconnect_delay=0.1
long version:
I've been staring at [bdcf915e] off and on for several days, and it looks right to me, in theory. That section of code consists of rather a lot of nested timeouts, and this bug looks to me like a case of inner-loop timeouts firing before their outer-loop timeouts have a chance to.
In particular, I think the issue is in this scrap of
kombu.connection._ensure_connection:

    def on_error(exc, intervals, retries, interval=0):
        round = self.completes_cycle(retries)
        if round:
            interval = next(intervals)
        if errback:
            errback(exc, interval)
        self.maybe_switch_next()  # select next host
        return interval if round else 0

If errback (a callback passed in by the oslo driver) throws an exception
100% of the time (as it seems to post-[bdcf915e]) then
maybe_switch_next() is never reached and failover never happens. I can
prevent that by ensuring that
oslo_messaging_rabbit->kombu_reconnect_delay is less than
ACK_REQUEUE_EVERY_SECONDS_MAX (which is now one of our max timeouts
thanks to [bdcf915e]).
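
To illustrate the failure mode, here is a minimal toy sketch. It is not kombu or oslo.messaging code: FakeConnection, raising_errback, quiet_errback and hosts_tried are hypothetical names, the host names are made up, and the assumption that errback always raises is only what the behaviour above suggests. It shows that when the callback raises, the host-switching line after it never runs:

```python
import itertools


class FakeConnection:
    """Toy stand-in for a kombu connection; NOT the real class."""

    def __init__(self, hosts):
        self.hosts = hosts
        self.index = 0

    @property
    def current_host(self):
        return self.hosts[self.index]

    def completes_cycle(self, retries):
        # True once every host in the list has been tried in this round.
        return (retries + 1) % len(self.hosts) == 0

    def maybe_switch_next(self):
        self.index = (self.index + 1) % len(self.hosts)

    def on_error(self, exc, intervals, retries, errback=None):
        # Same shape as the snippet quoted above.
        interval = 0
        round_done = self.completes_cycle(retries)
        if round_done:
            interval = next(intervals)
        if errback:
            errback(exc, interval)
        self.maybe_switch_next()  # skipped whenever errback raised
        return interval if round_done else 0


def raising_errback(exc, interval):
    # Hypothetical callback that always raises (the suspected behaviour).
    raise TimeoutError("outer timeout expired")


def quiet_errback(exc, interval):
    # Hypothetical callback that returns normally.
    pass


def hosts_tried(errback, attempts=3):
    """Record which host each attempt would connect to."""
    conn = FakeConnection(["rabbit-0", "rabbit-1", "rabbit-2"])
    intervals = itertools.repeat(1)
    seen = []
    for retries in range(attempts):
        seen.append(conn.current_host)
        try:
            conn.on_error(ConnectionError("down"), intervals, retries, errback)
        except TimeoutError:
            pass  # caller abandons this attempt and retries later
    return seen


print(hosts_tried(raising_errback))
print(hosts_tried(quiet_errback))
```

In this sketch the raising callback keeps every attempt pinned to the first host, while the quiet one rotates through all three, which matches the "failover never happens" symptom.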
I'm not 100% convinced that this is the correct fix since it's easy to
luck your way out of a timing bug, but it has the advantage of not
requiring a package upgrade.
I also note that kombu_reconnect_delay is only used in one section of
code, prefaced with:
# TODO(sileht): Check if this is useful since we
# use kombu for HA connection, the interval_step
# should sufficient, because the underlying kombu transport
# connection object freed.
...so maybe we can rip out that code and remove kombu_reconnect_delay
entirely (which would also resolve the timeout contention).
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1993149
Title:
VMs stay stuck in scheduling when rabbitmq leader unit is down
Status in OpenStack RabbitMQ Server Charm:
Triaged
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
Fix Released
Status in Ubuntu Cloud Archive zed series:
Fix Released
Status in oslo.messaging:
New
Status in python-oslo.messaging package in Ubuntu:
Fix Released
Status in python-oslo.messaging source package in Jammy:
Fix Released
Status in python-oslo.messaging source package in Kinetic:
Fix Released
Bug description:
When testing rabbitmq-server HA in our OpenStack Yoga cloud
environment (Rabbitmq Server release 3.9/stable) we faced the
following issues:
- When the leader unit is down we are unable to launch any VMs and the
launched ones stay stuck in the 'BUILD' state.
- While checking the logs we see that several OpenStack services have
issues communicating with the rabbitmq-server
- After restarting all the services using rabbitmq (like Nova, Cinder,
Neutron etc) the issue gets resolved and the VMs can be launched
successfully
The corresponding logs are available at:
https://pastebin.ubuntu.com/p/Bk3yktR8tp/
We also observed the same behavior for the rabbitmq-server unit that is
listed first in the 'nova.conf' file; after restarting that rabbitmq
unit, scheduling of VMs works fine again.
This can also be seen from this part of the log:
"Reconnected to AMQP server on 192.168.34.251:5672 via [amqp] client with port 41922."
====== Ubuntu SRU Details =======
[Impact]
Active/active HA for rabbitmq is broken when a node goes down.
[Test Case]
Deploy openstack with 3 units of rabbitmq in active/active HA.
[Regression Potential]
Due to the criticality of this issue, I've decided to revert the upstream change that is causing the problem as a stop-gap until a proper fix is in place. That upstream change came in via https://bugs.launchpad.net/oslo.messaging/+bug/1935864. As a result of the revert we may see performance degradation in polling as described in that bug.
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1993149/+subscriptions