[Bug 1905965] Re: n-cpu raising MessageUndeliverable when replying to RPC call
Corey Bryant
1905965 at bugs.launchpad.net
Tue Aug 24 17:56:35 UTC 2021
This is being fixed via LP:1940858
** Changed in: python-oslo.messaging (Ubuntu Focal)
Status: Triaged => Fix Committed
** Changed in: cloud-archive/victoria
Status: Triaged => Fix Committed
** Changed in: cloud-archive/ussuri
Status: Triaged => Fix Committed
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1905965
Title:
n-cpu raising MessageUndeliverable when replying to RPC call
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Committed
Status in Ubuntu Cloud Archive victoria series:
Fix Committed
Status in oslo.messaging:
Confirmed
Status in python-oslo.messaging package in Ubuntu:
Fix Released
Status in python-oslo.messaging source package in Focal:
Fix Committed
Bug description:
Summary
=======
Recently, on train/OSP16.1 we noticed `MessageUndeliverable` exceptions
when replying to RPC calls in nova [1]. I think that those exceptions
raised within nova aren't legitimate and point to a new bug in
oslo.messaging.
Indeed, in oslo.messaging, under normal circumstances,
`MessageUndeliverable` is raised when a message is sent to a reply
queue that doesn't exist for some reason. This behaviour was
introduced in oslo.messaging 10.2.0 by adding the mandatory flag for
direct sending [2][3]. It is expected behaviour, meant to detect early
that a reply queue does not exist.
However, I think that the exceptions raised within nova are due to a
limitation of RabbitMQ's direct reply-to feature.
I also think that this feature (raising MessageUndeliverable when the
mandatory flag is set) introduced an unexpected side effect in Nova,
where RPC clients time out because the server's reply messages end up
unrouted and are therefore never delivered to the client.
Observed Bug
============
Here in Nova, on the server side (nova-compute), we can observe the
following traceback:
```
2020-10-30 16:32:54.059 8 ERROR oslo_messaging.rpc.server [req-99a5cda6-7c8e-4cba-88f4-37b3447d4dbd c767e1727b1348449b355ea2f6c529f3 0b0b4de19ae94554a8f8b2c949306456 - default default] MessageUndeliverable error, source exception: Basic.return: (312) NO_ROUTE, routing_key: reply_d2ac09b0671840d39da6a9c718b5a63f, exchange: : : oslo_messaging.exceptions.MessageUndeliverable
```
Still in Nova, on the client side (nova-api), we can observe the
following traceback:
```
c767e1727b1348449b355ea2f6c529f3 0b0b4de19ae94554a8f8b2c949306456 - default default] Unexpected exception in API method: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID cb84a7fc264748aa9f19ddab48311975
```
The client never received the response and eventually reached a
timeout because the message wasn't routed ("MessageUndeliverable
error, source exception: Basic.return: (312) NO_ROUTE, routing_key:
reply_d2ac09b0671840d39da6a9c718b5a63f, exchange: : :").
That led us to a Nova issue where volume attachment fails: Cinder
shows the volume as available, but Nova shows it as attached. nova-api
calls `reserve_block_device_name` on the compute over RPC and then
times out after 10 minutes because the server never appears to respond
to the call. In fact the server does respond, but the reply message is
not routed to its destination (cf. the server traceback above).
This leaves a block_device_mapping record in the database, so the
volume stays listed in the os-volume_attachments API and in
`nova volume-attachments` output. It also causes subsequent attempts
to attach the volume to fail because an active bdm record already
exists for it.
Details about the `direct_mandatory_flag` feature of oslo.messaging
===================================================================
In a normal situation, the mandatory flag tells the server how to
react if a message cannot be routed to a queue. Specifically, if the
flag is set and, after running the bindings, the message was placed on
zero queues, then the message is returned to the sender (with a
basic.return). If the flag had not been set under the same
circumstances, the server would silently drop the message.
With the mandatory flag disabled, a missing reply queue surfaces only
as a MessagingTimeout: if `direct_mandatory_flag` [4] is True then
`MessageUndeliverable` is raised immediately when a reply queue
doesn't exist; if the option doesn't exist (oslo.messaging < 10.2.0)
or is set to False, you have to wait for the default timeout.
The Root Cause
==============
I think that with the RPC server's direct reply, the
`direct_mandatory_flag` feature doesn't work as expected.
Let's start by describing a bit how the RPC server works with
RabbitMQ.
The RPC server(s) consume requests from a request queue and then send
replies to each client using the queue named by the client in the
reply-to header.
A client has two options (a minimal sketch of the first option
follows this list):
- declare a single-use queue for each request-response pair;
- create a long-lived queue for its replies.
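As an illustration of the first option, here is a minimal sketch (with
pika rather than oslo.messaging; the request queue name is an assumed
placeholder) of a single-use, exclusive reply queue declared per
request:
```
import uuid
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# A broker-named, exclusive queue: it is single-use and disappears with
# the connection.
result = channel.queue_declare(queue='', exclusive=True)
reply_queue = result.method.queue

channel.basic_publish(
    exchange='',
    routing_key='rpc_server_queue',          # the server's request queue (assumed name)
    properties=pika.BasicProperties(
        reply_to=reply_queue,                 # tell the server where to send the reply
        correlation_id=str(uuid.uuid4()),
    ),
    body=b'{"method": "reserve_block_device_name"}',
)
```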
The direct reply-to feature allows RPC clients to receive replies
directly from their RPC server, without going through a reply queue.
"Directly" here still means going through the same connection and a
RabbitMQ node; there is no direct network connection between RPC
client and RPC server processes.
The RPC server will then see a reply-to property with a generated
name. It should publish to the default exchange ("") with the routing
key set to this value (i.e. just as if it were sending to a reply
queue as usual). The message will then be sent straight to the client
consumer.
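For comparison, here is a minimal sketch of the direct reply-to
pattern with pika (again not oslo.messaging; the request queue name is
an assumption):
```
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

def on_reply(ch, method, properties, body):
    print('got reply:', body)

# The client must consume from the pseudo-queue in auto-ack mode on the
# same channel it publishes the request on.
channel.basic_consume('amq.rabbitmq.reply-to', on_reply, auto_ack=True)
channel.basic_publish(
    exchange='',
    routing_key='rpc_server_queue',   # the server's request queue (assumed name)
    properties=pika.BasicProperties(reply_to='amq.rabbitmq.reply-to'),
    body=b'request',
)

# The server's handler would reply with something like:
#   ch.basic_publish(exchange='', routing_key=properties.reply_to, body=b'reply')
# where properties.reply_to is the generated amq.rabbitmq.reply-to.* name.

connection.process_data_events(time_limit=5)   # wait briefly for the reply
connection.close()
```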
However, this feature has some caveats and limitations [5]. In
particular, the name `amq.rabbitmq.reply-to` is used in
`basic.consume` and in the `reply-to` property as if it were a queue;
however, it is not. It cannot be deleted, and it does not appear in
the management plugin or in `rabbitmqctl list_queues`.
If the RPC server publishes with the mandatory flag set, then
`amq.rabbitmq.reply-to.*` is treated as not being a queue [5]; i.e. if
the server publishes only to this name, the message is considered "not
routed" [5][6] and a `basic.return` is sent back to the server.
And we are now back to our previously observed behaviour: nova-
compute's RPC server, replying directly to the client's `reply-to`
address, saw these messages returned as not routed (cf. the server
traceback above). After a while the client reached its timeout because
the reply was never delivered to it. As a side effect this left a
block_device_mapping record in the database, keeping it listed in the
os-volume_attachments API and in `nova volume-attachments` output, and
causing subsequent attempts to attach the volume to fail because an
active bdm record already exists for it.
Solutions
=========
Workaround
~~~~~~~~~~
I think this bug could be worked around easily by disabling the
`direct_mandatory_flag` option; it is a simple way to unblock those
who face a similar issue.
If customers/operators see this repeatedly in their environment, they
can try disabling the `[oslo_messaging_rabbit]/direct_mandatory_flag`
option in the computes' `nova.conf`, as shown below.
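For example, the change would look like this in the compute node's
`nova.conf` (the section and option name come from oslo.messaging;
treat this as an illustrative sketch of the configuration change):
```
[oslo_messaging_rabbit]
# Stop publishing replies with the mandatory flag; unroutable replies are then
# silently dropped by the broker instead of triggering MessageUndeliverable.
direct_mandatory_flag = False
```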
Short Term Solution
~~~~~~~~~~~~~~~~~~~
The `direct_mandatory_flag` should be disabled by default to ensure
that we don't see a flood of similar issues from projects other than
Nova. Potentially every user of oslo.messaging's RPC server can hit a
similar issue, since this is the default behaviour: AFAIK the server
will always send a direct response using `reply-to`.
Middle Term Solution
~~~~~~~~~~~~~~~~~~~~
I think that the `direct_mandatory_flag` usage should be
removed/disabled in `direct_send` [7] until we can determine whether
we use the `reply-to` feature in parallel. I think we are in a grey
area here: it looks like the direct messaging of oslo.messaging
implicitly introduces the usage of `reply-to` [8], but it isn't 100%
clear to me, so feedback from more knowledgeable people would be
appreciated.
Anyway, I think that explicit is better than implicit, and if any
doubts remain then they must be cleared up.
Long Term Solution
~~~~~~~~~~~~~~~~~~
I think that we shouldn't rely on anything other than real queues.
Real queues are a bit more costly in terms of performance, and each
option has some drawbacks:
- a single-use queue for each request-response pair can be expensive to create and then delete;
- a long-lived queue for replies can be fiddly to manage, especially if the client itself is not long-lived.
However, I think we should avoid using "non-real queues" and give
priority to reliability/stability over performance. Real queues can
also be monitored more easily than direct reply-to. That would allow
operators to be a bit more proactive about similar issues by
monitoring reply queues as soon as strange behaviour appears between
RPC client and server.
RabbitMQ offers many HA features [9] that we could benefit from.
Quorum queues [10] in particular may be a track to follow: they could
allow us to use real queues for RPC responses while continuing to use
the `direct_mandatory_flag` to monitor that everything is OK.
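As a rough illustration (a sketch with pika, not a proposal for
oslo.messaging's actual implementation; the queue name is simply taken
from the traceback above), declaring a reply queue as a quorum queue
only requires an extra argument:
```
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Quorum queues must be durable and cannot be exclusive or auto-delete.
channel.queue_declare(
    queue='reply_d2ac09b0671840d39da6a9c718b5a63f',
    durable=True,
    arguments={'x-queue-type': 'quorum'},   # replicated across cluster nodes via Raft
)
connection.close()
```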
This could lead to important changes in our design, so I think it
should be discussed in a dedicated blueprint so that we can come up
with the best possible solution.
Conclusion
==========
I think we will soon see similar issues appear outside of Nova too,
due to the behaviour described here. However, a simple workaround is
available for now.
I think that if services start to observe similar symptoms, they
should disable the `direct_mandatory_flag` feature in their config
ASAP.
I don't think it's necessary to blacklist the versions of
oslo.messaging that contain this feature, because it can be disabled,
and blacklisting would deprive us of other needed bugfixes released
since.
Fortunately, some tracks to follow are available to improve things.
Hopefully this will help us to work around this corner case.
Thanks for reading!
Hervé Beraud (hberaud)
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1898578
[2] https://github.com/openstack/oslo.messaging/commit/b7e9faf6590086b79f9696183b5c296ecc12b7b6
[3] https://docs.openstack.org/oslo.messaging/latest/admin/rabbit.html#exchange
[4] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L172,L177
[5] https://www.rabbitmq.com/direct-reply-to.html#limitations
[6] https://www.rabbitmq.com/amqp-0-9-1-reference.html#constants
[7] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1313,L1324
[8] https://docs.openstack.org/oslo.messaging/latest/admin/AMQP1.0.html#direct-messaging
[9] https://www.rabbitmq.com/ha.html
[10] https://www.rabbitmq.com/quorum-queues.html
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1905965/+subscriptions