[Bug 1905965] Re: n-cpu raising MessageUndeliverable when replying to RPC call

Corey Bryant 1905965 at bugs.launchpad.net
Tue Aug 24 17:56:35 UTC 2021


This is being fixed via LP:1940858

** Changed in: python-oslo.messaging (Ubuntu Focal)
       Status: Triaged => Fix Committed

** Changed in: cloud-archive/victoria
       Status: Triaged => Fix Committed

** Changed in: cloud-archive/ussuri
       Status: Triaged => Fix Committed

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1905965

Title:
  n-cpu raising MessageUndeliverable when replying to RPC call

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in oslo.messaging:
  Confirmed
Status in python-oslo.messaging package in Ubuntu:
  Fix Released
Status in python-oslo.messaging source package in Focal:
  Fix Committed

Bug description:
  Summary
  =======

  Recently, on Train/OSP16.1 we noticed `MessageUndeliverable` exceptions
  when replying to RPC calls in Nova [1]. I think that these exceptions
  raised within Nova aren't legitimate and point to a new bug in
  oslo.messaging.

  Indeed, in oslo.messaging, under normal circumstances,
  `MessageUndeliverable` is raised when a message is sent to a reply
  queue that doesn't exist for some reason. This new behaviour was
  introduced in oslo.messaging 10.2.0 by adding the mandatory flag to
  direct sending [2][3]. It is expected behaviour, meant to detect early
  that the reply queue does not exist.

  However, I think that the exceptions raised within Nova are due to a
  limitation of RabbitMQ's RPC direct reply-to feature.

  Also, I think that this feature (raising `MessageUndeliverable` when the
  mandatory flag is set) introduced unexpected side effects in Nova, where
  the RPC client times out because the server's reply messages end up
  unrouted and therefore undelivered to the client.

  Observed Bug
  ============

  In Nova, on the server side (nova-compute), we can observe the
  following traceback:

  ```
  2020-10-30 16:32:54.059 8 ERROR oslo_messaging.rpc.server [req-99a5cda6-7c8e-4cba-88f4-37b3447d4dbd c767e1727b1348449b355ea2f6c529f3 0b0b4de19ae94554a8f8b2c949306456 - default default] MessageUndeliverable error, source exception: Basic.return: (312) NO_ROUTE, routing_key: reply_d2ac09b0671840d39da6a9c718b5a63f, exchange: : : oslo_messaging.exceptions.MessageUndeliverable
  ```

  Still in Nova, on the client side (nova-api), we can observe the
  following traceback:

  ```
  c767e1727b1348449b355ea2f6c529f3 0b0b4de19ae94554a8f8b2c949306456 - default default] Unexpected exception in API method: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID cb84a7fc264748aa9f19ddab48311975
  ```

  The client never received the response and then reached a timeout
  because the message wasn't routed "MessageUndeliverable error, source
  exception: Basic.return: (312) NO_ROUTE, routing_key:
  reply_d2ac09b0671840d39da6a9c718b5a63f, exchange: : : ".

  That leads to a Nova issue where volume attachment fails: Cinder shows
  the volume as available while Nova shows it as attached. nova-api calls
  `reserve_block_device_name` over RPC on the compute node and then times
  out after 10 minutes because the server never seems to respond to the
  call. In fact the server does respond, but the reply is not routed to
  its destination (cf. the server traceback above).

  This leaves a block_device_mapping record in the database, leaving it
  listed in the os-volume_attachments API and `nova volume-attachments`
  output. This also causes the subsequent attempts to attach the volume
  to fail as we already have an active bdm record for it.

  Details about the `direct_mandatory_flag` feature of oslo.messaging
  ===================================================================

  In a normal situation this mandatory flag tells the server how to
  react if a message cannot be routed to a queue. Specifically, if
  mandatory is set and after running the bindings the message was placed
  on zero queues then the message is returned to the sender (with a
  basic.return). If mandatory had not been set under the same
  circumstances the server would silently drop the message.
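
  To make the mechanics concrete, here is a minimal sketch of that
  behaviour using the plain `pika` client rather than oslo.messaging
  itself; the routing key below is hypothetical and deliberately matches
  no queue:

  ```
  # Minimal sketch with plain pika (not oslo.messaging); the routing key is
  # a made-up name that matches no queue, so the broker returns the message.
  import pika

  connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  channel = connection.channel()
  channel.confirm_delivery()  # required so a basic.return surfaces as an error

  try:
      # Publish to the default exchange ("") with mandatory=True: if the
      # message is placed on zero queues, the broker sends basic.return
      # (312 NO_ROUTE) instead of silently dropping it.
      channel.basic_publish(
          exchange="",
          routing_key="reply_does_not_exist",
          body=b"fake RPC reply",
          mandatory=True,
      )
  except pika.exceptions.UnroutableError:
      # Roughly what oslo.messaging surfaces as MessageUndeliverable.
      print("basic.return received: message was not routed to any queue")
  finally:
      connection.close()
  ```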

  With the mandatory flag disabled, if a reply queue doesn't exist you end
  up with a MessagingTimeout. If `direct_mandatory_flag` [4] is set to
  True, then `MessageUndeliverable` is raised immediately when a reply
  queue doesn't exist; if this option doesn't exist (oslo.messaging <
  10.2.0) or is set to False, then you have to wait for the default
  timeout instead.

  The Root Cause
  ==============

  I think that with the RPC server's direct reply, the
  `direct_mandatory_flag` feature doesn't work as expected.

  Let's start by describing a bit how an RPC server works with
  RabbitMQ.

  The RPC server(s) consume requests from a request queue and then send
  replies to each client using the queue named by the client in the
  reply-to header.

  A client has two options (both are sketched below):
  - declare a single-use queue for each request-response pair;
  - create a long-lived queue for its replies.
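
  A rough sketch of these two options with the plain `pika` client (the
  queue names are hypothetical; this is not oslo.messaging code):

  ```
  # Sketch of the two reply-queue strategies with plain pika (illustrative).
  import pika

  connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  channel = connection.channel()

  # Option 1: single-use, server-named, exclusive queue per request/response
  # pair; it disappears automatically when the connection closes.
  result = channel.queue_declare(queue="", exclusive=True)
  single_use_reply_queue = result.method.queue

  # Option 2: long-lived, explicitly named reply queue reused for all the
  # replies of this client ("reply_myclient" is a made-up name).
  channel.queue_declare(queue="reply_myclient")
  ```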

  The direct reply-to feature allows RPC clients to receive replies
  directly from their RPC server, without going through a reply queue.
  "Directly" here still means going through the same connection and a
  RabbitMQ node; there is no direct network connection between RPC
  client and RPC server processes.

  The RPC server will then see a reply-to property with a generated
  name. It should publish to the default exchange ("") with the routing
  key set to this value (i.e. just as if it were sending to a reply
  queue as usual). The message will then be sent straight to the client
  consumer.
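
  The pattern looks roughly like this with the plain `pika` client; this
  is a sketch of RabbitMQ's documented direct reply-to usage, not
  oslo.messaging's actual implementation, and the request queue name is
  made up:

  ```
  # Sketch of RabbitMQ's direct reply-to pattern with plain pika (illustrative).
  import pika

  conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  channel = conn.channel()
  channel.queue_declare(queue="rpc_requests")  # hypothetical request queue

  def on_reply(ch, method, properties, body):
      print("got reply:", body)

  # Client side: consume from the pseudo-queue in no-ack mode *before*
  # publishing, then send the request with reply_to pointing at it.
  channel.basic_consume(
      queue="amq.rabbitmq.reply-to", on_message_callback=on_reply, auto_ack=True
  )
  channel.basic_publish(
      exchange="",
      routing_key="rpc_requests",
      properties=pika.BasicProperties(reply_to="amq.rabbitmq.reply-to"),
      body=b"request",
  )
  # channel.start_consuming() would then block waiting for the reply.

  # Server side (inside its request handler) replies to the generated name,
  # e.g. "amq.rabbitmq.reply-to.g1h2AA...", found in the request properties:
  #   channel.basic_publish(exchange="",
  #                         routing_key=properties.reply_to,
  #                         body=b"reply")
  # If the server adds mandatory=True to that publish, the broker may answer
  # with basic.return (312 NO_ROUTE), because the pseudo-queue is not a real
  # queue -- which is the limitation discussed below.
  ```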

  However, this feature has some caveats and limitations [5]...

  In particular, the name `amq.rabbitmq.reply-to` is used in
  `basic.consume` and in the `reply-to` property as if it were a queue;
  however, it is not. It cannot be deleted, and does not appear in the
  management plugin or in `rabbitmqctl list_queues`.

  If the RPC server publishes with the mandatory flag set, then
  `amq.rabbitmq.reply-to.*` is not treated as a queue [5]; i.e. if the
  server only publishes to this name, the message will be considered
  "not routed" [5][6] and a `basic.return` will be sent since the
  mandatory flag was set.

  And we are now back to our previously observed behaviour: nova-
  compute's RPC server, replying directly to the client's `reply-to`
  address, saw its messages treated as not routed (cf. the server
  traceback above). After a while the client reached the timeout since
  the reply was never delivered to it. As a side effect this left a
  block_device_mapping record in the database, leaving it listed in the
  os-volume_attachments API and in the `nova volume-attachments` output.
  This also causes subsequent attempts to attach the volume to fail, as
  we already have an active bdm record for it.

  Solutions
  =========

  Workaround
  ~~~~~~~~~~

  I think this bug could easily be worked around by disabling the
  `direct_mandatory_flag` option; that could be a simple way to unblock
  those who face a similar issue.

  Customers/operators seeing this repeatedly in their environment could
  try disabling the `[oslo_messaging_rabbit]/direct_mandatory_flag`
  option in the computes' `nova.conf`, as sketched below.
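
  For example, a minimal sketch of the relevant `nova.conf` snippet on
  the compute nodes (followed by a restart of nova-compute):

  ```
  [oslo_messaging_rabbit]
  # Fall back to the pre-10.2.0 behaviour: replies to missing queues are
  # silently dropped and the caller waits for rpc_response_timeout instead
  # of getting an immediate basic.return.
  direct_mandatory_flag = False
  ```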

  Short Term Solution
  ~~~~~~~~~~~~~~~~~~~

  The `direct_mandatory_flag` should be disabled by default to ensure
  that we don't get a flood of similar issues from projects other than
  Nova. Possibly every user of oslo.messaging's RPC server can face a
  similar issue, since this is the default behaviour; AFAIK the server
  will always send a direct response using `reply-to`.

  Middle Term Solution
  ~~~~~~~~~~~~~~~~~~~~

  I think that the `direct_mandatory_flag` usage should be
  removed/disabled in `direct_send` [7] until we can determine whether
  the `reply-to` feature is used in parallel. I think we are in a grey
  area here: it looks like the direct messaging of oslo.messaging
  implicitly introduces the usage of `reply-to` [8], however this isn't
  100% clear to me, so feedback from more knowledgeable people would be
  appreciated.

  Anyway, I think that explicit is better than implicit, and if any
  doubts remain then they must be cleared up.

  Long Term Solution
  ~~~~~~~~~~~~~~~~~~

  I think that we shouldn't rely on anything other than real queues.
  Real queues are a bit more costly to use in terms of performance, and
  each solution has some drawbacks:

  - single-use queue for each request-response pair can be expensive to create and then delete.
  - long-lived queue for replying can be fiddly to manage, especially if the client itself is not long-lived.

  However, I think we should avoid using "non-real queues" and give
  priority to reliability/stability over performance.

  Also, real queues can be monitored more easily than direct reply-to.
  That would allow operators to be a bit more proactive on similar
  issues by monitoring reply queues as soon as strange behaviour appears
  between the RPC client and server.

  RabbitMQ offers many HA features [9] that we could benefit from,
  especially quorum queues [10]. They may be a track to follow, allowing
  us to use real queues for RPC responses while monitoring that
  everything is OK by continuing to use the `direct_mandatory_flag`.
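
  For illustration, declaring a quorum-queue-backed reply queue with
  plain `pika` would look roughly like this (a sketch only; the queue
  name is hypothetical and this is not what oslo.messaging does today):

  ```
  # Sketch: a reply queue declared as a RabbitMQ quorum queue (illustrative).
  import pika

  conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  channel = conn.channel()

  # Quorum queues must be durable and cannot be exclusive or auto-delete.
  channel.queue_declare(
      queue="reply_myclient",                # hypothetical reply queue name
      durable=True,
      arguments={"x-queue-type": "quorum"},
  )
  ```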

  This could lead to significant changes in our design, so I think it
  should be discussed in a dedicated blueprint to allow us to arrive at
  the best possible solution.

  Conclusion
  ==========

  I think we will soon see similar issues appear outside of Nova due to
  the behaviour described here. However, a simple workaround is
  available for now.

  I think that services which start to observe similar symptoms should
  disable the `direct_mandatory_flag` feature in their config ASAP.

  I don't think it's necessary to blacklist the versions of
  oslo.messaging that contain this feature, because it can be disabled,
  and blacklisting would deprive us of other needed bugfixes released
  since.

  Fortunately, some tracks are available to follow to improve things.

  Hopefully this will help us to address this corner case.

  Thanks for your reading!

  Hervé Beraud (hberaud)

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1898578
  [2] https://github.com/openstack/oslo.messaging/commit/b7e9faf6590086b79f9696183b5c296ecc12b7b6
  [3] https://docs.openstack.org/oslo.messaging/latest/admin/rabbit.html#exchange
  [4] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L172,L177
  [5] https://www.rabbitmq.com/direct-reply-to.html#limitations
  [6] https://www.rabbitmq.com/amqp-0-9-1-reference.html#constants
  [7] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1313,L1324
  [8] https://docs.openstack.org/oslo.messaging/latest/admin/AMQP1.0.html#direct-messaging
  [9] https://www.rabbitmq.com/ha.html
  [10] https://www.rabbitmq.com/quorum-queues.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1905965/+subscriptions




More information about the Ubuntu-openstack-bugs mailing list