[Bug 1448650] [NEW] rpc.server do not consume messages after message acknowledge failure

Thu Jun 25 07:52:45 UTC 2015

You have been subscribed to a public bug by Billy Olsen (billy-olsen):

def start(self):

    @excutils.forever_retry_uncaught_exceptions
    def _executor_thread():
        try:
            while self._running:
                incoming = self.listener.poll()
                if incoming is not None:
                    self._dispatch(incoming)
        except greenlet.GreenletExit:
            return

The Connection class does a lot of work to ensure that operations on a connection can be recovered after a reconnection. However, after the incoming message has been received, a connection error can be raised during message acknowledgement and caught by excutils.forever_retry_uncaught_exceptions. At that point do_consume is False, so the connection calls drain_events without first "registering" the consumers on the queues. kombu.Connection.drain_events then establishes a new connection instead of raising a connection error.
The relevant kombu code is listed below.
def drain_events(self, **kwargs):
    return self.transport.drain_events(self.connection, **kwargs)

@property
def connection(self):
    if not self._closed:
        if not self.connected:
            self.declared_entities.clear()
            self._default_channel = None
            self._connection = self._establish_connection()
            self._closed = False
        return self._connection
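
To make the failure mode concrete, here is a toy simulation. It does not use kombu at all: FakeKombuConnection, drop() and consume_once() are hypothetical names invented for illustration, and the only behavior borrowed from the code above is that the connection property silently re-establishes the link.

```python
class FakeKombuConnection:
    """Hypothetical stand-in for kombu.Connection (not the real class)."""

    def __init__(self):
        self.connected = True
        self.consumers = set()   # queues registered on the current link
        self.reconnects = 0

    def drop(self):
        # Simulate losing the broker: registrations die with the link.
        self.connected = False
        self.consumers.clear()

    @property
    def connection(self):
        # Mirrors the quoted property: reconnect silently, raise nothing.
        if not self.connected:
            self.connected = True
            self.reconnects += 1
        return self

    def drain_events(self):
        conn = self.connection   # may reconnect behind the caller's back
        return ["msg"] if conn.consumers else []


def consume_once(conn, do_consume):
    # Assumed shape of the consume step: register only when do_consume is set.
    if do_consume:
        conn.consumers.add("queue-a")
    return conn.drain_events()


conn = FakeKombuConnection()
first = consume_once(conn, do_consume=True)    # registers, then receives
conn.drop()                                    # e.g. the ack fails here
second = consume_once(conn, do_consume=False)  # reconnects, nothing registered
```

After the drop, the second call succeeds without any error, the connection reports itself as connected again, and yet nothing is delivered because no consumer was re-registered.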

---------------------------

[Impact]

This patch addresses an issue where the underlying kombu library disconnects from the rabbitmq-servers, which prevents oslo.messaging
from properly going through the reconnect sequence, including the recreation of expected queues. This causes messages to be lost and leaves the cloud generally dysfunctional until services are restarted.
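
One way to picture the kind of guard that avoids this state is sketched below. It is an illustration only, not the patch itself: FakeConnection, Consumer and RecoverableConnectionError are hypothetical names, and the sketch assumes the consume path can check the transport's connected flag before draining events.

```python
class RecoverableConnectionError(Exception):
    """Hypothetical error that the retry logic treats as 'reconnect now'."""


class FakeConnection:
    """Hypothetical stand-in for the transport; not a kombu class."""

    def __init__(self):
        self.connected = True
        self.consumers = set()


class Consumer:
    def __init__(self, conn, queues):
        self.conn = conn
        self.queues = list(queues)
        self.do_consume = True

    def reconnect(self):
        # Full reconnect sequence: restore the link and force
        # re-registration of every consumer on the next consume.
        self.conn.connected = True
        self.do_consume = True

    def _consume(self):
        # Guard: refuse to drain on a dead transport, so the caller
        # goes through reconnect() instead of a silent re-establish.
        if not self.conn.connected:
            raise RecoverableConnectionError("transport disconnected")
        if self.do_consume:
            self.conn.consumers.update(self.queues)
            self.do_consume = False
        return sorted(self.conn.consumers)

    def consume(self):
        try:
            return self._consume()
        except RecoverableConnectionError:
            self.reconnect()
            return self._consume()


consumer = Consumer(FakeConnection(), ["q1"])
before = consumer.consume()

# Simulate the broker going away after a failed acknowledgement.
consumer.conn.connected = False
consumer.conn.consumers.clear()
after = consumer.consume()
```

Because the disconnected state is surfaced as an error rather than hidden, the consumers are registered again before any events are drained.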

[Test Case]

Note: these steps are for trusty-icehouse, including the latest oslo.messaging
library (1.3.0-0ubuntu1.1 at the time of this writing).

Deploy an OpenStack cloud w/ multiple rabbit nodes and then abruptly
kill one of the rabbit nodes (e.g. force panic, etc). Observe that the
nova services do detect that the node went down and report that they are
reconnected, but messages are still reported as timed out, nova
service-list still reports compute nodes as down, etc.

[Regression Potential]

There is the possibility that there will be more reconnect attempts from
the oslo.messaging library if there is a false positive in the
underlying kombu connection reported as disconnected. This should be
unlikely since this is bringing the oslo.messaging code into sync with
the underlying library, but it is a possibility.

[Other Info]

The attempt to drive reconnection logic was fixed in a recent SRU of
oslo.messaging (version 1.3.0-0ubuntu1.1). This is an additional fix
that is required to keep the oslo.messaging library from entering a
zombie connection state.

** Affects: oslo.messaging
     Importance: Medium
     Assignee: Mehdi Abaakouk (sileht)
         Status: Fix Released

** Affects: oslo.messaging (Ubuntu)
     Importance: Undecided
         Status: New

-- 
rpc.server do not consume messages after message acknowledge failure
https://bugs.launchpad.net/bugs/1448650
You received this bug notification because you are a member of Ubuntu Sponsors Team, which is subscribed to the bug report.