[Bug 1448650] Re: rpc.server do not consume messages after message acknowledge failure

Thu Jun 25 04:44:24 UTC 2015

** Description changed:

  def start(self):
  
-     @excutils.forever_retry_uncaught_exceptions
-     def _executor_thread():
-         try:
-     	    while self._running:
- 	        incoming = self.listener.poll()
- 	        if incoming is not None:
- 		    self._dispatch(incoming)
-         except greenlet.GreenletExit:
- 	    return
+     @excutils.forever_retry_uncaught_exceptions
+     def _executor_thread():
+         try:
+             while self._running:
+                 incoming = self.listener.poll()
+                 if incoming is not None:
+                     self._dispatch(incoming)
+         except greenlet.GreenletExit:
+             return
  
  class Connection does a lot of work to ensure that operations on a connection can be recovered after a reconnection. But after we get the incoming message, a connection error on message acknowledgement can be raised and caught by excutils.forever_retry_uncaught_exceptions. At that point do_consume will be False, which means the connection will call drain_events without first re-registering the consumers on the queues. kombu.Connection.drain_events then establishes a new connection instead of raising a connection error.
  Kombu related code is listed below.
  def drain_events(self, **kwargs):
-     return self.transport.drain_events(self.connection, **kwargs)
+     return self.transport.drain_events(self.connection, **kwargs)
  
  @property
  def connection(self):
-     if not self._closed:
-         if not self.connected:
-             self.declared_entities.clear()
-             self._default_channel = None
-             self._connection = self._establish_connection()
-             self._closed = False
-         return self._connection
+     if not self._closed:
+         if not self.connected:
+             self.declared_entities.clear()
+             self._default_channel = None
+             self._connection = self._establish_connection()
+             self._closed = False
+         return self._connection
+ 
+ ---------------------------
+ 
+ [Impact]
+ 
+ This patch addresses an issue where the underlying kombu library disconnects from the rabbitmq-servers, which prevents oslo.messaging
+ from properly going through the reconnect sequence, including the recreation of expected queues. As a result, messages are lost and the cloud stays generally dysfunctional until services are restarted.
+ 
+ [Test Case]
+ 
+ Note: these steps are for trusty-icehouse, including the latest
+ oslo.messaging library (1.3.0-0ubuntu1.1 at the time of this writing).
+ 
+ Deploy an OpenStack cloud w/ multiple rabbit nodes and then abruptly
+ kill one of the rabbit nodes (e.g. force panic, etc). Observe that the
+ nova services do detect that the node went down and report that they are
+ reconnected, but messages are still reporting as timed out, nova
+ service-list still reports compute nodes as down, etc.
+ 
+ [Regression Potential]
+ 
+ There is the possibility that there will be more reconnect attempts from
+ the oslo.messaging library if there is a false positive in the
+ underlying kombu connection reported as disconnected. This should be
+ unlikely since this is bringing the oslo.messaging code into sync with
+ the underlying library, but it is a possibility.
+ 
+ [Other Info]
+ 
+ The attempt to drive reconnection logic was fixed in a recent SRU of
+ oslo.messaging (version 1.3.0-0ubuntu1.1). This is an additional fix
+ that is required to keep the oslo.messaging library from going into a
+ zombie-fied connection state.
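
For anyone trying to picture the failure mode described above, here is a minimal toy sketch. It does not use kombu or oslo.messaging, and it is not the actual patch. FakeConnection, LazyClient, consume_naive and consume_checked are hypothetical names invented for illustration. The only thing taken from the report is the shape of the lazy `connection` property, which silently re-establishes a connection so that drain_events never raises and the consumers registered on the old connection are lost.

```python
class FakeConnection:
    """Hypothetical stand-in for a broker connection (not a kombu class)."""

    def __init__(self, conn_id):
        self.id = conn_id
        self.connected = True
        self.consumers = []  # consumers exist only on this connection


class LazyClient:
    """Toy client whose `connection` property mimics the quoted one."""

    def __init__(self):
        self._connection = None
        self._count = 0

    @property
    def connection(self):
        # Silently (re)establish the connection when it is not connected.
        if self._connection is None or not self._connection.connected:
            self._count += 1
            self._connection = FakeConnection(self._count)
        return self._connection

    def register_consumer(self, queue):
        self.connection.consumers.append(queue)

    def drain_events(self):
        # Never raises on a dropped link; reports queues that can deliver.
        return list(self.connection.consumers)


def consume_naive(client, do_consume):
    # Mirrors the reported behaviour: register only when do_consume is True.
    if do_consume:
        client.register_consumer("compute")
    return client.drain_events()


def consume_checked(client, state):
    # Assumed mitigation: re-register whenever the connection object changed.
    conn = client.connection
    if state.get("conn_id") != conn.id:
        client.register_consumer("compute")
        state["conn_id"] = conn.id
    return client.drain_events()


# Naive path: consumers vanish after a silent reconnect.
naive = LazyClient()
naive_before = consume_naive(naive, do_consume=True)
naive.connection.connected = False  # simulate the failure during ack
naive_after = consume_naive(naive, do_consume=False)

# Checked path: the reconnect is noticed and the consumer is restored.
checked = LazyClient()
state = {}
checked_before = consume_checked(checked, state)
checked.connection.connected = False
checked_after = consume_checked(checked, state)

print(naive_before, naive_after, checked_before, checked_after)
```

In the naive path the second call drains a fresh connection with no consumers, which matches the "reconnected but messages still time out" symptom in the test case. The checked path is only one possible way to detect the stale state; the real fix may differ.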

** Also affects: oslo.messaging (Ubuntu)
   Importance: Undecided
       Status: New

** No longer affects: python-oslo.messaging (Ubuntu)

-- 
https://bugs.launchpad.net/bugs/1448650

Title:
  rpc.server do not consume messages after message acknowledge failure