[Bug 1657444] Re: Can't failover when rabbit_hosts is configured as 3 hosts

Thu Jan 11 19:16:32 UTC 2018

** Description changed:

+ [Impact]
+ 
+ When the heartbeat connection times out, the timeout is not treated as
+ a recoverable error, so ensure_connection() is never called to
+ reconnect. This leaves the heartbeat thread retrying the same host over
+ and over again.
+ 
+ [Test Case]
+ 
+ * Deploy OpenStack
+   bzr branch lp:openstack-charm-testing
+   cd openstack-charm-testing
+   juju deployer -c default.yaml -d -v artful-pike
+   juju add-unit rabbitmq-server
+ * Force a timeout using iptables on a rabbitmq-server node
+   sudo iptables -I INPUT -p tcp --dport 5672 -j DROP
+ 
+ Expected result:
+ once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).
+ 
+ Actual result:
+ the heartbeat thread is left in a loop (connect, socket closed, retry, connect...)
+ 
+ [Regression Potential]
+ 
+ Without this patch, when the heartbeat connection times out, the
+ library does not attempt to connect to the next configured rabbit host.
+ The risk is that, in situations where daemons using this library
+ currently manage to reconnect to the same host (e.g. the host is
+ unreachable for only a few seconds), with this change they will
+ reconnect to the next host instead, so users may see connections
+ flapping between two (or more) rabbit hosts.
+ 
+ [Other Info]
  I have a rabbitmq cluster of 3 nodes

  root@47704165d2bb:/# rabbitmqctl cluster_status
  Cluster status of node rabbit@47704165d2bb ...
  [{nodes,[{disc,[rabbit@0482398a286e,rabbit@3709521b608a,
-                 rabbit@47704165d2bb]}]},
-  {running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
-  {cluster_name,<<"rabbit@47704165d2bb">>},
-  {partitions,[]},
-  {alarms,[{rabbit@0482398a286e,[]},
-           {rabbit@3709521b608a,[]},
-           {rabbit@47704165d2bb,[]}]}]
- root@47704165d2bb:/# rabbitmqctl list_policies      
+                 rabbit@47704165d2bb]}]},
+  {running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
+  {cluster_name,<<"rabbit@47704165d2bb">>},
+  {partitions,[]},
+  {alarms,[{rabbit@0482398a286e,[]},
+           {rabbit@3709521b608a,[]},
+           {rabbit@47704165d2bb,[]}]}]
+ root@47704165d2bb:/# rabbitmqctl list_policies
  Listing policies ...
  /       ha-all  all     ^ha\\.  {"ha-mode":"all"}       0
- 

  My oslo.messaging client configuration:
  [oslo_messaging_rabbit]
  rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
  rabbit_userid=cloud
  rabbit_password=cloud
  rabbit_ha_queues=True
  rabbit_retry_interval=1
  rabbit_retry_backoff=2
  rabbit_max_retries=0
  rabbit_durable_queues=False

  When I run "service rabbitmq-server stop" on one node to simulate a
  failure, I get the following error logs, and the consumer cannot fail
  over from the bad node. It keeps reconnecting to the failed node
  forever instead of trying the other nodes. "kombu_failover_strategy"
  is left at its default value of "round-robin".

- 
  2009-01-13 18:32:42.785 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
  2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
  2009-01-13 18:32:43.819 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
  2009-01-13 18:32:58.874 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
  2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
  2009-01-13 18:32:59.907 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...

- 
  Can anyone help me? Thanks!

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to python-oslo.messaging in Ubuntu.
https://bugs.launchpad.net/bugs/1657444

Title:
  Can't failover when rabbit_hosts is configured as 3 hosts

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive pike series:
  Triaged
Status in oslo.messaging:
  Fix Released
Status in python-oslo.messaging package in Ubuntu:
  Invalid
Status in python-oslo.messaging source package in Artful:
  Triaged

Bug description:
  [Impact]

  When the heartbeat connection times out, the timeout is not treated as
  a recoverable error, so ensure_connection() is never called to
  reconnect. This leaves the heartbeat thread retrying the same host
  over and over again.
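
  The intended behaviour can be sketched as follows. This is only an
  illustration under stated assumptions, not the oslo.messaging code:
  FakeConnection, heartbeat_check(), run_heartbeat() and the simulated
  set of dead hosts are hypothetical names made up for the example. The
  only parts taken from the description are the idea of calling
  ensure_connection() on a recoverable error and moving on to the next
  configured host.

```python
import itertools
import socket


class FakeConnection:
    """Hypothetical stand-in for a broker connection with several hosts."""

    def __init__(self, hosts, dead_hosts):
        self._hosts = itertools.cycle(hosts)  # round-robin over configured hosts
        self._dead = set(dead_hosts)          # hosts that simulate a timeout
        self.current = next(self._hosts)

    def heartbeat_check(self):
        # Simulate a heartbeat that times out when the current host is down.
        if self.current in self._dead:
            raise socket.timeout("heartbeat timed out on %s" % self.current)

    def ensure_connection(self):
        # Move on to the next configured host instead of retrying the same one.
        self.current = next(self._hosts)


def run_heartbeat(conn, max_attempts):
    """Treat timeouts as recoverable: reconnect, then check again."""
    visited = [conn.current]
    for _ in range(max_attempts):
        try:
            conn.heartbeat_check()
            return visited  # heartbeat succeeded on conn.current
        except (socket.timeout, OSError):
            conn.ensure_connection()
            visited.append(conn.current)
    raise RuntimeError("no reachable host after %d attempts" % max_attempts)


hosts = ["120.0.0.56:5671", "120.0.0.57:5671", "120.0.0.55:5671"]
conn = FakeConnection(hosts, dead_hosts={"120.0.0.56:5671"})
visited = run_heartbeat(conn, max_attempts=len(hosts))
print(visited)  # hosts tried, in order
```

  With the first host marked as dead, the loop leaves it after one
  failed heartbeat and settles on the second host, which is the
  behaviour described under "Expected result".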

  [Test Case]

  * Deploy OpenStack
    bzr branch lp:openstack-charm-testing
    cd openstack-charm-testing
    juju deployer -c default.yaml -d -v artful-pike
    juju add-unit rabbitmq-server
  * Force a timeout using iptables on a rabbitmq-server node
    sudo iptables -I INPUT -p tcp --dport 5672 -j DROP
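
  To confirm from a client machine that the DROP rule took effect, a
  plain TCP connect with a short timeout is usually enough. The helper
  below is a sketch using only the Python standard library; probe() and
  its return strings are hypothetical names chosen for this example,
  and the address to test has to be supplied by whoever runs it.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect and report 'open', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped by a DROP rule.
        return "timeout"
    except OSError:
        # Connection refused, host unreachable, name resolution failure, ...
        return "error"
```

  For instance, probe("<rabbit host address>", 5672) would be expected
  to move from "open" to "timeout" once the iptables rule is inserted
  on that node, assuming nothing else sits between client and broker.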

  Expected result:
  once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).

  Actual result:
  the heartbeat thread is left in a loop (connect, socket closed, retry, connect...)

  [Regression Potential]

  Without this patch, when the heartbeat connection times out, the
  library does not attempt to connect to the next configured rabbit
  host. The risk is that, in situations where daemons using this library
  currently manage to reconnect to the same host (e.g. the host is
  unreachable for only a few seconds), with this change they will
  reconnect to the next host instead, so users may see connections
  flapping between two (or more) rabbit hosts.

  [Other Info]
  I have a rabbitmq cluster of 3 nodes

  root@47704165d2bb:/# rabbitmqctl cluster_status
  Cluster status of node rabbit@47704165d2bb ...
  [{nodes,[{disc,[rabbit@0482398a286e,rabbit@3709521b608a,
                  rabbit@47704165d2bb]}]},
   {running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
   {cluster_name,<<"rabbit@47704165d2bb">>},
   {partitions,[]},
   {alarms,[{rabbit@0482398a286e,[]},
            {rabbit@3709521b608a,[]},
            {rabbit@47704165d2bb,[]}]}]
  root@47704165d2bb:/# rabbitmqctl list_policies
  Listing policies ...
  /       ha-all  all     ^ha\\.  {"ha-mode":"all"}       0

  My oslo.messaging client configuration:
  [oslo_messaging_rabbit]
  rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
  rabbit_userid=cloud
  rabbit_password=cloud
  rabbit_ha_queues=True
  rabbit_retry_interval=1
  rabbit_retry_backoff=2
  rabbit_max_retries=0
  rabbit_durable_queues=False
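
  For reference, the rabbit_hosts value above is a comma-separated list
  of host:port pairs. The sketch below shows how such a value splits
  into candidate endpoints. It uses the standard-library configparser
  purely for illustration (oslo.messaging reads its options through
  oslo.config), parse_rabbit_hosts() is a hypothetical helper, and it
  assumes every entry carries an explicit port.

```python
import configparser

SAMPLE = """
[oslo_messaging_rabbit]
rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
"""


def parse_rabbit_hosts(text):
    """Return [(host, port), ...] from the rabbit_hosts option."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("oslo_messaging_rabbit", "rabbit_hosts")
    endpoints = []
    for entry in raw.split(","):
        # Split on the last colon so the port is always the final field.
        host, _, port = entry.strip().rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


endpoints = parse_rabbit_hosts(SAMPLE)
print(endpoints)
```

  All three endpoints should be eligible for failover, which is why the
  repeated attempts against a single address in the logs are
  unexpected.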

  When I run "service rabbitmq-server stop" on one node to simulate a
  failure, I get the following error logs, and the consumer cannot fail
  over from the bad node. It keeps reconnecting to the failed node
  forever instead of trying the other nodes. "kombu_failover_strategy"
  is left at its default value of "round-robin".

  2009-01-13 18:32:42.785 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
  2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
  2009-01-13 18:32:43.819 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
  2009-01-13 18:32:58.874 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
  2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
  2009-01-13 18:32:59.907 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
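
  A quick way to check whether a service is stuck on one broker is to
  count the "Unable to connect" errors per address. The snippet below
  is a rough sketch: failed_hosts() and the regular expression are
  hypothetical and only match the message format shown in the two
  ERROR lines quoted above.

```python
import re
from collections import Counter

LOG = """\
2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
"""

# Capture the host:port that follows "AMQP server on".
PATTERN = re.compile(r"Unable to connect to AMQP server on (\S+) after")


def failed_hosts(log_text):
    """Count 'Unable to connect' errors per host:port."""
    return Counter(PATTERN.findall(log_text))


counts = failed_hosts(LOG)
print(counts)
```

  If failover were working, the counter would be expected to show more
  than one address over time; here every failure points at
  120.0.0.56:5671.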

  Can anyone help me? Thanks!
