[Bug 1538759] [NEW] When RabbitMQ cluster service restarts, other OpenStack services do not gracefully recover
Javier Diaz Jr
javierdiazcharles at gmail.com
Wed Jan 27 21:43:02 UTC 2016
Public bug reported:
In MOS 6.1:
When the RabbitMQ cluster recovers from a failure (whatever the cause may be), other OpenStack services such as the following had to be restarted as well to get our environment stable again:
- nova-conductor
- nova-scheduler
- nova-compute (on compute nodes)
- ceilometer-collector
As of now I believe the issue might be due to heartbeats being disabled.
If heartbeats are disabled and RabbitMQ goes down ungracefully, the nova
services have no way of knowing that RabbitMQ went down. When a socket
connection is cut off abruptly, the receiving side doesn't know that the
connection has dropped, so you can end up with a half-open connection.
The general solution for this in Linux is to turn on TCP keepalives, but
given that RabbitMQ has a heartbeat feature built in, I think enabling
heartbeats would be the way to go.
Perhaps building upon this bug would be a wise idea:
https://bugs.launchpad.net/fuel/+bug/1447559
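If heartbeats turn out to be the fix, the change would be purely at the
configuration level. Below is a rough sketch of what enabling them could
look like in nova.conf (and the other services' config files). Note this
is an assumption on my part: the [oslo_messaging_rabbit] section and the
heartbeat_timeout_threshold / heartbeat_rate option names exist in
oslo.messaging releases that ship heartbeat support (1.9.0 and later),
which the version in MOS 6.1 may not have without a backport:

    [oslo_messaging_rabbit]
    # Consider the broker connection dead if no heartbeat traffic is
    # seen within this many seconds (0 disables heartbeats).
    heartbeat_timeout_threshold = 60
    # Number of heartbeats to send within the timeout window.
    heartbeat_rate = 2

The kernel-level alternative mentioned above (TCP keepalives, tuned via
the net.ipv4.tcp_keepalive_* sysctls) only helps the OS notice the dead
peer; an AMQP-level heartbeat lets the messaging library itself detect
the failure and reconnect.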
Alternatively, I have found a solution that an escalations engineer
provided to a customer as a custom patch.
This patch would be applied to the oslo.utils library (on which
oslo.messaging depends) on compute nodes and controllers.
File to update: /usr/lib/python2.7/dist-packages/oslo.utils/excutils.py
This can be done by going to /usr/lib/python2.7/dist-
packages/oslo.utils/ and running patch < oslo_utils2.diff, as shown in
the sketch below. After that, restart the nova-compute service by
running /etc/init.d/nova-compute restart.
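For reference, the full workaround sequence on an affected node would
look roughly like this (assuming the attached oslo_utils2.diff has
already been copied into the oslo.utils directory; exact paths may
differ per distribution):

    cd /usr/lib/python2.7/dist-packages/oslo.utils/
    # Apply the attached diff, which modifies excutils.py
    patch < oslo_utils2.diff
    # Restart nova-compute so it picks up the patched library
    /etc/init.d/nova-compute restart

On controllers the same steps would presumably be repeated, restarting
the other affected services (nova-conductor, nova-scheduler,
ceilometer-collector) instead of nova-compute.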
My proposal is that we investigate the reasoning behind the first
solution (enabling heartbeats). Additionally, I think this patch should
make its way into MOS 6.1 MU5 or a later maintenance update.
** Affects: mos
Importance: Undecided
Status: New
** Tags: nova oslo.messaging
** Attachment added: "oslo-patch"
https://bugs.launchpad.net/bugs/1538759/+attachment/4557961/+files/oslo_utils2.diff
** Also affects: oslo.messaging (Ubuntu)
Importance: Undecided
Status: New
** No longer affects: oslo.messaging (Ubuntu)