[Bug 1538759] [NEW] When RabbitMQ cluster service restarts, other OpenStack services do not gracefully recover
Javier Diaz Jr
javierdiazcharles at gmail.com
Wed Jan 27 21:43:02 UTC 2016
Public bug reported:
In MOS 6.1:
When the RabbitMQ cluster recovers from a failure (whatever the cause may be), other OpenStack services such as the following had to be restarted as well to get our environment stable again:
- nova-conductor
- nova-scheduler
- nova-compute (on compute nodes)
- ceilometer-collector
As of now I believe the issue might be due to heartbeats being disabled.
If heartbeats are disabled and RabbitMQ goes down ungracefully, the nova
services have no way of knowing that RabbitMQ went down. When a socket
connection is cut off abruptly, the receiving side doesn't know that the
connection has dropped, so you can end up with a half-open connection.
The general solution for this in Linux is to turn on TCP keepalives, but
given that RabbitMQ has a heartbeat feature built in, I think enabling
heartbeats would be the way to go.
Perhaps building upon this bug would be a wise idea:
https://bugs.launchpad.net/fuel/+bug/1447559
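If heartbeats turn out to be the fix, the change would be purely at the
configuration level. Below is a rough sketch of what enabling them could
look like in nova.conf (and the other services' config files). Note this
is an assumption on my part: the [oslo_messaging_rabbit] section and the
heartbeat_timeout_threshold / heartbeat_rate option names exist in
oslo.messaging releases that ship heartbeat support (1.9.0 and later),
which the version in MOS 6.1 may not have without a backport:

    [oslo_messaging_rabbit]
    # Consider the broker connection dead if no heartbeat traffic is
    # seen within this many seconds (0 disables heartbeats).
    heartbeat_timeout_threshold = 60
    # Number of heartbeats to send within the timeout window.
    heartbeat_rate = 2

The kernel-level alternative mentioned above (TCP keepalives, tuned via
the net.ipv4.tcp_keepalive_* sysctls) only helps the OS notice the dead
peer; an AMQP-level heartbeat lets the messaging library itself detect
the failure and reconnect.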
Alternatively, I have found a solution that an escalations engineer
provided to a customer as a custom patch.
This patch would be applied to the oslo.utils library (on which
oslo.messaging depends) on compute nodes and controllers.
File to update: /usr/lib/python2.7/dist-packages/oslo.utils/excutils.py
This can be done by going to /usr/lib/python2.7/dist-
packages/oslo.utils/ and running patch < oslo_utils2.diff, as shown in
the sketch below. After that, restart the nova-compute service by
running /etc/init.d/nova-compute restart.
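For reference, the full workaround sequence on an affected node would
look roughly like this (assuming the attached oslo_utils2.diff has
already been copied into the oslo.utils directory; exact paths may
differ per distribution):

    cd /usr/lib/python2.7/dist-packages/oslo.utils/
    # Apply the attached diff, which modifies excutils.py
    patch < oslo_utils2.diff
    # Restart nova-compute so it picks up the patched library
    /etc/init.d/nova-compute restart

On controllers the same steps would presumably be repeated, restarting
the other affected services (nova-conductor, nova-scheduler,
ceilometer-collector) instead of nova-compute.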
My proposal is that we investigate the reasoning behind the first
solution (enabling heartbeats). Additionally, I think this patch should
make its way into MOS 6.1 MU5 or a later maintenance update.
** Affects: mos
Importance: Undecided
Status: New
** Tags: nova oslo.messaging
** Attachment added: "oslo-patch"
https://bugs.launchpad.net/bugs/1538759/+attachment/4557961/+files/oslo_utils2.diff
** Also affects: oslo.messaging (Ubuntu)
Importance: Undecided
Status: New
** No longer affects: oslo.messaging (Ubuntu)