[Bug 1318441] Re: Precise corosync dies if failed_to_recv is set
Rafael David Tinoco
rafael.tinoco at canonical.com
Mon May 12 14:18:19 UTC 2014
######## Tests before the patch:
#
# NODE 1
#
--- MARKER --- ./failed-to-receive-crash.sh at 2014-05-09-17:33:04 --- MARKER ---
May 09 17:33:04 corosync [MAIN] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 09 17:33:04 corosync [MAIN] Corosync built-in features: nss
May 09 17:33:04 corosync [MAIN] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 09 17:33:04 corosync [TOTEM] Initializing transport (UDP/IP Multicast).
May 09 17:33:04 corosync [TOTEM] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 09 17:33:04 corosync [TOTEM] The network interface [192.168.168.1] is now up.
May 09 17:33:04 corosync [SERV] Service engine loaded: openais checkpoint service B.01.01
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync extended virtual synchrony service
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync configuration service
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync cluster closed process group service v1.01
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync cluster config database access v1.01
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync profile loading service
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync cluster quorum service v0.1
May 09 17:33:04 corosync [MAIN] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
May 09 17:33:04 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 09 17:33:04 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 09 17:33:04 corosync [MAIN] Completed service synchronization, ready to provide service.
May 09 17:33:05 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 09 17:33:05 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 09 17:33:05 corosync [MAIN] Completed service synchronization, ready to provide service.
May 09 17:33:10 corosync [TOTEM] FAILED TO RECEIVE
# COROSYNC HAS DIED BEFORE TEST CASE TRIES TO STOP IT
root at precise-cluster-01:~# ps -ef | grep corosync
root 1414 1306 0 17:31 pts/0 00:00:00 tail -f /var/log/cluster/corosync.log
root 4712 1306 0 17:33 pts/0 00:00:00 grep --color=auto corosync
######## Tests after the patch:
May 11 22:27:48 corosync [MAIN] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 11 22:27:48 corosync [MAIN] Corosync built-in features: nss
May 11 22:27:48 corosync [MAIN] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 11 22:27:48 corosync [TOTEM] Initializing transport (UDP/IP Multicast).
May 11 22:27:48 corosync [TOTEM] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 11 22:27:48 corosync [TOTEM] The network interface [192.168.168.1] is now up.
May 11 22:27:48 corosync [SERV] Service engine loaded: openais checkpoint service B.01.01
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync extended virtual synchrony service
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync configuration service
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync cluster closed process group service v1.01
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync cluster config database access v1.01
May 11 22:27:49 corosync [SERV] Service engine loaded: corosync profile loading service
May 11 22:27:49 corosync [SERV] Service engine loaded: corosync cluster quorum service v0.1
May 11 22:27:49 corosync [MAIN] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
May 11 22:27:49 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:49 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 11 22:27:49 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:49 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:49 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 11 22:27:49 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:54 corosync [TOTEM] FAILED TO RECEIVE
May 11 22:27:55 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:55 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:2 left:1)
May 11 22:27:55 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:57 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:57 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 11 22:27:57 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:59 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:28:01 corosync [TOTEM] FAILED TO RECEIVE
########
Unlike the first run, the corosync daemon stayed running, alternating
between a single-node membership and a two-node membership (when the
connection was restored and before it was broken again by the test case).
This is the expected and correct behavior for corosync.
** Description changed:
- If node detects itself not able to receive message it asserts the number
- of failed members considering itself and dies.
+ [Impact]
- -> Testing bugfix. To be released soon.
+ * Under certain conditions the corosync daemon may quit if it detects that
+ it is not able to receive messages. The logic asserts the existence of at
+ least one functional node, but the node has marked itself as a failed node
+ (not following the specification). It is safe to skip this assertion when
+ failed_to_recv is set (see the sketch after this description diff).
+
+ [Test Case]
+
+ * Using "corosync test suite" on precise-test machine:
+
+ - Make sure to set ssh keys so precise-test can access precise-cluster-{01,02}.
+ - Make sure only failed-to-receive-crash.sh is executable on "tests" dir.
+ - Make sure precise-cluster-{01,02} nodes have build-dep for corosync installed.
+ - sudo ./run-tests.sh -c flatiron -n "precise-cluster-01 precise-cluster-02"
+ - Check corosync log messages to see precise-cluster-01 corosync dieing.
+
+ [Regression Potential]
+
+ * We no longer assert the existence of at least one node in the corosync
+ cluster. Since there is always at least one node in the cluster (the node
+ itself), it is very unlikely that this change alters corosync's membership
+ logic. If it does, corosync will likely recover from the error and
+ re-establish a new membership (with one or more nodes).
+
+ [Other Info]
+
+ * n/a
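
For context on the check discussed above, here is a minimal, hypothetical
sketch of the consensus logic involved, assuming it mirrors
memb_consensus_agreed() in corosync's exec/totemsrp.c. The struct, field
and variable names below are invented for the illustration and this is not
the actual patch; only the general shape (skip the assertion once
failed_to_recv is set) reflects the change being tested.

/*
 * Simplified, hypothetical illustration of the consensus check discussed in
 * this bug. The real logic lives in corosync's totem membership code; the
 * names below are made up for the sketch.
 */
#include <assert.h>
#include <stdio.h>

struct fake_instance {
	int failed_to_recv;      /* set once the node logs "FAILED TO RECEIVE" */
	int token_memb_entries;  /* members surviving the failed-list filter */
};

static int consensus_agreed(struct fake_instance *inst)
{
	int agreed = 1;          /* assume the consensus scan itself succeeded */

	/*
	 * Pre-patch shape: the assertion below runs unconditionally. When the
	 * node has marked *itself* as failed, zero entries can remain and the
	 * assert aborts the daemon (the crash in the "before" log above).
	 *
	 * Post-patch shape: with failed_to_recv set, agreement is reported
	 * without insisting on surviving members, so the node falls back to a
	 * single-node ring and can later re-merge (the "after" log above).
	 */
	if (agreed && inst->failed_to_recv) {
		return agreed;
	}

	assert(inst->token_memb_entries >= 1);
	return agreed;
}

int main(void)
{
	struct fake_instance inst = { .failed_to_recv = 1,
	                              .token_memb_entries = 0 };

	/* With the guard in place this no longer trips the assertion. */
	printf("agreed = %d\n", consensus_agreed(&inst));
	return 0;
}

The point of the guard is that the assertion is skipped only when the node
already knows it cannot receive messages; ordinary memberships are still
checked exactly as before, which is why the regression potential is low.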
--
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1318441
Title:
Precise corosync dies if failed_to_recv is set
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1318441/+subscriptions