[Bug 1318441] Re: Precise corosync dies if failed_to_recv is set
Rafael David Tinoco
rafael.tinoco at canonical.com
Mon May 12 14:18:19 UTC 2014
######## Tests before the patch:
#
# NODE 1
#
--- MARKER --- ./failed-to-receive-crash.sh at 2014-05-09-17:33:04 --- MARKER ---
May 09 17:33:04 corosync [MAIN] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 09 17:33:04 corosync [MAIN] Corosync built-in features: nss
May 09 17:33:04 corosync [MAIN] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 09 17:33:04 corosync [TOTEM] Initializing transport (UDP/IP Multicast).
May 09 17:33:04 corosync [TOTEM] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 09 17:33:04 corosync [TOTEM] The network interface [192.168.168.1] is now up.
May 09 17:33:04 corosync [SERV] Service engine loaded: openais checkpoint service B.01.01
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync extended virtual synchrony service
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync configuration service
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync cluster closed process group service v1.01
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync cluster config database access v1.01
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync profile loading service
May 09 17:33:04 corosync [SERV] Service engine loaded: corosync cluster quorum service v0.1
May 09 17:33:04 corosync [MAIN] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
May 09 17:33:04 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 09 17:33:04 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 09 17:33:04 corosync [MAIN] Completed service synchronization, ready to provide service.
May 09 17:33:05 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 09 17:33:05 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 09 17:33:05 corosync [MAIN] Completed service synchronization, ready to provide service.
May 09 17:33:10 corosync [TOTEM] FAILED TO RECEIVE
# COROSYNC HAS DIED BEFORE TEST CASE TRIES TO STOP IT
root at precise-cluster-01:~# ps -ef | grep corosync
root 1414 1306 0 17:31 pts/0 00:00:00 tail -f /var/log/cluster/corosync.log
root 4712 1306 0 17:33 pts/0 00:00:00 grep --color=auto corosync
######## Tests after the patch:
May 11 22:27:48 corosync [MAIN] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 11 22:27:48 corosync [MAIN] Corosync built-in features: nss
May 11 22:27:48 corosync [MAIN] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 11 22:27:48 corosync [TOTEM] Initializing transport (UDP/IP Multicast).
May 11 22:27:48 corosync [TOTEM] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 11 22:27:48 corosync [TOTEM] The network interface [192.168.168.1] is now up.
May 11 22:27:48 corosync [SERV] Service engine loaded: openais checkpoint service B.01.01
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync extended virtual synchrony service
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync configuration service
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync cluster closed process group service v1.01
May 11 22:27:48 corosync [SERV] Service engine loaded: corosync cluster config database access v1.01
May 11 22:27:49 corosync [SERV] Service engine loaded: corosync profile loading service
May 11 22:27:49 corosync [SERV] Service engine loaded: corosync cluster quorum service v0.1
May 11 22:27:49 corosync [MAIN] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
May 11 22:27:49 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:49 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 11 22:27:49 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:49 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:49 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 11 22:27:49 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:54 corosync [TOTEM] FAILED TO RECEIVE
May 11 22:27:55 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:55 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:2 left:1)
May 11 22:27:55 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:57 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:27:57 corosync [CPG] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 11 22:27:57 corosync [MAIN] Completed service synchronization, ready to provide service.
May 11 22:27:59 corosync [TOTEM] A processor joined or left the membership and a new membership was formed.
May 11 22:28:01 corosync [TOTEM] FAILED TO RECEIVE
########
Unlike the first run, the corosync daemon stayed running, alternating
between a single-node membership and a two-node membership (when the
connection was restored and before it was broken again by the test case).
This is the expected and correct behavior for corosync.
** Description changed:
- If node detects itself not able to receive message it asserts the number
- of failed members considering itself and dies.
+ [Impact]
- -> Testing bugfix. To be released soon.
+ * Under certain conditions the corosync daemon may quit if it detects that
+ it is not able to receive messages. The logic asserts the existence of at
+ least one functional node, but the node has marked itself as a failed node
+ (not following the specification). It is safe to skip this assertion when
+ failed_to_recv is set (see the sketch after this description diff).
+
+ [Test Case]
+
+ * Using "corosync test suite" on precise-test machine:
+
+ - Make sure to set ssh keys so precise-test can access precise-cluster-{01,02}.
+ - Make sure only failed-to-receive-crash.sh is executable on "tests" dir.
+ - Make sure precise-cluster-{01,02} nodes have build-dep for corosync installed.
+ - sudo ./run-tests.sh -c flatiron -n "precise-cluster-01 precise-cluster-02"
+ - Check corosync log messages to see precise-cluster-01 corosync dieing.
+
+ [Regression Potential]
+
+ * We no longer assert the existence of at least one node in the corosync
+ cluster. Since there is always at least one node in the cluster (the node
+ itself), it is very unlikely that this change alters corosync's membership
+ logic. If it does, corosync will likely recover from the error and
+ re-establish a new membership (with one or more nodes).
+
+ [Other Info]
+
+ * n/a
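
For context on the check discussed above, here is a minimal, hypothetical
sketch of the consensus logic involved, assuming it mirrors
memb_consensus_agreed() in corosync's exec/totemsrp.c. The struct, field
and variable names below are invented for the illustration and this is not
the actual patch; only the general shape (skip the assertion once
failed_to_recv is set) reflects the change being tested.

/*
 * Simplified, hypothetical illustration of the consensus check discussed in
 * this bug. The real logic lives in corosync's totem membership code; the
 * names below are made up for the sketch.
 */
#include <assert.h>
#include <stdio.h>

struct fake_instance {
	int failed_to_recv;      /* set once the node logs "FAILED TO RECEIVE" */
	int token_memb_entries;  /* members surviving the failed-list filter */
};

static int consensus_agreed(struct fake_instance *inst)
{
	int agreed = 1;          /* assume the consensus scan itself succeeded */

	/*
	 * Pre-patch shape: the assertion below runs unconditionally. When the
	 * node has marked *itself* as failed, zero entries can remain and the
	 * assert aborts the daemon (the crash in the "before" log above).
	 *
	 * Post-patch shape: with failed_to_recv set, agreement is reported
	 * without insisting on surviving members, so the node falls back to a
	 * single-node ring and can later re-merge (the "after" log above).
	 */
	if (agreed && inst->failed_to_recv) {
		return agreed;
	}

	assert(inst->token_memb_entries >= 1);
	return agreed;
}

int main(void)
{
	struct fake_instance inst = { .failed_to_recv = 1,
	                              .token_memb_entries = 0 };

	/* With the guard in place this no longer trips the assertion. */
	printf("agreed = %d\n", consensus_agreed(&inst));
	return 0;
}

The point of the guard is that the assertion is skipped only when the node
already knows it cannot receive messages; ordinary memberships are still
checked exactly as before, which is why the regression potential is low.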
--
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1318441
Title:
Precise corosync dies if failed_to_recv is set
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1318441/+subscriptions