[Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

Felipe Reyes 1439649 at bugs.launchpad.net
Wed May 6 21:43:45 UTC 2015


I'm seeing this problem in another environment with a similar deployment
(3 lxc containers):

Apr 20 16:39:26 juju-machine-3-lxc-4 crm_verify[31774]:   notice: crm_log_args: Invoked: crm_verify -V -p 
Apr 20 16:39:27 juju-machine-3-lxc-4 cibadmin[31786]:   notice: crm_log_args: Invoked: cibadmin -p -P 
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]:    error: cib_cs_destroy: Corosync connection lost!  Exiting.
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]:    error: crmd_quorum_destroy: connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]:   notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]:  warning: qb_ipcs_event_sendv: new_event_notification (782-785-6): Bad file descriptor (9)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]:  warning: send_client_notify: Notification of client crmd/8ad990ba-cf09-4ba3-b74b-a7d05d377a1b failed
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]:    error: crm_abort: crm_glib_handler: Forked child 760 to record non-fatal assert at logging.c:63 : Source ID 4601370 was not found when attempting to remove it
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:    error: pcmk_child_exit: Child process cib (780) exited: Invalid argument (22)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:   notice: pcmk_process_exit: Respawning failed child process: cib
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:    error: pcmk_child_exit: Child process crmd (785) exited: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:   notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:     crit: attrd_cs_destroy: Lost connection to Corosync service!
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:   notice: main: Exiting...
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:   notice: main: Disconnecting client 0x7ff985e478e0, pid=785...
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:    error: mcp_cpg_destroy: Connection destroyed
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:    error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:    debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]:    debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]:   notice: main: CRM Git Version: 42f2063
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]:    error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:     crit: cib_init: Cannot sign in to the cluster... terminating
Apr 20 16:50:02 juju-machine-3-lxc-4 crmd[767]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 20 16:50:05 juju-machine-3-lxc-4 crmd[767]:  warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
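
To make an excerpt like the one above easier to scan, here is a minimal
sketch that counts error/crit messages per daemon and PID. The regular
expression is only an assumption based on the line layout shown in this
log, and the names parse_line and count_by_daemon are hypothetical
helpers, not part of pacemaker or corosync.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt in this report:
#   "<Mon> <day> <HH:MM:SS> <host> <daemon>[<pid>]: <level>: <function>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<level>\w+):\s+"
    r"(?P<func>\w+):\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_daemon(lines, levels=("error", "crit")):
    """Count lines at the given severity levels, keyed by (daemon, pid)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] in levels:
            counts[(record["daemon"], record["pid"])] += 1
    return counts


# Four lines copied from the excerpt in this report.
SAMPLE = [
    "Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)",
    "Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]:    error: cib_cs_destroy: Corosync connection lost!  Exiting.",
    "Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:   notice: pcmk_process_exit: Respawning failed child process: cib",
    "Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:     crit: cib_init: Cannot sign in to the cluster... terminating",
]

if __name__ == "__main__":
    for (daemon, pid), n in sorted(count_by_daemon(SAMPLE).items()):
        print(f"{daemon}[{pid}]: {n}")
```

Keying on the PID as well as the daemon name keeps the original cib (780)
separate from the respawned one (761).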

These are the only processes running on one of the nodes:

root       782  0.0  0.0  81464  1828 ?        Ss   Feb12  25:13 /usr/lib/pacemaker/lrmd
haclust+   784  0.0  0.0  73920   776 ?        Ss   Feb12   8:25 /usr/lib/pacemaker/pengine
root       780  0.8  0.0 130256  4152 ?        Ssl  16:50   0:00 /usr/sbin/corosync
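
For comparison, a rough sketch that reports which daemons are absent
from a ps listing like the one above. The EXPECTED set is an assumption
built from the log tags and ps output in this report; a log tag does not
necessarily match the executable's basename (stonith-ng may be such a
case), so treat the result as a hint only. running_commands is a
hypothetical helper.

```python
import os

# Assumption: names collected from the log tags and the ps output in this
# report. A tag may differ from the binary's basename on a given install.
EXPECTED = {
    "pacemakerd", "cib", "stonith-ng", "lrmd",
    "attrd", "pengine", "crmd", "corosync",
}


def running_commands(ps_lines):
    """Return the basenames of the commands in `ps aux`-style lines."""
    names = set()
    for line in ps_lines:
        # `ps aux` prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) == 11:
            executable = fields[10].split()[0]
            names.add(os.path.basename(executable))
    return names


# The three lines from the listing in this report.
PS_OUTPUT = [
    "root       782  0.0  0.0  81464  1828 ?        Ss   Feb12  25:13 /usr/lib/pacemaker/lrmd",
    "haclust+   784  0.0  0.0  73920   776 ?        Ss   Feb12   8:25 /usr/lib/pacemaker/pengine",
    "root       780  0.8  0.0 130256  4152 ?        Ssl  16:50   0:00 /usr/sbin/corosync",
]

if __name__ == "__main__":
    missing = sorted(EXPECTED - running_commands(PS_OUTPUT))
    print("not running:", ", ".join(missing))
```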


A possible explanation could be: http://thread.gmane.org/gmane.linux.highavailability.corosync/592/focus=639

I only have logs for one of the nodes. I'm trying to get logs from the
other 2 nodes to better understand what was happening with the
communication.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1439649

Title:
  Pacemaker unable to communicate with corosync on restart under lxc

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1439649/+subscriptions


