[Bug 1740892] Re: corosync upgrade on 2018-01-02 caused pacemaker to fail

Nish Aravamudan nish.aravamudan at canonical.com
Mon Jan 8 18:35:12 UTC 2018


On Mon, Jan 8, 2018 at 8:48 AM, Victor Tapia <victor.tapia at canonical.com> wrote:
> As mentioned by Mario @ #10, stopping corosync while pacemaker runs
> throws the same error as the upgrade. Syslog from Xenial +
> corosync=2.3.5-3ubuntu1:
>
> Jan  8 16:24:37 xenial-corosync systemd[1]: Stopping Pacemaker High Availability Cluster Manager...
> Jan  8 16:24:37 xenial-corosync pacemakerd[28747]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync crmd[28753]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync crmd[28753]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
> Jan  8 16:24:37 xenial-corosync pengine[28752]:   notice: Delaying fencing operations until there are resources to manage
> Jan  8 16:24:37 xenial-corosync pengine[28752]:   notice: Scheduling Node xenial-corosync for shutdown
> Jan  8 16:24:37 xenial-corosync pengine[28752]:   notice: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-52.bz2
> Jan  8 16:24:37 xenial-corosync crmd[28753]:   notice: Transition 1 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-52.bz2): Complete
> Jan  8 16:24:37 xenial-corosync crmd[28753]:   notice: Disconnecting from Corosync
> Jan  8 16:24:37 xenial-corosync cib[28748]:  warning: new_event_notification (28748-28753-12): Broken pipe (32)
> Jan  8 16:24:37 xenial-corosync pengine[28752]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync attrd[28751]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync lrmd[28750]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync stonith-ng[28749]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync cib[28748]:   notice: Invoking handler for signal 15: Terminated
> Jan  8 16:24:37 xenial-corosync cib[28748]:   notice: Disconnecting from Corosync
> Jan  8 16:24:37 xenial-corosync cib[28748]:   notice: Disconnecting from Corosync
> Jan  8 16:24:37 xenial-corosync systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
>
>
> Pacemakerd shuts down by sending SIGTERM to its components, but after the install, corosync does not start pacemaker. BTW, "systemctl restart corosync" restarts both services perfectly.
>
> I think that option A from James Page (#11) is the way to go.

Hrm, sorry, the above isn't exactly what is relevant here. That is, if
you only issued a stop on corosync, then the above is fully correct,
and I'm not seeing the 'same' error you mention (which would be about
the library or so).

Based upon the unit files, it is fully correct that if you issue
`systemctl stop corosync`, you will see in the logs that pacemaker is
stopped first (the last line of the quoted syslog above shows this
happening). It is not expected that stopping corosync would cause
pacemaker to restart.
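
For reference, the dependency directives behind this can be checked
directly; a minimal sketch, assuming the stock Xenial units:

# Show the ordering/requirement directives of both units
systemctl cat corosync.service pacemaker.service \
    | grep -E '^(Wants|Requires|After|PartOf)='
# Expectation (my assumption, consistent with the behavior below):
# pacemaker.service pulls in corosync via After=/Requires=corosync.service,
# while corosync.service declares nothing back towards pacemaker.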

If you, in the above env, issue `systemctl start corosync`, does
pacemaker start? Here is what I see in a Xenial LXD:

root at splendid-viper:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor preset: enabled)
   Active: active (running) since Mon 2018-01-08 18:23:34 UTC; 18s ago
 Main PID: 3619 (corosync)
    Tasks: 2
   Memory: 42.8M
      CPU: 329ms
   CGroup: /system.slice/corosync.service
           └─3619 /usr/sbin/corosync -f

Jan 08 18:23:34 splendid-viper corosync[3619]:   [TOTEM ] Initializing
transport (UDP/IP Multicast).
Jan 08 18:23:34 splendid-viper corosync[3619]:   [TOTEM ] Initializing
transmit/receive security (NSS) crypto: none hash: none
Jan 08 18:23:34 splendid-viper corosync[3619]:   [TOTEM ] The network
interface [127.0.0.1] is now up.
Jan 08 18:23:34 splendid-viper corosync[3619]:   [QB    ] server name: cmap
Jan 08 18:23:34 splendid-viper corosync[3619]:   [QB    ] server name: cfg
Jan 08 18:23:34 splendid-viper corosync[3619]:   [QB    ] server name: cpg
Jan 08 18:23:34 splendid-viper corosync[3619]:   [QB    ] server name:
votequorum
Jan 08 18:23:34 splendid-viper corosync[3619]:   [QB    ] server name: quorum
Jan 08 18:23:34 splendid-viper corosync[3619]:   [TOTEM ] A new
membership (127.0.0.1:16) was formed. Members joined: 2130706433
Jan 08 18:23:34 splendid-viper systemd[1]: Started Corosync Cluster Engine.
root at splendid-viper:~# systemctl restart corosync
root at splendid-viper:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor preset: enabled)
   Active: active (running) since Mon 2018-01-08 18:24:10 UTC; 15s ago
 Main PID: 3640 (corosync)
    Tasks: 2
   Memory: 42.1M
      CPU: 263ms
   CGroup: /system.slice/corosync.service
           └─3640 /usr/sbin/corosync -f

Jan 08 18:24:10 splendid-viper corosync[3640]:   [TOTEM ] Initializing
transport (UDP/IP Multicast).
Jan 08 18:24:10 splendid-viper corosync[3640]:   [TOTEM ] Initializing
transmit/receive security (NSS) crypto: none hash: none
Jan 08 18:24:10 splendid-viper corosync[3640]:   [TOTEM ] The network
interface [127.0.0.1] is now up.
Jan 08 18:24:10 splendid-viper corosync[3640]:   [QB    ] server name: cmap
Jan 08 18:24:10 splendid-viper corosync[3640]:   [QB    ] server name: cfg
Jan 08 18:24:10 splendid-viper corosync[3640]:   [QB    ] server name: cpg
Jan 08 18:24:10 splendid-viper corosync[3640]:   [QB    ] server name:
votequorum
Jan 08 18:24:10 splendid-viper corosync[3640]:   [QB    ] server name: quorum
Jan 08 18:24:10 splendid-viper corosync[3640]:   [TOTEM ] A new
membership (127.0.0.1:20) was formed. Members joined: 2130706433
Jan 08 18:24:10 splendid-viper systemd[1]: Started Corosync Cluster Engine.
root at splendid-viper:~# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled;
vendor preset: enabled)
   Active: active (running) since Mon 2018-01-08 18:24:10 UTC; 19s ago
 Main PID: 3642 (pacemakerd)
    Tasks: 7
   Memory: 26.0M
      CPU: 1.161s
   CGroup: /system.slice/pacemaker.service
           ├─3642 /usr/sbin/pacemakerd -f
           ├─3643 /usr/lib/pacemaker/cib
           ├─3644 /usr/lib/pacemaker/stonithd
           ├─3645 /usr/lib/pacemaker/lrmd
           ├─3646 /usr/lib/pacemaker/attrd
           ├─3647 /usr/lib/pacemaker/pengine
           └─3648 /usr/lib/pacemaker/crmd

Jan 08 18:24:11 splendid-viper cib[3643]:   notice:
crm_update_peer_proc: Node (null)[2130706433] - state is now member
(was (null))
Jan 08 18:24:11 splendid-viper cib[3643]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice: Connecting to
cluster infrastructure: corosync
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice: Could not obtain
a node name for corosync nodeid 2130706433
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice: Quorum lost
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice:
pcmk_quorum_notification: Node splendid-viper[2130706433] - state is
now member (was (null))
Jan 08 18:24:12 splendid-viper cib[3643]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:24:12 splendid-viper crmd[3648]:   notice: State transition
S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_started ]

So restarting corosync does restart pacemaker here, presumably via the
Requires= dependency that pacemaker.service has on corosync.service.

root at splendid-viper:~# systemctl stop corosync
root at splendid-viper:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-01-08 18:25:28 UTC; 2s ago
  Process: 3640 ExecStart=/usr/sbin/corosync -f $OPTIONS (code=exited,
status=0/SUCCESS)
 Main PID: 3640 (code=exited, status=0/SUCCESS)

Jan 08 18:24:10 splendid-viper corosync[3640]:   [QB    ] server name: quorum
Jan 08 18:24:10 splendid-viper corosync[3640]:   [TOTEM ] A new
membership (127.0.0.1:20) was formed. Members joined: 2130706433
Jan 08 18:24:10 splendid-viper systemd[1]: Started Corosync Cluster Engine.
Jan 08 18:25:27 splendid-viper systemd[1]: Stopping Corosync Cluster Engine...
Jan 08 18:25:27 splendid-viper corosync[3640]:   [QB    ] withdrawing
server sockets
Jan 08 18:25:27 splendid-viper corosync[3640]:   [QB    ] withdrawing
server sockets
Jan 08 18:25:28 splendid-viper corosync[3640]:   [QB    ] withdrawing
server sockets
Jan 08 18:25:28 splendid-viper corosync[3640]:   [QB    ] withdrawing
server sockets
Jan 08 18:25:28 splendid-viper corosync[3640]:   [QB    ] withdrawing
server sockets
Jan 08 18:25:28 splendid-viper systemd[1]: Stopped Corosync Cluster Engine.
root at splendid-viper:~# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled;
vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-01-08 18:25:27 UTC; 4s ago
  Process: 3642 ExecStart=/usr/sbin/pacemakerd -f (code=exited,
status=0/SUCCESS)
 Main PID: 3642 (code=exited, status=0/SUCCESS)

Jan 08 18:25:27 splendid-viper crmd[3648]:   notice: Disconnecting from Corosync
Jan 08 18:25:27 splendid-viper cib[3643]:  warning:
new_event_notification (3643-3648-11): Broken pipe (32)
Jan 08 18:25:27 splendid-viper pengine[3647]:   notice: Invoking
handler for signal 15: Terminated
Jan 08 18:25:27 splendid-viper attrd[3646]:   notice: Invoking handler
for signal 15: Terminated
Jan 08 18:25:27 splendid-viper lrmd[3645]:   notice: Invoking handler
for signal 15: Terminated
Jan 08 18:25:27 splendid-viper stonith-ng[3644]:   notice: Invoking
handler for signal 15: Terminated
Jan 08 18:25:27 splendid-viper cib[3643]:   notice: Invoking handler
for signal 15: Terminated
Jan 08 18:25:27 splendid-viper cib[3643]:   notice: Disconnecting from Corosync
Jan 08 18:25:27 splendid-viper cib[3643]:   notice: Disconnecting from Corosync
Jan 08 18:25:27 splendid-viper systemd[1]: Stopped Pacemaker High
Availability Cluster Manager.

Stopping corosync stops pacemaker.

root at splendid-viper:~# systemctl start corosync
root at splendid-viper:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor preset: enabled)
   Active: active (running) since Mon 2018-01-08 18:25:56 UTC; 7s ago
 Main PID: 3663 (corosync)
    Tasks: 2
   Memory: 10.5M
      CPU: 110ms
   CGroup: /system.slice/corosync.service
           └─3663 /usr/sbin/corosync -f

Jan 08 18:25:56 splendid-viper corosync[3663]:   [TOTEM ] Initializing
transport (UDP/IP Multicast).
Jan 08 18:25:56 splendid-viper corosync[3663]:   [TOTEM ] Initializing
transmit/receive security (NSS) crypto: none hash: none
Jan 08 18:25:56 splendid-viper corosync[3663]:   [TOTEM ] The network
interface [127.0.0.1] is now up.
Jan 08 18:25:56 splendid-viper corosync[3663]:   [QB    ] server name: cmap
Jan 08 18:25:56 splendid-viper corosync[3663]:   [QB    ] server name: cfg
Jan 08 18:25:56 splendid-viper corosync[3663]:   [QB    ] server name: cpg
Jan 08 18:25:56 splendid-viper corosync[3663]:   [QB    ] server name:
votequorum
Jan 08 18:25:56 splendid-viper corosync[3663]:   [QB    ] server name: quorum
Jan 08 18:25:56 splendid-viper corosync[3663]:   [TOTEM ] A new
membership (127.0.0.1:24) was formed. Members joined: 2130706433
Jan 08 18:25:56 splendid-viper systemd[1]: Started Corosync Cluster Engine.
root at splendid-viper:~# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled;
vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-01-08 18:25:27 UTC; 30s ago
  Process: 3642 ExecStart=/usr/sbin/pacemakerd -f (code=exited,
status=0/SUCCESS)
 Main PID: 3642 (code=exited, status=0/SUCCESS)

Jan 08 18:25:27 splendid-viper crmd[3648]:   notice: Disconnecting from Corosync
Jan 08 18:25:27 splendid-viper cib[3643]:  warning:
new_event_notification (3643-3648-11): Broken pipe (32)
Jan 08 18:25:27 splendid-viper pengine[3647]:   notice: Invoking
handler for signal 15: Terminated
Jan 08 18:25:27 splendid-viper attrd[3646]:   notice: Invoking handler
for signal 15: Terminated
Jan 08 18:25:27 splendid-viper lrmd[3645]:   notice: Invoking handler
for signal 15: Terminated
Jan 08 18:25:27 splendid-viper stonith-ng[3644]:   notice: Invoking
handler for signal 15: Terminated
Jan 08 18:25:27 splendid-viper cib[3643]:   notice: Invoking handler
for signal 15: Terminated
Jan 08 18:25:27 splendid-viper cib[3643]:   notice: Disconnecting from Corosync
Jan 08 18:25:27 splendid-viper cib[3643]:   notice: Disconnecting from Corosync
Jan 08 18:25:27 splendid-viper systemd[1]: Stopped Pacemaker High
Availability Cluster Manager.

So is the package installation only doing a stop then start, rather
than a restart? I guess that's necessary because the unit file may
have changed on the package upgrade?
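
One way to check what the maintainer scripts actually do on upgrade, a
rough sketch assuming the usual dpkg layout under /var/lib/dpkg/info/:

# Look for the service stop/start handling in the corosync maintainer scripts
grep -nE 'systemctl|deb-systemd-invoke|invoke-rc\.d' \
    /var/lib/dpkg/info/corosync.prerm \
    /var/lib/dpkg/info/corosync.postinst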

Adding a "Wants=pacemaker.service" to corosync.service, I get the
expected behavior:

root at splendid-viper:~# systemctl daemon-reload
root at splendid-viper:~# systemctl stop corosync
root at splendid-viper:~# systemctl stop pacemaker
root at splendid-viper:~# systemctl start corosync
root at splendid-viper:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor preset: enabled)
   Active: active (running) since Mon 2018-01-08 18:32:48 UTC; 5s ago
 Main PID: 3795 (corosync)
    Tasks: 2
   Memory: 40.2M
      CPU: 199ms
   CGroup: /system.slice/corosync.service
           └─3795 /usr/sbin/corosync -f

Jan 08 18:32:48 splendid-viper corosync[3795]:   [TOTEM ] Initializing
transport (UDP/IP Multicast).
Jan 08 18:32:48 splendid-viper corosync[3795]:   [TOTEM ] Initializing
transmit/receive security (NSS) crypto: none hash: none
Jan 08 18:32:48 splendid-viper corosync[3795]:   [TOTEM ] The network
interface [127.0.0.1] is now up.
Jan 08 18:32:48 splendid-viper corosync[3795]:   [QB    ] server name: cmap
Jan 08 18:32:48 splendid-viper corosync[3795]:   [QB    ] server name: cfg
Jan 08 18:32:48 splendid-viper corosync[3795]:   [QB    ] server name: cpg
Jan 08 18:32:48 splendid-viper corosync[3795]:   [QB    ] server name:
votequorum
Jan 08 18:32:48 splendid-viper corosync[3795]:   [QB    ] server name: quorum
Jan 08 18:32:48 splendid-viper corosync[3795]:   [TOTEM ] A new
membership (127.0.0.1:36) was formed. Members joined: 2130706433
Jan 08 18:32:48 splendid-viper systemd[1]: Started Corosync Cluster Engine.
root at splendid-viper:~# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled;
vendor preset: enabled)
   Active: active (running) since Mon 2018-01-08 18:32:48 UTC; 8s ago
 Main PID: 3797 (pacemakerd)
    Tasks: 7
   Memory: 25.7M
      CPU: 1.135s
   CGroup: /system.slice/pacemaker.service
           ├─3797 /usr/sbin/pacemakerd -f
           ├─3798 /usr/lib/pacemaker/cib
           ├─3799 /usr/lib/pacemaker/stonithd
           ├─3800 /usr/lib/pacemaker/lrmd
           ├─3801 /usr/lib/pacemaker/attrd
           ├─3802 /usr/lib/pacemaker/pengine
           └─3803 /usr/lib/pacemaker/crmd

Jan 08 18:32:48 splendid-viper cib[3798]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:32:48 splendid-viper pengine[3802]:   notice: Additional
logging available in /var/log/pacemaker.log
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice: Connecting to
cluster infrastructure: corosync
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice: Could not obtain
a node name for corosync nodeid 2130706433
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice: Quorum lost
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice:
pcmk_quorum_notification: Node splendid-viper[2130706433] - state is
now member (was (null))
Jan 08 18:32:49 splendid-viper cib[3798]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice: Defaulting to
uname -n for the local corosync node name
Jan 08 18:32:49 splendid-viper crmd[3803]:   notice: State transition
S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_started ]
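
For completeness, a minimal sketch of how that Wants= can be added
without editing the shipped unit, using a standard systemd drop-in
(the drop-in file name here is just an example):

# Make starting corosync also pull in pacemaker
mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/pacemaker.conf <<'EOF'
[Unit]
Wants=pacemaker.service
EOF
systemctl daemon-reload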
