[Bug 1353473] [NEW] Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"

Fri Sep 12 02:11:58 UTC 2014

You have been subscribed to a public bug by Nobuto MURATA (nobuto):

[Impact]

 * When a user runs "crm node standby", lrmd can keep monitoring a
   resource that was stopped on the standby node, which causes
   "Failed actions" error messages.

[Test Case]

 * Run "crm node standby" and check that lrmd no longer monitors the
   resources stopped on the node set to stand-by.
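
   One way to check the outcome of this test is to look for "Failed
   actions" entries that name the standby node. The following is a minimal
   sketch, not part of the fix: it assumes the cluster status text (for
   example the one-shot output of "crm_mon -1") has already been captured
   as a string, and the function name failed_actions_on_node is
   hypothetical.

```python
import re

def failed_actions_on_node(status_text, node):
    """Return the failed-action names reported for `node` in cluster status text."""
    # Keep only the text after the "Failed actions:" heading, if present.
    _, found, section = status_text.partition("Failed actions:")
    if not found:
        return []
    # Entries are assumed to start like:
    #   haproxy_monitor_10000 (node=A1LB101, call=2332, ...
    pattern = re.compile(r"^\s*(\S+) \(node=([^,\s]+),", re.MULTILINE)
    return [name for name, n in pattern.findall(section) if n == node]

# Sample taken from the status output quoted in the original bug description.
sample = """Failed actions:
haproxy_monitor_10000 (node=A1LB101, call=2332, rc=7, status=complete, last-rc-change=Mon Jul 7 20:28:58 2014
, queued=0ms, exec=0ms
): not running
"""
print(failed_actions_on_node(sample, "A1LB101"))
```

   With the fix applied, the expectation is that this list stays empty for
   the node that was put into stand-by.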

[Regression Potential]

 * Users have already tested the fix and are running it in production.
 * Based on upstream fixes for lrmd monitoring.
 * Potential race conditions (based on upstream history).

[Other Info]

 * Original bug description:

----------------

The following situation was brought to my attention (~inaddy):

""""""

* Environment
Ubuntu 14.04 LTS
Pacemaker 1.1.10+git20130802-1ubuntu2

* Priority
High

* Issue
I used "crm node standby" and the resource (haproxy) was stopped successfully, but lrmd still monitors it and causes "Failed actions".

---------------------------------------
Node A1LB101 (167969461): standby
Online: [ A1LB102 ]

Resource Group: grpHaproxy
vip-internal (ocf::heartbeat:IPaddr2): Started A1LB102
vip-external (ocf::heartbeat:IPaddr2): Started A1LB102
vip-nfs (ocf::heartbeat:IPaddr2): Started A1LB102
vip-iscsi (ocf::heartbeat:IPaddr2): Started A1LB102
Resource Group: grpStonith1
prmStonith1-1 (stonith:external/stonith-helper): Started A1LB102
Clone Set: clnHaproxy [haproxy]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]
Clone Set: clnPing [ping]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]

Node Attributes:
* Node A1LB101:
* Node A1LB102:
+ default_ping_set : 400

Migration summary:
* Node A1LB101:
haproxy: migration-threshold=1 fail-count=18 last-failure='Mon Jul 7 20:28:58 2014'
* Node A1LB102:

Failed actions:
haproxy_monitor_10000 (node=A1LB101, call=2332, rc=7, status=complete, last-rc-change=Mon Jul 7 20:28:58 2014
, queued=0ms, exec=0ms
): not running
---------------------------------------

Excerpt from log (ha-log.node1):
Jul 7 20:28:50 A1LB101 crmd[6364]: notice: te_rsc_command: Initiating action 42: stop haproxy_stop_0 on A1LB101 (local)
Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0)
Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-haproxy_monitor_10000:1372 [ haproxy not running.\n ]

""""""
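
The symptom in the quoted log excerpt, a recurring monitor result arriving
after the stop was already confirmed, can be detected mechanically. The
following is a minimal sketch that assumes log lines in exactly the crmd
message format quoted above; the function name and the regular expressions
are illustrative and not part of Pacemaker.

```python
import re

# Patterns assume the crmd message formats shown in the quoted log excerpt.
STOP_CONFIRMED = re.compile(
    r"match_graph_event: Action (\S+)_stop_0 \(\d+\) confirmed on (\S+)")
MONITOR_EVENT = re.compile(
    r"process_lrm_event: (\S+)-(\S+)_monitor_(\d+):")

def monitors_after_stop(lines):
    """Return (node, resource, line) for monitor events logged after a confirmed stop."""
    stopped = set()  # (node, resource) pairs whose stop has been confirmed
    late = []
    for line in lines:
        m = STOP_CONFIRMED.search(line)
        if m:
            stopped.add((m.group(2), m.group(1)))
            continue
        m = MONITOR_EVENT.search(line)
        if m and (m.group(1), m.group(2)) in stopped:
            late.append((m.group(1), m.group(2), line))
    return late

# Two lines from the excerpt in the bug description (raw strings keep "\n" literal).
log = [
    r"Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0)",
    r"Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-haproxy_monitor_10000:1372 [ haproxy not running.\n ]",
]
for node, resource, _ in monitors_after_stop(log):
    print(node, resource)
```

This sketch never clears a (node, resource) pair, so a real check would
also need to forget the pair once the resource is started again.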

I haven't been able to reproduce this error so far, but the fix seems to
be a straightforward cherry-pick of the following upstream patch set:

48f90f6 Fix: services: Do not allow duplicate recurring op entries
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
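
The commit titles suggest two behaviours: a recurring operation that is
scheduled twice should end up as a single entry, and a stop should cancel
the resource's recurring operations before it runs. The following is a toy
Python model of that idea only. It is not the lrmd implementation, and
every class and method name in it is hypothetical.

```python
class ToyExecutor:
    """Toy model of recurring-operation bookkeeping; not Pacemaker code."""

    def __init__(self):
        # Key: (resource, action, interval_ms). One entry per key, so
        # scheduling the same recurring op twice merges into one entry.
        self.recurring = {}

    def schedule_recurring(self, resource, action, interval_ms, call_id):
        """Register a recurring op; return True if it merged with an existing one."""
        key = (resource, action, interval_ms)
        merged = key in self.recurring
        # In this toy model the newest call id simply replaces the old one.
        self.recurring[key] = call_id
        return merged

    def stop(self, resource):
        """Cancel all recurring ops for the resource before it is stopped."""
        cancelled = [k for k in self.recurring if k[0] == resource]
        for key in cancelled:
            del self.recurring[key]
        return cancelled

ex = ToyExecutor()
ex.schedule_recurring("haproxy", "monitor", 10000, call_id=1)
ex.schedule_recurring("haproxy", "monitor", 10000, call_id=2)  # duplicate, merged
ex.schedule_recurring("ping", "monitor", 10000, call_id=3)
ex.stop("haproxy")
print(sorted(ex.recurring))
```

In this model no haproxy monitor remains scheduled after the stop, which
is the behaviour the reported "Failed actions" entry shows to be missing.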

So I'm assuming (and testing right now) that this will fix the issue.
I'm opening this public bug for the fix I'll provide after testing, and
to ask others to test the fix as well.

** Affects: pacemaker (Ubuntu)
     Importance: Undecided
     Assignee: Rafael David Tinoco (inaddy)
         Status: Fix Released

** Affects: pacemaker (Debian)
     Importance: Unknown
         Status: New

-- 
https://bugs.launchpad.net/bugs/1353473
You received this bug notification because you are a member of Ubuntu Sponsors Team, which is subscribed to the bug report.