[Bug 1312156] [NEW] [Precise] Potential for data corruption

Launchpad Bug Tracker <1312156@bugs.launchpad.net>
Tue May 6 22:20:46 UTC 2014


You have been subscribed to a public bug by Rafael David Tinoco (inaddy):

[Impact]

 * The Pacemaker designated controller (DC) can make wrong decisions
based on an uncleared node status section in a rare, specific situation.
This situation can cause the same resource to start on two nodes at the
same time, resulting in data corruption.

[Test Case]

 * The bug is very hard to trigger; all of the following must hold (a
simplified sketch of the corresponding checks follows the list):

1) Stonith successfully fenced a node (any node was fenced).
2) The fencing target and origin are the same (the node killed itself).
3) We do not have a DC, or the fenced node is our DC (our DC killed itself).
4) The executioner is not this node (requires at least 3 nodes).
5) This node is elected the new DC at any time in the future.
6) The policy engine has not yet been scheduled.
7) The DC takeover runs before the policy engine.
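
 * A minimal standalone model of the buggy deferral (names and types are
illustrative, not verbatim pacemaker source; the real check lives in
tengine_stonith_notify(), introduced by commit 82aa2d8d17):

    /* Model of the deferral: conditions 1-4 above gate whether the
       fenced node is queued for a status-section cleanup that a later
       DC takeover will perform. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct stonith_event {
        int result;              /* 0 on success (condition 1) */
        const char *target;      /* node that was fenced */
        const char *origin;      /* node that requested the fencing */
        const char *executioner; /* node that carried out the fencing */
    };

    static const char *local_node = "node3"; /* this node */
    static const char *our_dc = "node1";     /* NULL if no DC is known */
    static char *cleanup_list[16];
    static int cleanup_len = 0;

    static void stonith_notify(const struct stonith_event *ev)
    {
        if (ev->result == 0                                  /* 1 */
            && strcmp(ev->target, ev->origin) == 0           /* 2 */
            && (our_dc == NULL
                || strcmp(our_dc, ev->target) == 0)          /* 3 */
            && strcmp(ev->executioner, local_node) != 0) {   /* 4 */
            /* Bug: the target is queued unconditionally; if this node
               later becomes DC (5) and takeover runs before the policy
               engine (6, 7), the queued node's status section is erased
               even though the node may have rejoined by then. */
            cleanup_list[cleanup_len++] = strdup(ev->target);
            printf("deferred cleanup of %s until we become DC\n",
                   ev->target);
        }
    }

    int main(void)
    {
        /* node1 (our DC) asked to be fenced; node2 carried it out. */
        struct stonith_event ev = { 0, "node1", "node1", "node2" };
        stonith_notify(&ev);
        return 0;
    }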

 * The bug could not be reproduced so far: the patch was made based on a
community report analyzed by the upstream developer Andrew Beekhof:
https://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg19509.html

[Regression Potential]

 * With the logic before upstream commit 82aa2d8d17, the node responsible
for fencing the DC (the executioner) was also responsible for updating
the CIB. If this update failed (because the executioner itself failed,
for example), the DC would be fenced a second time, since the cluster
would not know the fencing result. Commit 82aa2d8d17 introduced logic to
avoid this second DC fencing; that logic is itself buggy.

 * To minimize regression risk, instead of moving forward to newer
pacemaker versions, it was decided to go backwards and remove only this
piece of code, as sketched below.
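
 * In terms of the model above, the revert leaves the notify handler
only recording the result (again illustrative, not the actual patch;
reuses struct stonith_event from the earlier sketch):

    /* After the revert no cleanup is deferred, so a later DC takeover
       has nothing to erase. */
    static void stonith_notify_reverted(const struct stonith_event *ev)
    {
        if (ev->result == 0) {
            printf("peer %s was terminated by %s: OK\n",
                   ev->target, ev->executioner);
        } else {
            printf("peer %s could not be terminated by %s\n",
                   ev->target, ev->executioner);
        }
        /* Worst case is the pre-82aa2d8d17 one: if the executioner dies
           before the CIB update lands, the DC is fenced a second time. */
    }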

 * For an SRU it is much more acceptable to restore the old behavior,
known to be safe even if it implies fencing the DC twice, than to
backport several pieces of code to implement logic that was not present
in the stable release.

[Other Info / Original Description]

Under certain conditions, faulty logic in the function
tengine_stonith_notify() can incorrectly add a successfully fenced node
to a list, causing Pacemaker to subsequently erase that node's status
section when the next DC (Designated Controller) election occurs. With
the status section erased, the cluster considers the node down and
starts the corresponding services on other nodes. Multiple instances of
the same service can cause data corruption.

Conditions:

1. The fenced node must have been the previous DC and sufficiently functional to request its own fencing.
2. The fencing notification must arrive after the new DC has been elected, but before it invokes the policy engine (sketched below).
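
A minimal model of what happens at that point, continuing the sketch
from the [Test Case] section (illustrative only; the real crmd erases
the node's state from the CIB status section):

    /* Model of the DC-takeover step: any node still on the cleanup list
       has its status section erased, so the cluster treats it as down
       and restarts its resources elsewhere. */
    static void dc_takeover(void)
    {
        for (int i = 0; i < cleanup_len; i++) {
            printf("erasing status section of %s\n", cleanup_list[i]);
            /* Combined with the bogus deferral, a live node can be
               declared down here and its services started a second
               time on another node. */
            free(cleanup_list[i]);
        }
        cleanup_len = 0;
    }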

Pacemaker versions affected:

1.1.6 - 1.1.9

Stable Ubuntu releases affected:

Ubuntu 12.04 LTS
Ubuntu 12.10 (EOL?)

Fix:

https://github.com/ClusterLabs/pacemaker/commit/f30e1e43

References:

https://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg19509.html
http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/

** Affects: pacemaker (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: pacemaker (Ubuntu Precise)
     Importance: Medium
     Assignee: Rafael David Tinoco (inaddy)
         Status: In Progress


** Tags: precise quantal
-- 
[Precise] Potential for data corruption
https://bugs.launchpad.net/bugs/1312156