[Bug 1412962] Re: Pacemaker (stonith) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it
Chris J Arges
1412962 at bugs.launchpad.net
Wed Feb 4 22:03:34 UTC 2015
Sponsored for Vivid/Utopic/Trusty.
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1412962
Title:
Pacemaker (stonith) can seg fault in Trusty and Utopic after following
message: Source ID XX was not found when attempting to remove it
Status in pacemaker package in Ubuntu:
In Progress
Status in pacemaker source package in Trusty:
New
Status in pacemaker source package in Utopic:
New
Bug description:
[IMPACT]
- Pacemaker can seg fault (stonith and lrmd) because:
- Newer glib versions use a hash table to find GSources
- Glib can raise a critical error when the same source is removed more than once
[TEST CASE]
- Described by user
[REGRESSION POTENTIAL]
- Based on small fixes made by upstream commits
- User reports problem has been fixed
[OTHER INFO]
The following situation was brought to my attention:
"""
lrmd process crashed when repeating "crm node standby" and "crm node online"
Pacemaker could also seg fault (stonith) under some conditions. This problem
came up while I was working on the following bug:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/
So you can check the problem here:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/34
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/35
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/36
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/37
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/38
And a possible explanation here:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/39
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/40
(Copy and pasting here):
So the cherry-pick (for version
trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
commit) seems ok, since it stops lrmd (services, services_linux) from
removing a repeat timer again when its source was already removed from
the glib main loop context:
example:
+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;
etc...
This actually solved lrmd crashes I was getting with the testcase
(explained inside this bug summary).
===
Explanation:
g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===
Analyzing your crash file (from stonith and not lrm), it looks like we
have the following scenario:
==============
exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
|_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> calls g_source_remove 2 times
|_> stonith_action_clear_tracking_data(action);
|_> g_source_remove(action->timer_sigterm);
|_> g_critical ("Source ID %u was not found when attempting to remove it", tag);
WHERE
==============
Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"
This is the script responsible for monitoring a stonith resource, and it
has returned (triggering the monitor callback) with the following data:
------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------
OBS: This means that fence_legacy returned, after checking that
st_kjpnode2 was ok, and its cleanup operation (callback) caused
the problem we faced.
As soon as it dies, the callback for this process is called:
if (child->callback) {
child->callback(child, child->pid, core, signo, exitcode);
In our case, callback is:
0x7f1f6189cec0 <stonith_action_async_done> which calls
0x7f1f6189af10 <stonith_action_destroy> and then
0x7f1f6189ae60 <stonith_action_clear_tracking_data> generating the 2nd removal (g_source_remove)
With the 2nd call to g_source_remove, after the glib2.0 change explained
before this comment, we get a
g_critical ("Source ID %u was not found when attempting to remove it",
tag);
and this generates the crash (since g_log is called with a critical
log level, causing crm_abort to be called).
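The escalation described above (a critical-level log message ending in an abort) can be sketched without glib. This is only an illustration, assuming the behaviour stated in this comment: demo_level_t, demo_log_handler and demo_abort are hypothetical stand-ins, not glib or Pacemaker functions, and demo_abort only counts requests instead of killing the process.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical log levels, loosely modelled on glib's warning/critical split. */
typedef enum { DEMO_LEVEL_WARNING, DEMO_LEVEL_CRITICAL } demo_level_t;

/* Number of times the handler asked for an abort. */
static int abort_requests = 0;

/* Stand-in for crm_abort(): records the request instead of aborting,
 * so the sketch can be exercised safely. */
static void demo_abort(const char *message)
{
    abort_requests++;
    fprintf(stderr, "abort requested: %s\n", message);
}

/* Stand-in for the application's log handler: warnings are only printed,
 * critical messages are escalated to the abort path. */
static void demo_log_handler(demo_level_t level, const char *message)
{
    assert(message != NULL);
    if (level == DEMO_LEVEL_CRITICAL) {
        demo_abort(message);
    } else {
        fprintf(stderr, "warning: %s\n", message);
    }
}
```

In this model, a "Source ID ... was not found" message delivered at critical level is enough to reach the abort path, which matches the crash seen in the stonith core file.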
POSSIBLE CAUSE:
==============
Under <stonith_action_async_done> we have:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
}
Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50.
and a call to: stonith_action_clear_tracking_data(action);
Under stonith_action_clear_tracking_data(stonith_action_t * action) we
have AGAIN:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
action->timer_sigterm = 0;
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
action->timer_sigkill = 0;
}
This logic probably triggered the same problem the cherry pick
addressed for lrmd, but now for stonith (calling g_source_remove 2
times for the same source after glib2.0 was changed).
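The double removal can be reproduced in a small stand-alone model. This is a minimal sketch, not Pacemaker or glib code: demo_source_add and demo_source_remove are hypothetical stand-ins for the glib main loop, demo_action_t keeps only the two timer fields from the data dump above, and demo_async_done calls demo_clear_tracking_data directly in place of the stonith_action_destroy step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the main loop's table of live source IDs. */
#define MAX_SOURCES 32
static bool live[MAX_SOURCES];
static int not_found = 0;   /* counts "Source ID ... was not found" events */

static void demo_source_add(unsigned int id)
{
    assert(id > 0 && id < MAX_SOURCES);
    live[id] = true;
}

/* Mimics the newer behaviour: removing an unknown ID is reported. */
static void demo_source_remove(unsigned int id)
{
    if (id >= MAX_SOURCES || !live[id]) {
        not_found++;
        fprintf(stderr, "Source ID %u was not found when attempting to remove it\n", id);
        return;
    }
    live[id] = false;
}

/* Only the two timer IDs of the real action structure are modelled. */
typedef struct {
    unsigned int timer_sigterm;
    unsigned int timer_sigkill;
} demo_action_t;

/* Models stonith_action_clear_tracking_data(): remove and reset. */
static void demo_clear_tracking_data(demo_action_t *action)
{
    if (action->timer_sigterm > 0) {
        demo_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        demo_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }
}

/* Models stonith_action_async_done(); reset_ids selects the fixed behaviour. */
static void demo_async_done(demo_action_t *action, bool reset_ids)
{
    if (action->timer_sigterm > 0) {
        demo_source_remove(action->timer_sigterm);
        if (reset_ids) action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        demo_source_remove(action->timer_sigkill);
        if (reset_ids) action->timer_sigkill = 0;
    }
    /* Stands in for stonith_action_destroy(), which clears tracking data. */
    demo_clear_tracking_data(action);
}

/* Runs one callback cycle with timer IDs 13 and 14; returns failed removals. */
static int demo_run(bool reset_ids)
{
    demo_action_t action = { 13, 14 };
    not_found = 0;
    demo_source_add(action.timer_sigterm);
    demo_source_add(action.timer_sigkill);
    demo_async_done(&action, reset_ids);
    return not_found;
}
```

Calling demo_run(false) models the unpatched flow and reports two failed removals, one per timer; demo_run(true) models the flow with the IDs reset after removal and reports none.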
##############
commit 0326f05c9e26f39a394fa30830e31a76306f49c7
Author: Andrew Beekhof <andrew at beekhof.net>
Date: Thu Aug 7 13:49:24 2014 +1000
Fix: stonith-ng: Reset mainloop source IDs after removing them
diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
+ action->timer_sigterm = 0;
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
+ action->timer_sigkill = 0;
}
if (action->last_timeout_signo) {
##############
This commit resets both source IDs under <stonith_action_async_done>.
I will provide you with a hotfix containing this fix and ask for feedback.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962/+subscriptions