[Bug 1412962] Re: Pacemaker (stonith) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

Tue Jan 27 12:38:03 UTC 2015

** Description changed:

+ 
+ [IMPACT]
+ 
+   - Pacemaker can seg fault (stonith and lrmd) because:
+       - Newer glib versions use a hash table to look up GSources
+       - glib logs a critical message when a source that was already removed is removed again
+ 
+ [TEST CASE]
+ 
+   - Described by user
+ 
+ [REGRESSION POTENTIAL]
+ 
+   - Based on small fixes made by upstream commits
+   - User reports problem has been fixed
+ 
+ [OTHER INFO]
+ 
+ The following situation was brought to my attention:
+ 
+ """
+ lrmd process crashed when repeating "crm node standby" and "crm node online"
  It was brought to my attention that pacemaker (stonith) could seg fault under some conditions. This problem
  came up while I was solving the following bug:

  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/

  So you can check the problem here:

  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/34
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/35
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/36
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/37
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/38

  And a possible explanation here:

  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/39
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/40

  (Copied and pasted here):

  So the cherry-pick (for version
  trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
  commit) seems ok since it makes lrmd (services, services_linux) avoid
  repeating a timer when the source was already removed from the glib main
  loop context:

  example:

  + if (op->opaque->repeat_timer) {
  + g_source_remove(op->opaque->repeat_timer);
  ++ op->opaque->repeat_timer = 0;

  etc...

  This actually solved lrmd crashes I was getting with the testcase
  (explained inside this bug summary).

  ===
  Explanation:
  g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
  libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
  ===

  Analyzing your crash file (from stonith and not lrm), it looks like we
  have the following scenario:

  ==============

  exited = child_waitpid(child, WNOHANG);
  |_> child->callback(child, child->pid, core, signo, exitcode);
-     |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> call g_source_remove 2 times
-         |_> stonith_action_clear_tracking_data(action);
-             |_> g_source_remove(action->timer_sigterm);
-                 |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);
+     |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> call g_source_remove 2 times
+         |_> stonith_action_clear_tracking_data(action);
+             |_> g_source_remove(action->timer_sigterm);
+                 |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

  WHERE
  ==============

- Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy 
+ Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
  "Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

  This is the script responsible for monitoring a stonith resource, and it
  returned (triggering the monitor callback) with the following data:

  ------ data (begin) ------
  agent=fence_legacy
  action=monitor
  plugin=external/ssh
  hostlist=kjpnode2
  timeout=20
  async=1
  tries=1
  remaining_timeout=20
  timer_sigterm=13
  timer_sigkill=14
  max_retries=2
  pid=1464
  rc=0 (RETURN CODE)
  string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
  ------ data (end) ------

  OBS: This means that fence_legacy returned, after checking that
  st_kjpnode2 was ok, and its cleanup operation (callback) caused
  the problem we faced.

  As soon as it dies, the callback for this process is called:

-     if (child->callback) {
-         child->callback(child, child->pid, core, signo, exitcode);
+     if (child->callback) {
+         child->callback(child, child->pid, core, signo, exitcode);

  In our case, callback is:

  0x7f1f6189cec0 <stonith_action_async_done> which calls
  0x7f1f6189af10 <stonith_action_destroy> and then
  0x7f1f6189ae60 <stonith_action_clear_tracking_data> generating the 2nd removal (g_source_remove)

  With the 2nd call to g_source_remove, after the glib2.0 change explained
  before this comment, we get a

  g_critical ("Source ID %u was not found when attempting to remove it",
  tag);

  and this generates the crash (since g_log is called with a critical
  log level, causing crm_abort to be called).
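
The escalation described above can be sketched without glib or Pacemaker at all. This is a minimal illustration, not the real code path: `mock_log`, `escalate_critical`, `mock_abort` and the `mock_level` enum are hypothetical stand-ins for g_log, the application's log handler and crm_abort, and `mock_abort` only counts requests instead of terminating the process.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical log levels; real glib uses GLogLevelFlags. */
enum mock_level { MOCK_LEVEL_WARNING, MOCK_LEVEL_CRITICAL };

typedef void (*mock_handler_fn)(enum mock_level level, const char *message);

static mock_handler_fn installed_handler; /* set by the application */
static int abort_requests;                /* how often an abort was requested */

/* Stand-in for crm_abort(): only counts, so the sketch stays testable. */
static void mock_abort(const char *message)
{
    abort_requests++;
    fprintf(stderr, "abort requested: %s\n", message);
}

/* Application-side handler: critical messages are escalated to an abort. */
static void escalate_critical(enum mock_level level, const char *message)
{
    if (level == MOCK_LEVEL_CRITICAL) {
        mock_abort(message);
    } else {
        fprintf(stderr, "warning: %s\n", message);
    }
}

/* Stand-in for g_log(): format the message and pass it to the handler. */
static void mock_log(enum mock_level level, const char *fmt, ...)
{
    char buf[256];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    if (installed_handler) {
        installed_handler(level, buf);
    }
}
```

With `installed_handler = escalate_critical`, a warning is only printed, while a critical "Source ID ... was not found" message reaches `mock_abort`. That is the shape of the failure here: a message glib treats as a log entry becomes fatal because of how the application handles critical-level logs.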

  POSSIBLE CAUSE:
  ==============

  Under <stonith_action_async_done> we have:

  stonith_action_t *action = 0x7f1f639f5b50.

-     if (action->timer_sigterm > 0) {
-         g_source_remove(action->timer_sigterm);
-     }
-     if (action->timer_sigkill > 0) {
-         g_source_remove(action->timer_sigkill);
-     }
+     if (action->timer_sigterm > 0) {
+         g_source_remove(action->timer_sigterm);
+     }
+     if (action->timer_sigkill > 0) {
+         g_source_remove(action->timer_sigkill);
+     }

  Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50
  and a call to: stonith_action_clear_tracking_data(action);

  Under stonith_action_clear_tracking_data(stonith_action_t * action) we
  have AGAIN:

  stonith_action_t *action = 0x7f1f639f5b50.

-     if (action->timer_sigterm > 0) {
-         g_source_remove(action->timer_sigterm);
-         action->timer_sigterm = 0;
-     }
-     if (action->timer_sigkill > 0) {
-         g_source_remove(action->timer_sigkill);
-         action->timer_sigkill = 0;
-     }
+     if (action->timer_sigterm > 0) {
+         g_source_remove(action->timer_sigterm);
+         action->timer_sigterm = 0;
+     }
+     if (action->timer_sigkill > 0) {
+         g_source_remove(action->timer_sigkill);
+         action->timer_sigkill = 0;
+     }

  This logic probably triggered the same problem the cherry pick addressed
  for lrmd, but now for stonith (calling g_source_remove 2 times for the
  same source after glib2.0 was changed).
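
The double removal can be reproduced in a small, glib-free sketch. Everything here is an assumption made for illustration: `mock_source_attach` and `mock_source_remove` stand in for the glib main context, `struct mock_action` keeps only the two timer fields, and the `reset_ids` flag switches between the behaviour shown above and the behaviour after resetting the IDs to 0. The IDs 13 and 14 are the timer_sigterm and timer_sigkill values from the crash data.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_SOURCES 32

static bool attached[MAX_SOURCES]; /* hypothetical source table */
static int not_found;              /* removals that hit a missing ID */

static void mock_source_attach(unsigned int id)
{
    attached[id] = true;
}

/* Mimics g_source_remove(): complains when the ID is no longer attached. */
static bool mock_source_remove(unsigned int id)
{
    if (id >= MAX_SOURCES || !attached[id]) {
        not_found++;
        fprintf(stderr,
                "Source ID %u was not found when attempting to remove it\n", id);
        return false;
    }
    attached[id] = false;
    return true;
}

struct mock_action {
    unsigned int timer_sigterm;
    unsigned int timer_sigkill;
};

/* Second step: removes the timers and resets the stored IDs. */
static void clear_tracking_data(struct mock_action *action)
{
    if (action->timer_sigterm > 0) {
        mock_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        mock_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }
}

/* First step: reset_ids == false leaves the stale IDs behind. */
static void async_done(struct mock_action *action, bool reset_ids)
{
    if (action->timer_sigterm > 0) {
        mock_source_remove(action->timer_sigterm);
        if (reset_ids) {
            action->timer_sigterm = 0;
        }
    }
    if (action->timer_sigkill > 0) {
        mock_source_remove(action->timer_sigkill);
        if (reset_ids) {
            action->timer_sigkill = 0;
        }
    }
    clear_tracking_data(action); /* reached via the destroy step */
}

/* Runs one callback cycle and returns the number of failed removals. */
static int run_cycle(bool reset_ids)
{
    struct mock_action action = { 13, 14 };

    not_found = 0;
    mock_source_attach(13);
    mock_source_attach(14);
    async_done(&action, reset_ids);
    return not_found;
}
```

`run_cycle(false)` reports two failed removals, one per timer, because the second step still sees the stale IDs. `run_cycle(true)` reports none, since the `> 0` guards in the second step skip IDs that were already reset.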

  ##############

  commit 0326f05c9e26f39a394fa30830e31a76306f49c7
  Author: Andrew Beekhof <andrew at beekhof.net>
  Date: Thu Aug 7 13:49:24 2014 +1000

-     Fix: stonith-ng: Reset mainloop source IDs after removing them
+     Fix: stonith-ng: Reset mainloop source IDs after removing them

  diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
  index 64bd8f3..2837682 100644
  --- a/lib/fencing/st_client.c
  +++ b/lib/fencing/st_client.c
  @@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,

-      if (action->timer_sigterm > 0) {
-          g_source_remove(action->timer_sigterm);
+      if (action->timer_sigterm > 0) {
+          g_source_remove(action->timer_sigterm);
  + action->timer_sigterm = 0;
-      }
-      if (action->timer_sigkill > 0) {
-          g_source_remove(action->timer_sigkill);
+      }
+      if (action->timer_sigkill > 0) {
+          g_source_remove(action->timer_sigkill);
  + action->timer_sigkill = 0;
-      }
+      }

-      if (action->last_timeout_signo) {
+      if (action->last_timeout_signo) {

  ##############

  under <stonith_action_async_done>.

  Will provide you a hotfix with this fix and ask for feedback.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1412962