[Bug 1524907] Re: Race condition in SIGTERM signal handler

Tue Sep 13 11:35:02 UTC 2016

** Description changed:

+ [Impact]
+ 
+  * See bug description. We are seeing this in a Liberty production
+    environment and (at least) nova-conductor services are failing to
+    restart properly.
+ 
+  * This fix just missed the version of python-oslo.service we have in the
+    Liberty UCA, so it is being queued up for backport.
+ 
+ [Test Case]
+ 
+  * Start a service that has a high number of workers, check that all
+    are up, then do a service stop (or killall -s SIGTERM nova-conductor)
+    and check that all workers/processes are stopped.
+ 
+ [Regression Potential]
+ 
+  * none
+ 
+ 
  If the process launcher gets a SIGTERM signal, it calls _sigterm() to
  handle it. This function calls the SignalHandler() singleton to get the
  instance of SignalHandler. The singleton acquires a lock to ensure
  that only one instance is created.

  The problem arises when the process launcher gets a second SIGTERM while
  the singleton lock (called 'singleton_lock') is held. _sigterm() is
  called again (a reentrant call!) and tries to take the same lock, so we
  enter a deadlock. If eventlet is used, eventlet fails on an assertion
  error: "Cannot switch to MAINLOOP from MAINLOOP".

  The bug can occur with both SIGTERM and SIGHUP signals.

  I saw this issue with OpenStack services managed by systemd with a wrong
  configuration: SIGTERM is sent to all processes of the cgroup, instead
  of only to the "main" process ("Main PID" in systemd). When the process
  launcher gets a SIGTERM, it sends a new SIGTERM signal to each child
  process. If systemd has already sent a first SIGTERM to the child
  processes, they now get two SIGTERMs in quick succession.

  For OpenStack services managed by systemd, the service file must contain
  "KillMode=process" to only send SIGTERM to the main process ("Main
  PID").

** Summary changed:

- Race condition in SIGTERM signal handler
+ [SRU] Race condition in SIGTERM signal handler

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1524907

Title:
  [SRU] Race condition in SIGTERM signal handler

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive liberty series:
  In Progress
Status in oslo.service:
  Fix Released
Status in python-oslo.service package in Ubuntu:
  Fix Released
Status in python-oslo.service source package in Wily:
  Won't Fix
Status in python-oslo.service source package in Xenial:
  Fix Released
Status in python-oslo.service source package in Yakkety:
  Fix Released

Bug description:
  [Impact]

   * See bug description. We are seeing this in a Liberty production
     environment and (at least) nova-conductor services are failing to
     restart properly.

   * This fix just missed the version of python-oslo.service we have in the
     Liberty UCA, so it is being queued up for backport.

  [Test Case]

   * Start a service that has a high number of workers, check that all
     are up, then do a service stop (or killall -s SIGTERM nova-conductor)
     and check that all workers/processes are stopped.
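
  The check in the test case can be sketched roughly as below. This is only
  an illustration under assumptions: the workers are plain sleeping Python
  child processes standing in for real service workers, and start_workers()
  and stop_and_check() are hypothetical helper names, not part of
  oslo.service or nova.

```python
import signal
import subprocess
import sys


def start_workers(count):
    """Start `count` stand-in worker processes (hypothetical helper)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def stop_and_check(workers, timeout=10):
    """Send SIGTERM to every worker and report whether all of them exited."""
    for worker in workers:
        worker.send_signal(signal.SIGTERM)

    all_stopped = True
    for worker in workers:
        try:
            worker.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # A worker that ignores SIGTERM is what the bug looks like
            # from the outside; clean it up so the sketch does not leak it.
            all_stopped = False
            worker.kill()
            worker.wait()
    return all_stopped


if __name__ == "__main__":
    print("all workers stopped:", stop_and_check(start_workers(8)))
```

  Against a real deployment the equivalent check would be to list the
  remaining service processes after the stop and confirm that none are left.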

  [Regression Potential]

   * none

  If the process launcher gets a SIGTERM signal, it calls _sigterm() to
  handle it. This function calls the SignalHandler() singleton to get the
  instance of SignalHandler. The singleton acquires a lock to ensure
  that only one instance is created.

  The problem arises when the process launcher gets a second SIGTERM while
  the singleton lock (called 'singleton_lock') is held. _sigterm() is
  called again (a reentrant call!) and tries to take the same lock, so we
  enter a deadlock. If eventlet is used, eventlet fails on an assertion
  error: "Cannot switch to MAINLOOP from MAINLOOP".
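
  The reentrancy problem can be illustrated with a small sketch. It is not
  the oslo.service code: get_handler() is a hypothetical stand-in for the
  singleton access, the second signal is simulated by a nested call, and a
  timeout is used on the lock only so that the example terminates instead of
  hanging the way the real handler does.

```python
import threading

# A plain (non-reentrant) lock, standing in for 'singleton_lock'.
singleton_lock = threading.Lock()


def get_handler(second_signal_arrives):
    """Hypothetical stand-in for the singleton access done by _sigterm()."""
    # With a plain acquire() the nested call below would block forever;
    # the timeout exists only to make the deadlock observable.
    if not singleton_lock.acquire(timeout=0.1):
        return "deadlock"
    try:
        if second_signal_arrives:
            # Simulate a second SIGTERM interrupting the handler while
            # the lock is still held: the handler runs again, reentrantly.
            return get_handler(second_signal_arrives=False)
        return "ok"
    finally:
        singleton_lock.release()


if __name__ == "__main__":
    print(get_handler(second_signal_arrives=False))
    print(get_handler(second_signal_arrives=True))
```

  A single signal gets through, while the simulated second signal cannot
  take the lock that the interrupted first call still holds.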

  The bug can occur with both SIGTERM and SIGHUP signals.

  I saw this issue with OpenStack services managed by systemd with a
  wrong configuration: SIGTERM is sent to all processes of the cgroup,
  instead of only to the "main" process ("Main PID" in systemd). When
  the process launcher gets a SIGTERM, it sends a new SIGTERM signal to
  each child process. If systemd has already sent a first SIGTERM to the
  child processes, they now get two SIGTERMs in quick succession.

  For OpenStack services managed by systemd, the service file must
  contain "KillMode=process" to only send SIGTERM to the main process
  ("Main PID").
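
  One way to check a service file for this setting is sketched below. The
  unit text and the kill_mode() helper are hypothetical examples, and the
  parsing relies on Python's configparser, which only approximates the
  systemd unit file syntax. When the option is absent the sketch reports
  "control-group", which is the systemd default and is the mode that signals
  every process in the cgroup.

```python
import configparser

# Hypothetical unit file content, used only for illustration.
EXAMPLE_UNIT = """\
[Unit]
Description=Example OpenStack service (hypothetical)

[Service]
ExecStart=/usr/bin/example-service
KillMode=process
"""


def kill_mode(unit_text):
    """Return the KillMode of a unit file, or the systemd default."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(unit_text)
    return parser.get("Service", "KillMode", fallback="control-group")


if __name__ == "__main__":
    print(kill_mode(EXAMPLE_UNIT))
```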

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1524907/+subscriptions