[RFC] [PATCH] notify init daemon when children are reparented to it

Scott James Remnant scott at canonical.com
Tue Dec 16 22:01:48 UTC 2008

Thanks for the feedback, this is exactly what I was after!

On Tue, 2008-12-16 at 17:04 +0000, Andy Whitcroft wrote:

> On Tue, Dec 16, 2008 at 04:27:26PM +0000, Scott James Remnant wrote:

> > The init daemon requests notification of process adoption by realtime
> > signal, and then presumably uses sigaction or signalfd to read the
> > siginfo_t structures.
> Do these need to be any specific signal?  I thought that sigaction
> allowed you to specify queueing semantics for any signal.
It sadly doesn't; the behaviour of whether an additional copy of a
signal is queued if one of the same type is already pending is entirely

        return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
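
To see the difference from userspace, here is a minimal sketch (the
helper name count_queued() is made up for illustration).  It blocks a
signal, sends it to itself several times with sigqueue(), and counts how
many instances the kernel kept pending.  On Linux I would expect a
standard signal to collapse to one instance and a realtime signal to
keep them all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Block `sig`, send it to ourselves `times` times, then count how many
 * instances were actually kept pending.  Hypothetical helper.
 */
int count_queued(int sig, int times)
{
	sigset_t set, old;
	siginfo_t info;
	struct timespec zero = { 0, 0 };
	union sigval val = { .sival_int = 0 };
	int i, n = 0;

	sigemptyset(&set);
	sigaddset(&set, sig);
	sigprocmask(SIG_BLOCK, &set, &old);

	for (i = 0; i < times; i++)
		sigqueue(getpid(), sig, val);

	/* Drain whatever is pending; a zero timeout means we never block. */
	while (sigtimedwait(&set, &info, &zero) == sig)
		n++;

	/* Nothing is left pending, so restoring the mask is safe. */
	sigprocmask(SIG_SETMASK, &old, NULL);
	return n;
}
```

Calling count_queued(SIGUSR1, 3) and count_queued(SIGRTMIN, 3) side by
side shows why the adoption notification wants a realtime signal: one
notification per adopted child must survive.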

> > It tracks the pid of any process it spawns.
> > 
> > Should that process die, it will receive SIGCHLD.  However also pending
> > will be the requested signal (SIGRTMIN?) with si_status set to the pid
> > of the original SIGCHLD.
> Which order do these arrive in?  If both are posted nearly at once we
> may take them in a specific order defined by the kernel.  How do you
> cope if they are the unhelpful way round?  We don't want to re-start a
> parent which was really just a setsid proxy shell.
Well, there are two different orders going on here; and we're not in the
BKL so delivery to userspace will entirely depend on sunspot activity.
Happily there are guarantees.

The adoption signal is made pending before the child signal for the
terminating process.  If the terminating process's parent was init, then
you have both signals queued for init at the same time.

The order in which the signals are *delivered* is different, and depends
on the method of delivery and various other factors.  For example with
signalfd(), the kernel will deliver any waiting SIGCHLD *before* the
waiting SIGRTMIN (it does them in numerical order, heh).
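
A small sketch of that ordering, assuming a single-threaded process that
sends both signals to itself (the function name is invented for this
example).  SIGRTMIN is deliberately posted first, and the function
reports which signal number signalfd hands back first.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/*
 * Queue SIGRTMIN first and SIGCHLD second, then return the number of
 * the first signal read from a signalfd, or -1 on error.
 */
int first_signal_from_signalfd(void)
{
	sigset_t set, old;
	struct signalfd_siginfo si;
	union sigval val = { .sival_int = 0 };
	int fd, i, first = -1;

	sigemptyset(&set);
	sigaddset(&set, SIGCHLD);
	sigaddset(&set, SIGRTMIN);
	sigprocmask(SIG_BLOCK, &set, &old);

	fd = signalfd(-1, &set, SFD_CLOEXEC);
	if (fd >= 0) {
		sigqueue(getpid(), SIGRTMIN, val);	/* posted first */
		sigqueue(getpid(), SIGCHLD, val);	/* posted second */

		/* Both are pending, so neither read can block. */
		for (i = 0; i < 2; i++) {
			if (read(fd, &si, sizeof si) != (ssize_t) sizeof si)
				break;
			if (i == 0)
				first = (int) si.ssi_signo;
		}
		close(fd);
	}

	sigprocmask(SIG_SETMASK, &old, NULL);
	return first;
}
```

If the numerical-order behaviour holds, the result is SIGCHLD even
though SIGRTMIN was queued earlier.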

Let's illustrate the fundamental scenarios.

 init spawns process,
 process spawns child,
 process exits
   child is reparented to init, signal sent and received
   sigchld sent and received
   process zombie cleaned up

This is the ideal situation: our process read from the signalfd fast
enough that it received the notification of reparent first.  It was able
to add the Process record for the new pid *before* the Process record of
the old pid was cleaned up.

A full daemonise:

 init spawns process,
 process spawns intermediate,
 intermediate calls setsid and spawns child,
 process exits
    intermediate is reparented to init, signal sent and received
    sigchld sent and received
    process zombie cleaned up
 intermediate exits
    child is reparented to init, signal sent and received
    sigchld sent and received
    intermediate zombie cleaned up

Again this is ideal: we received the notifications in the right order
and we were able to follow the pids in the right order.
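
For reference, the daemon side of that dance looks roughly like the
sketch below.  spawn_daemonised() is a made-up name, and the pipe is
only there so the caller can learn the final child's pid for the sake of
the example; a real init would learn it from the adopt signal instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the "full daemonise" sequence and return the pid of the final
 * child, or -1 on error.  The final child just exits here.
 */
pid_t spawn_daemonised(void)
{
	int fds[2];
	pid_t process, child = -1;

	if (pipe(fds) < 0)
		return -1;

	process = fork();
	if (process < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (process == 0) {
		/* "process": spawn the intermediate, then exit at once. */
		close(fds[0]);
		if (fork() == 0) {
			/* "intermediate": new session, then spawn the child. */
			pid_t pid;

			setsid();
			pid = fork();
			if (pid > 0 && write(fds[1], &pid, sizeof pid) < 0)
				_exit(1);
			_exit(0);	/* intermediate and child both end here */
		}
		_exit(0);
	}

	close(fds[1]);
	if (read(fds[0], &child, sizeof child) != (ssize_t) sizeof child)
		child = -1;
	close(fds[0]);
	waitpid(process, NULL, 0);	/* the only one we can reap directly */
	return child;
}
```

The caller only ever sees "process" exit; the pid that matters is two
forks away, which is the problem the adopt signal is meant to solve.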

What happens if we receive the signals in the wrong order?

init spawns process,
process spawns child,
process exits
   child is reparented to init, adopt signal sent
   sigchld sent and received
   process zombie cleaned up
   adopt signal received

This is possible if we weren't context switched and both signals were
pending; the SIGCHLD will be read first.  But the clue here is that
*both* signals are pending: if SIGCHLD is pending, the adopt signal will
also be pending for that process.

Thus the init daemon gets SIGCHLD, marks the Process record as dead and
cleaned up, then it receives the adopt signal, creates a new Process
record and replaces the previous one (which is still in the list as

We only clean up dead Process records when we're out of the signalfd
loop; we *know* that if we leave that loop, we have received any adopt
signals for any children we've reaped.
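
A minimal sketch of that deferred cleanup, not Upstart's actual code:
the structure, the function names and the integer job id are all
assumptions made up for illustration.  SIGCHLD only marks a record as
dead, the adopt handler may still find the dead father, and the sweep
runs once the signalfd loop has drained.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_PROCS 16

/* Hypothetical Process record; `job` is just an integer id here. */
struct process {
	pid_t pid;
	int   job;
	int   used;
	int   dead;	/* reaped, but kept until the signal loop is done */
};

struct process procs[MAX_PROCS];

struct process *find_process(pid_t pid)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++)
		if (procs[i].used && procs[i].pid == pid)
			return &procs[i];
	return NULL;
}

void track_process(pid_t pid, int job)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++) {
		if (!procs[i].used) {
			procs[i] = (struct process) { pid, job, 1, 0 };
			return;
		}
	}
	assert(!"process table full");
}

/* SIGCHLD: mark only; a pending adopt signal may still name this pid. */
void mark_reaped(pid_t pid)
{
	struct process *p = find_process(pid);

	if (p)
		p->dead = 1;
}

/* Adopt signal: the child takes over the job of its (maybe dead) father. */
void adopt_child(pid_t child, pid_t father)
{
	struct process *f = find_process(father);

	if (f)
		track_process(child, f->job);
}

/* Run only after the signalfd loop has drained. */
void sweep_dead(void)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++)
		if (procs[i].dead)
			procs[i].used = procs[i].dead = 0;
}

/* Live pid currently tracked for `job`, or -1 if there is none. */
pid_t job_pid(int job)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++)
		if (procs[i].used && !procs[i].dead && procs[i].job == job)
			return procs[i].pid;
	return -1;
}
```

With pid 100 tracked for job 1, the wrong-order sequence is
mark_reaped(100), adopt_child(101, 100), sweep_dead(); afterwards
job_pid(1) is 101.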

Much more interesting, what happens if the daemonising processes exit in
the wrong order? :p

init spawns process,
process spawns intermediate,
intermediate calls setsid and spawns child,
intermediate exits
    child is reparented to init (but init doesn't even know about
      intermediate yet!)
    sigchld sent (but process doesn't handle it or wait)
    intermediate *stays as a zombie*
process exits
    the intermediate zombie is reparented to init (yes, this happens!),
     init gets a signal sent and received
    sigchld redelivered to init for intermediate (kernel does this too)
    intermediate zombie cleaned up
    sigchld sent and received for process
    process zombie cleaned up

The magic here is that zombies get reparented to init, and that SIGCHLD
is resent to init for them.  And again, this happens in a predictable
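
The zombie hand-over can be observed from an ordinary process.  The
sketch below is an assumption-laden stand-in: it uses
PR_SET_CHILD_SUBREAPER, a later mainline prctl that is not part of this
patch, so that the caller plays the role of init.  The function name is
invented.  "process" waits with WNOWAIT so that it knows the
intermediate is a zombie without reaping it, then exits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns how many children the caller reaped, or -1 on error.  The
 * expectation is 2: "process" and the zombie it left behind.
 */
int reap_inherited_zombie(void)
{
	pid_t process;
	int reaped = 0;

	/* An ignored SIGCHLD would make the kernel auto-reap zombies. */
	signal(SIGCHLD, SIG_DFL);

	/* Stand in for init: orphans below us are reparented to us. */
	if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0)
		return -1;

	process = fork();
	if (process < 0)
		return -1;
	if (process == 0) {
		siginfo_t info;
		pid_t intermediate = fork();

		if (intermediate == 0)
			_exit(0);	/* intermediate exits straight away */
		if (intermediate < 0)
			_exit(1);

		/* Wait until it is a zombie, but do NOT reap it. */
		waitid(P_PID, intermediate, &info, WEXITED | WNOWAIT);
		_exit(0);		/* the zombie now goes to the reaper */
	}

	/* "process" first, then the zombie it never waited for. */
	while (waitpid(-1, NULL, 0) > 0)
		reaped++;

	prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);
	return reaped;
}
```

Note that the loop reaps every child of the caller, so this is only
meaningful in a process with no other children.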

init gets the reparent record, so adds a Process but doesn't attach it
to any particular Job.  ("hey, I have a child pid 4412 but god knows
where it belongs").

init gets a second reparent record, this matches up the Process for the
Job to the Process record it has in its "somewhere" list, so it is able
to move that Process to the job.

init gets two sigchlds, and is able to delete two of the three Process

Thus we have one Process left for the job, which is the pid of the
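
The matching step could be sketched as below.  This is an illustration
under my own assumptions rather than real Upstart code: each record
remembers the father pid reported with the adopt signal, records whose
father is unknown stay unattached (NO_JOB), and they are moved to a job
as soon as their father turns up in one.  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_PROCS 16
#define NO_JOB    (-1)

/* Hypothetical record: pid, the father named in the adopt signal, job. */
struct process {
	pid_t pid;
	pid_t father;
	int   job;
	int   used;
};

struct process table[MAX_PROCS];

struct process *lookup(pid_t pid)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++)
		if (table[i].used && table[i].pid == pid)
			return &table[i];
	return NULL;
}

void track(pid_t pid, pid_t father, int job)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++) {
		if (!table[i].used) {
			table[i] = (struct process) { pid, father, job, 1 };
			return;
		}
	}
	assert(!"process table full");
}

/* SIGCHLD for a pid: drop its record. */
void forget(pid_t pid)
{
	struct process *p = lookup(pid);

	if (p)
		p->used = 0;
}

/* Move every unattached descendant of `father` into `job`. */
void attach_orphans(pid_t father, int job)
{
	size_t i;

	for (i = 0; i < MAX_PROCS; i++) {
		if (table[i].used && table[i].job == NO_JOB &&
		    table[i].father == father) {
			table[i].job = job;
			attach_orphans(table[i].pid, job);
		}
	}
}

/* Adopt signal: `child` was reparented to us, `father` just went away. */
void record_adoption(pid_t child, pid_t father)
{
	struct process *f = lookup(father);
	int job = f ? f->job : NO_JOB;

	track(child, father, job);
	if (job != NO_JOB)
		attach_orphans(child, job);
}

/* Number of records currently attached to `job`. */
int job_count(int job)
{
	size_t i;
	int n = 0;

	for (i = 0; i < MAX_PROCS; i++)
		if (table[i].used && table[i].job == job)
			n++;
	return n;
}
```

Replaying the scenario with process 10 in job 7:
record_adoption(12, 11) leaves 12 unattached, record_adoption(11, 10)
pulls both 11 and 12 into job 7, and forget(11), forget(10) leave only
12.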

> > The init daemon reaps the original child, and updates its pid to that of
> > the new child obtained from the si_pid of the requested signal.
> So init spawns a process and remembers its pid.  Later that process
> again forks and exits.  At this point we get an adoption notice for
> its child and a death for the parent.  The adoption notice indicates the
> pid we knew before so we can replace the first pid by the new pid of the
> child.  That child will then do the same thing and we again will follow
> the new pid.  Ok so far so good.  We have the new real top-level pid.
> How do we know that _this_ one is really the top level for the daemon?
We don't really need to worry about it.

We just keep track of all the processes that we can see; obviously if
something we run breaks into *two* independent daemons, then we will not
believe that the service has died until both daemons are dead.

But then I think that's acceptable behaviour.

More interesting are the cases where children are left around, like if
the apache root process dies.  But again, you kinda don't want to
respawn apache while it's in that state - it's enough to know that
there's something wrong, and know the pids of the processes remaining
-- that way at least Upstart knows what to kill() when the sysadmin
issues the stop command.

> > And thus we have simple, race-free supervision of daemon processes by an
> > init daemon.
> I think we need to ensure that this mechanism is sufficient for us to
> know which of these exit/adopts represent the real service parent.
> By which I mean we are receiving pairs of SIGCHLD and SIGADOPT [sic].
> Now if we go through the normal double fork dance, then later start
> a child for something and then die.  How does init know that this is a
> parent failure rather than a special initialisation dance which has three
> forks in it?  I am worried you would have to assume the normal double
> fork dance and be vulnerable to any daemon which works a different way?
> Do you have a clever way to know?
Again, I think we don't need to know.  Provided some side-effect of the
service is still running, we want to treat it as running - and know the
pids that are running with it.

Triple-forks are entirely common

  end script

We need to be able to handle that, so don't really want an upper limit;
hell, quadruple forks are common too

    /etc/init.d/idiotic start
  end script

Once we know the pids though, we can apply some heuristics to weed out
known services.  We would want to respawn openssh if the primary daemon
died, without killing off the user login processes.

That could be done with a "different user and different session" match;
having the pid to be able to do that lookup is the key.
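
One way such a match might look, as a rough sketch only: the function
names are invented, and reading the owner from the /proc/<pid> directory
is an assumption that usually reflects the process's effective uid on
Linux but is not guaranteed to.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Owner of a process, taken from its /proc entry; (uid_t) -1 on failure. */
uid_t proc_uid(pid_t pid)
{
	char path[32];
	struct stat st;

	snprintf(path, sizeof path, "/proc/%ld", (long) pid);
	return stat(path, &st) == 0 ? st.st_uid : (uid_t) -1;
}

/*
 * 1 if `pid` runs as a different user AND in a different session than
 * the service's main process, 0 otherwise (including lookup failures).
 */
int outside_service(pid_t main_pid, pid_t pid)
{
	pid_t s1 = getsid(main_pid), s2 = getsid(pid);
	uid_t u1 = proc_uid(main_pid), u2 = proc_uid(pid);

	if (s1 < 0 || s2 < 0 || u1 == (uid_t) -1 || u2 == (uid_t) -1)
		return 0;
	return s1 != s2 && u1 != u2;
}
```

Treating lookup failures as "still part of the service" errs on the side
of tracking too much rather than losing a pid.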

> Comments on the code itself inline.
> > diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> > index 48d887e..1fa1b75 100644
> > --- a/include/linux/prctl.h
> > +++ b/include/linux/prctl.h
> > @@ -85,4 +85,8 @@
> >  #define PR_SET_TIMERSLACK 29
> >  #define PR_GET_TIMERSLACK 30
> >  
> > +/* Set/get notification of adoption by signal */
> > +#define PR_SET_ADOPTSIG 31  /* Second arg is a signal */
> > +#define PR_GET_ADOPTSIG 32  /* Second arg is a ptr to return the signal */
> > +
> Are we considering this for upstream or something specific to Ubuntu
> kernels?  We are likely to be offering generic kernels for install,
> mostly to help debugging problems.
Definitely upstream.

I'm hoping that the collective advice from you guys will enable me to
get a patch that'll breeze through upstream -- with an accompanying mail
that will help.

> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 4530fc6..40228e2 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1474,6 +1474,43 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
> >  	spin_unlock_irqrestore(&sighand->siglock, flags);
> >  }
> >  
> > +/* Let init know that it has adopted a new child */
> > +void do_notify_parent_adopted(struct task_struct *tsk, struct task_struct *father)
> > +{
> > +	struct siginfo info;
> > +	unsigned long flags;
> > +	struct task_struct *reaper;
> > +	struct sighand_struct *sighand;
> > +	int ret;
> > +
> > +	reaper = tsk->real_parent;
> > +
> > +	memset (&info, 0, sizeof info);
> > +	info.si_signo = reaper->adopt_signal;
> > +	/*
> > +	 * set code to the same range as SIGCHLD so the right bits of
> > +	 * siginfo_t get copied, to userspace this will appear as si_code=0
> > +	 */
> > +	info.si_code = __SI_CHLD;
> > +	/*
> > +	 * see comment in do_notify_parent() about the following 4 lines
> > +	 */
> > +	rcu_read_lock();
> > +	info.si_pid = task_pid_nr_ns(tsk, reaper->nsproxy->pid_ns);
> > +	info.si_status = task_pid_nr_ns(father, reaper->nsproxy->pid_ns);
> Is there any guarantee that this second pid fits into the si_status
> entry here?  Certainly they are not the same type right now:
>                         pid_t _pid;             /* which child */
>                         int _status;            /* exit code */
> To do this 'right' we are probably forced to make a new entry in the
> siginfo union for this type of info.
I avoided that because I thought the larger patch that touches
everything in the arch/ tree (they all reimplement copy_siginfo_to_user
*sigh*) might have less chance of being accepted.
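
For what it's worth, the userspace side of the si_status approach could
guard against the type mismatch at compile time.  A sketch, assuming the
field layout used in the patch above (si_pid is the adopted child,
si_status carries the father's pid); the struct and function names are
made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>

/* Refuse to build if a pid_t could not fit in the int-sized si_status. */
_Static_assert(sizeof(pid_t) <= sizeof(int),
	       "pid_t does not fit in si_status");

/* Hypothetical decoded form of an adopt notification. */
struct adoption {
	pid_t child;	/* process that was reparented to us */
	pid_t father;	/* its previous parent, which went away */
};

/* Decode a siginfo_t laid out as in the patch under review. */
struct adoption decode_adoption(const siginfo_t *info)
{
	struct adoption a;

	a.child  = info->si_pid;
	a.father = (pid_t) info->si_status;
	return a;
}
```

This only papers over the concern in userspace; it says nothing about
whether reusing si_status is acceptable upstream.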

It was my initial thought though, to add an __SI_ADOPT and si_parent

Scott James Remnant
scott at canonical.com