[RFC] [PATCH] notify init daemon when children are reparented to it

Tue Dec 16 20:38:57 UTC 2008

On Tuesday 16 December 2008 08:27:26 Scott James Remnant wrote:
> Please review and comment on the attached patch.
>
> Background (UNIX 101):
>
> All processes must have a parent.  When a child dies, the parent is
> notified by SIGCHLD and must use the wait() system call to reap the
> remaining zombie.
>
> When a process dies, its children are reparented to the init daemon so
> that there's always a process to be notified of their eventual death.
>
> (The init daemon cannot die.)
>
>
> As well as a parent, processes also have a process group and a session.
> This is quite complicated, so much so that it takes up an entire chapter
> of Stevens which few people claim to have read, let alone understood.
>
> It all comes down to connecting the life of a process to a life of a
> terminal.  Daemons don't want to be connected like that, so they perform a
> little dance:
>
>  - they fork(), creating a child
>  - the original process (child of the shell) exits, the child carries on
>    but is now reparented to init
>  - the process calls setsid() to change to a new session and process
>    group, it's now completely unconnected from the shell or terminal
>  - *but* due to a quirk of POSIX, if it were to be made to open() a tty
>    device, it would end up owning it!  FAIL.
>  - so the process fork()s again, creating a new child
>  - the process (child of the child of the shell) exits, the new child
>    carries on and is reparented to init
>
> Thus the daemon is a child of init, and in its own process group and
> session which is not connected to any shell or terminal.  Win.
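
The dance above can be sketched in C roughly as follows.  This is only an
illustration: daemonize() and observe_daemon() are made-up names, error
handling is minimal, and the pipe exists purely so the result can be
observed from the outside.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Classic double fork.  Returns 0 in the final daemon process and -1 on
 * failure; the intermediate processes _exit() and never return. */
int daemonize(void)
{
    pid_t pid = fork();          /* fork(), creating a child */
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);                /* original process exits */

    if (setsid() < 0)            /* new session and process group */
        return -1;

    pid = fork();                /* fork() again */
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);                /* the session leader exits */

    return 0;                    /* not a session leader any more */
}

/* Hypothetical demo harness: runs daemonize() in a throwaway child and
 * reports the daemon's PID and session ID back over a pipe. */
int observe_daemon(pid_t *daemon_pid, pid_t *daemon_sid)
{
    int fds[2];
    pid_t info[2];

    if (pipe(fds) < 0)
        return -1;

    pid_t launcher = fork();
    if (launcher < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (launcher == 0) {
        close(fds[0]);
        if (daemonize() == 0) {
            info[0] = getpid();
            info[1] = getsid(0);
            if (write(fds[1], info, sizeof info) != (ssize_t)sizeof info)
                _exit(1);
        }
        _exit(0);
    }

    close(fds[1]);
    waitpid(launcher, NULL, 0);  /* the only PID the caller ever knew */
    ssize_t n = read(fds[0], info, sizeof info);
    close(fds[0]);
    if (n != (ssize_t)sizeof info)
        return -1;

    *daemon_pid = info[0];
    *daemon_sid = info[1];
    return 0;
}
```

Note that the caller's waitpid() returns as soon as the launcher exits,
while the daemon itself carries on under a PID the caller only learns
because this demo passes it back explicitly.
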
>
>
> Well, almost a win.  The trouble is that this dance also happens to
> completely disconnect it from any kind of process supervisor.
>
> It wouldn't be so bad, except that most well-written daemons don't
> actually daemonise until after they've finished initialisation - they're
> even usually listening on the right socket and everything.  The
> daemonisation is more than just an escape from the shell and terminal,
> it's notification that they are ready.
>
> We want to be able to supervise daemons.
>
>
> Init has a head-start; it's the eventual parent of daemon processes
> anyway, so it will be notified of their death by SIGCHLD and receives
> their exit status information through wait().
>
> So you can't escape from init.  But this isn't ideal, while init can see
> the process death, it has no idea what process that was, and what it was
> supposed to do about it.
>
> If there are two apache2 daemons running (in different chroots, or for
> different IPs or ports?), it doesn't know which of the two died because
> the PID that died is unknown to it.
>
> Likewise it can't provide status information as to whether either is
> running or not, since the only PIDs it knew exited immediately after it
> ran them.
>
>
> Why do it in the kernel?:
>
> Frankly because this cannot be done in userspace without the kernel's
> help, or without modifying daemon code to behave differently (and
> incompatibly with other systems).
>
> The closest I've come to a race-free way to do this so far is by having
> init ptrace() every process it runs so it can follow calls to fork() and
> exec().
>
> People look at me strangely when they find out about that (plus it
> doesn't work so well).
>
>
> About the patch:
>
> The patch adds a new PR_{GET,SET}_ADOPTSIG prctl, similar to the
> existing PR_{GET,SET}_PDEATHSIG control and with similar semantics.
>
>  - When non-zero, the process will receive the given signal if another
>    process is reparented to it.
>
>  - This signal has the pid of the reparented process in the si_pid field
>    of the siginfo_t.
>
>  - The signal also has the pid of the *previous* parent process in the
>    si_status field.
>
>  - Notification is disabled after exec() or setsid().
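
PR_SET_ADOPTSIG only exists in the patch under review, so the sketch below
exercises its existing sibling PR_SET_PDEATHSIG instead, on the assumption
(taken from the description above) that the new pair would be driven the
same way.  arm_pdeathsig() is a made-up helper name.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>

/* Arms the existing parent-death signal, then reads the setting back.
 * Returns the signal number now armed (0 means disabled), or -1 on
 * error.  An init daemon using the proposed interface would presumably
 * make the same two calls with PR_SET_ADOPTSIG / PR_GET_ADOPTSIG. */
int arm_pdeathsig(int signo)
{
    int current = 0;

    if (prctl(PR_SET_PDEATHSIG, (unsigned long)signo, 0, 0, 0) < 0)
        return -1;
    if (prctl(PR_GET_PDEATHSIG, &current, 0, 0, 0) < 0)
        return -1;
    return current;
}
```

Passing 0 as the signal number disables the notification again.
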
>
>
> The functionality only affects the init daemon, and only if the init
> daemon activates the prctl().  [There is already other init-daemon
> specific code in the kernel, and there are already other specialist
> signals activated by prctl() - so this is consistent].
>
> Since the siginfo_t contains useful information, the signal should
> generally be >= SIGRTMIN; otherwise only the information from the first
> will be received.
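
The difference between standard and realtime signals mentioned above can
be seen from userspace without the patch.  The sketch below queues several
instances of a signal at the calling process while it is blocked and then
counts how many come back out; count_delivered() is a made-up name and the
code assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Queues `count` instances of `signo` at ourselves while it is blocked,
 * then drains the pending set and returns how many were delivered. */
int count_delivered(int signo, int count)
{
    sigset_t set, old;
    siginfo_t info;
    struct timespec zero = { 0, 0 };
    int delivered = 0;

    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &old) < 0)
        return -1;

    for (int i = 0; i < count; i++) {
        union sigval v = { .sival_int = i };
        if (sigqueue(getpid(), signo, v) < 0)
            break;
    }

    /* A zero timeout makes sigtimedwait() poll: it fails with EAGAIN
     * once nothing is pending any more. */
    while (sigtimedwait(&set, &info, &zero) == signo)
        delivered++;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return delivered;
}
```

On Linux, three queued SIGUSR1s collapse into one delivery, while three
queued SIGRTMINs are delivered three times, each with its own siginfo_t.
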
>
>
> From userspace:
>
> The init daemon requests notification of process adoption by realtime
> signal, and then presumably uses sigaction or signalfd to read the
> siginfo_t structures.
>
> It tracks the pid of any process it spawns.
>
> Should that process die, it will receive SIGCHLD.  However also pending
> will be the requested signal (SIGRTMIN?) with si_status set to the pid
> of the process that just died.
>
> The init daemon reaps the original child, and updates its pid to that of
> the new child obtained from the si_pid of the requested signal.
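
The bookkeeping described above might look something like this inside the
init daemon.  It is only a sketch: struct job, job_adopt() and job_find()
are invented for illustration, and the (child, old_parent) pair is assumed
to come from si_pid and si_status as the patch description says.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-job record kept by the init daemon. */
struct job {
    const char *name;
    pid_t pid;   /* PID currently believed to be the job's main process */
};

/* Handle one adoption notification: `child` has just been reparented to
 * init and `old_parent` is the process it used to belong to.  Returns
 * the job that now follows `child`, or NULL if the old parent was not
 * being tracked. */
struct job *job_adopt(struct job *jobs, size_t njobs,
                      pid_t child, pid_t old_parent)
{
    for (size_t i = 0; i < njobs; i++) {
        if (jobs[i].pid == old_parent) {
            jobs[i].pid = child;
            return &jobs[i];
        }
    }
    return NULL;
}

/* Map a PID reported by wait() back to its job, or NULL if unknown. */
struct job *job_find(struct job *jobs, size_t njobs, pid_t pid)
{
    for (size_t i = 0; i < njobs; i++)
        if (jobs[i].pid == pid)
            return &jobs[i];
    return NULL;
}
```

With two apache2 jobs started as PIDs 100 and 200, a double fork of the
first one would arrive as job_adopt(101, 100) followed by
job_adopt(102, 101), after which a later death of PID 102 can be matched
to the right job.
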
>
>
> And thus we have simple, race-free supervision of daemon processes by an
> init daemon.
>
>
> Scott

This is a good idea, as we discussed at UDS.  I have a few problems with
the implementation:

1. Adding anything to task_struct costs memory in every task on the
   system, and there are a lot of them, yet this field is only useful to
   a small number of processes.

2. Signals are ugly at many levels.  The handler is heavy, asynchronous,
   and carries almost no information.  Overlaying extra data on siginfo
   only goes so far.

3. Hijacking an RT signal interferes with glibc.  The pthreads code has
   already claimed some of them for itself, btw.

4. You only get one event.

5. You can't tell if the kernel has the patch or not.

I suggest a combination of netlink and a smaller change to task_struct.
Instead of adding adopt_signal, use an unused bit, namely:

	unsigned did_exec:1;
+	unsigned init_watch:1;  /* init is keeping an eye on me */

This does not increase task_struct size.  Add a /proc/<pid>/init_watch
file to control this.  1 == trace, 0 == don't with a default of 1.  The
state of this boolean is propagated to children on clone.  Once 
upstart gets tired of seeing stuff from a daemon, it can simply
write a "0" to the /proc file and kill all future chatter.  You can
even be pre-emptive on services you daemon and/or user logins.
Just be careful about RCU... There are no races here because
being late with a "0" only results in extra msgs.
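
To illustrate the size argument, here is a toy comparison.  These structs
are stand-ins invented for the example, not the real task_struct layout;
the point is only that a second one-bit field shares the storage unit of
the first, whereas a separate int member needs room of its own.

```c
#include <assert.h>

/* Stand-in for the relevant corner of task_struct (NOT the real thing). */
struct task_before {
    int flags;
    unsigned did_exec:1;
};

/* Same struct with the suggested extra bit. */
struct task_after {
    int flags;
    unsigned did_exec:1;
    unsigned init_watch:1;   /* packed next to did_exec */
};

/* Same struct with a whole int added instead, as a signal number would
 * need. */
struct task_with_int {
    int flags;
    unsigned did_exec:1;
    int adopt_signal;
};

/* Adjacent bit-fields that fit are packed into the same unit, so the
 * extra bit is free. */
_Static_assert(sizeof(struct task_after) == sizeof(struct task_before),
               "the extra bit should not grow the struct");
```

On common ABIs sizeof(struct task_with_int) comes out larger than the
other two, although in a real struct padding can sometimes hide the
difference.
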

Use netlink to send messages to upstart on any transition you want.
You can now stuff anything you want including the whole fork/exec/setsid
chain.  Netlink has the advantage of asynchronous notify but synchronous
reception.  You also don't lose events where you could with back-to-back
signals (unless you constructed some threadsafe queue).  Netlink may be 
out of fashion with some folks but it would not be as bad as an RT signal
overlay and task_struct bloat.  You also have the advantage of either
restricting the netlink to only send to pid 1 or allow other procs to listen
in as well for debugging/auditing.

As for detecting whether the kernel is patched or not, simply open and
read /proc/self/init_watch.  If open() fails with ENOENT, the patch is
not in this kernel and you fall back to ptrace.  If it is there and != 0,
you are golden.  You could even have the default == 0 and have upstart
set it if it is new enough to know about netlink.  Otherwise, there is
next to no overhead, just a test such as
"if (unlikely(current->init_watch)) ..."
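
A userspace probe along those lines could be as small as the sketch
below.  The /proc file is part of this suggestion and does not exist in
any current kernel, so the path is passed in; probe_init_watch() and the
enum are names made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum watch_state {
    WATCH_UNSUPPORTED,   /* file missing: kernel lacks the patch */
    WATCH_OFF,           /* file present and reads as 0 */
    WATCH_ON,            /* file present and non-zero */
    WATCH_ERROR          /* anything else went wrong */
};

/* Probe a control file such as the suggested /proc/self/init_watch. */
enum watch_state probe_init_watch(const char *path)
{
    char buf[8];
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return errno == ENOENT ? WATCH_UNSUPPORTED : WATCH_ERROR;

    ssize_t n = read(fd, buf, sizeof buf - 1);
    close(fd);
    if (n <= 0)
        return WATCH_ERROR;

    return buf[0] == '0' ? WATCH_OFF : WATCH_ON;
}
```

An init daemon would call this once at startup and choose between the
netlink path and the ptrace fallback based on the result.
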

Jim
-- 
Jim Lieb
Ubuntu Kernel Team
Canonical Ltd.