[RFC] [PATCH] notify init daemon when children are reparented to it

Andy Whitcroft apw at canonical.com
Tue Dec 16 17:04:20 UTC 2008


On Tue, Dec 16, 2008 at 04:27:26PM +0000, Scott James Remnant wrote:
> Please review and comment on the attached patch.
> 
> Background (UNIX 101):
> 
> All processes must have a parent.  When a child dies, the parent is
> notified by SIGCHLD and must use the wait() system call to reap the
> remaining zombie.
> 
> When a process dies, its children are reparented to the init daemon so
> that there's always a process to be notified of their eventual death.
> 
> (The init daemon cannot die.)
> 
> 
> As well as a parent, processes also have a process group and a session.
> This is quite complicated, so much so that it takes up an entire chapter
> of Stevens which few people claim to have read, let alone understood.
> 
> It all comes down to connecting the life of a process to a life of a
> terminal.  Daemons don't want to be such connected, so they perform a
> little dance:
> 
>  - they fork(), creating a child
>  - the original process (child of the shell) exits, the child carries on
>    but is now reparented to init
>  - the process calls setsid() to change to a new session and process
>    group, it's now completely unconnected from the shell or terminal
>  - *but* due to a quirk of POSIX, if it were to be made to open() a tty
>    device, it would end up owning it!  FAIL.
>  - so the process fork()s again, creating a new child
>  - the process (child of the child of the shell) exits, the new child
>    carries on and is reparented to init
> 
> Thus the daemon is a child of init, and in its own process group and
> session which is not connected to any shell or terminal.  Win.
> 
> 
> Well, almost a win.  The trouble is that this dance also happens to
> completely disconnect it from any kind of process supervisor.
> 
> It wouldn't be so bad, except that most well-written daemons don't
> actually daemonise until after they've finished initialisation - they're
> even usually listening on the right socket and everything.  The
> daemonisation is more than just an escape from the shell and terminal,
> it's notification that they are ready.
> 
> We want to be able to supervise daemons.
> 
> 
> Init has a head-start; it's the eventual parent of daemon processes
> anyway, so it will be notified of their death by SIGCHLD and receives
> their exit status information through wait().
> 
> So you can't escape from init.  But this isn't ideal, while init can see
> the process death, it has no idea what process that was, and what it was
> supposed to do about it.
> 
> If there's two apache2 daemons running (in different chroots, or for
> different IPs or ports?), it doesn't know which of the two died because
> the PID that died is unknown to it.
> 
> Likewise it can't provide status information as to whether either is
> running or not, since the only PIDs it knew exited immediately after it
> ran them.
> 
> 
> Why do it in the kernel?:
> 
> Frankly because this cannot be done in userspace without the kernel's
> help, or without modifying daemon code to behave differently (and
> incompatibly with other systems).
> 
> The closest I've come to a race-free way to do this so far is by having
> init ptrace() every process it runs so it can follow calls to fork() and
> exec().
> 
> People look at me strangely when they find out about that (plus it
> doesn't work so well).
> 
> 
> About the patch:
> 
> The patch adds a new PR_{GET,SET}_ADOPTSIG prctl, similar to the
> existing PR_{GET,SET}_PDEATHSIG control and with similar semantics.
> 
>  - When non-zero, the process will receive the given signal if another
>    process is reparented to it.
> 
>  - This signal has the pid of the reparented process in the si_pid field
>    of the siginfo_t.
> 
>  - The signal also has the pid of the *previous* parent process in the
>    si_status field.
> 
>  - Notification is disabled after exec() or setsid().
> 
> 
> The functionality only affects the init daemon, and only if the init
> daemon activates the prctl().  [There is already other init-daemon
> specific code in the kernel, and there are already other specialist
> signals activated by prctl() - so this is consistent].
> 
> Since the siginfo_t contains useful information, the signal should
> generally be >= SIGRTMIN; otherwise only the information from the first
> will be received.
> 
> 
> From userspace:
> 
> The init daemon requests notification of process adoption by realtime
> signal, and then assumedly uses sigaction or signalfd to read the
> siginfo_t structures.

Do these need to be any specific signal?  I thought that sigaction
allowed you to specify queueing semantics for any signal.

> It tracks the pid of any process it spawns.
> 
> Should that process die, it will receive SIGCHLD.  However also pending
> will be the requested signal (SIGRTMIN?) with si_status set to the pid
> of the original SIGCHLD.

Which order do these arrive in?  If both are posted nearly at once we
may take them in a specific order defined by the kernel.  How do you
cope if they are the unhelpful way round?  We don't want to restart a
parent which was really just a setsid proxy shell.

> The init daemon reaps the original child, and updates its pid to that of
> the new child obtained from the si_pid of the requested signal.

So init spawns a process and remembers its pid.  Later that process
again forks and exits.  At this point we get an adoption notice for
its child and a death for the parent.  The adoption notice indicates the
pid we knew before so we can replace the first pid by the new pid of the
child.  That child will then do the same thing and we again will follow
the new pid.  Ok so far so good.  We have the new real top-level pid.
How do we know that _this_ one is really the top level for the daemon?
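
To make the bookkeeping concrete, here is a rough sketch of the pid
chasing described above, under the assumption that each adoption notice
carries the new child's pid (si_pid) and the previous parent's pid
(si_status).  struct service and service_follow_adoption() are
hypothetical names, not taken from any init daemon.  Note that it only
follows the chain; it has no way of telling when the chain has ended.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical per-service record kept by the supervisor. */
struct service {
	pid_t pid;	/* pid currently believed to be the service */
	int hops;	/* number of adoptions followed so far */
};

/*
 * Handle one adoption notice: `child` has been reparented to us and
 * `prev_parent` was its parent beforehand.  If the previous parent is
 * the pid we were tracking, follow the chain to the child.  Returns
 * true if the record was updated, false if the notice belongs to
 * some other process.
 */
bool service_follow_adoption(struct service *svc, pid_t child,
			     pid_t prev_parent)
{
	if (svc->pid != prev_parent)
		return false;

	svc->pid = child;
	svc->hops++;
	return true;
}
```

A daemon doing the usual double fork would leave hops at 2, but nothing
in the record itself distinguishes that from a daemon which forks a
third time.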

> And thus we have simple, race-free supervision of daemon processes by an
> init daemon.

I think we need to ensure that this mechanism is sufficient for us to
know which of these exit/adopts represents the real service parent.
By which I mean we are receiving pairs of SIGCHLD and SIGADOPT [sic].
Now suppose we go through the normal double fork dance, then later start
a child for something and then die.  How does init know that this is a
parent failure rather than a special initialisation dance which has three
forks in it?  I am worried you would have to assume the normal double
fork dance and be vulnerable to any daemon which works a different way.
Do you have a clever way to know?

Comments on the code itself inline.

> -- 
> Scott James Remnant
> scott at canonical.com

> diff --git a/fs/exec.c b/fs/exec.c
> index c5f1a92..07a8782 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1011,6 +1011,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  		suid_keys(current);
>  		set_dumpable(current->mm, suid_dumpable);
>  		current->pdeath_signal = 0;
> +		current->adopt_signal = 0;
>  	} else if (file_permission(bprm->file, MAY_READ) ||
>  			(bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)) {
>  		suid_keys(current);
> @@ -1099,6 +1100,7 @@ void compute_creds(struct linux_binprm *bprm)
>  	if (bprm->e_uid != current->uid) {
>  		suid_keys(current);
>  		current->pdeath_signal = 0;
> +		current->adopt_signal = 0;
>  	}
>  	exec_keys(current);
>  
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index 48d887e..1fa1b75 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -85,4 +85,8 @@
>  #define PR_SET_TIMERSLACK 29
>  #define PR_GET_TIMERSLACK 30
>  
> +/* Set/get notification of adoption by signal */
> +#define PR_SET_ADOPTSIG 31  /* Second arg is a signal */
> +#define PR_GET_ADOPTSIG 32  /* Second arg is a ptr to return the signal */
> +

Are we considering this for upstream or something specific to Ubuntu
kernels?  We are likely to be offering generic kernels for install,
mostly to help debugging problems.

>  #endif /* _LINUX_PRCTL_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 55e30d1..bcd2af3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1133,6 +1133,7 @@ struct task_struct {
>  	int exit_state;
>  	int exit_code, exit_signal;
>  	int pdeath_signal;  /*  The signal sent when the parent dies  */
> +	int adopt_signal;  /*  The signal sent when a process is reparented  */
>  	/* ??? */
>  	unsigned int personality;
>  	unsigned did_exec:1;
> @@ -1829,6 +1830,7 @@ extern int kill_pgrp(struct pid *pid, int sig, int priv);
>  extern int kill_pid(struct pid *pid, int sig, int priv);
>  extern int kill_proc_info(int, struct siginfo *, pid_t);
>  extern int do_notify_parent(struct task_struct *, int);
> +extern void do_notify_parent_adopted(struct task_struct *, struct task_struct *);
>  extern void force_sig(int, struct task_struct *);
>  extern void force_sig_specific(int, struct task_struct *);
>  extern int send_sig(int, struct task_struct *, int);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 2d8be7e..813a232 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -813,6 +813,9 @@ static void reparent_thread(struct task_struct *p, struct task_struct *father)
>  		/* We already hold the tasklist_lock here.  */
>  		group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
>  
> +	if (p->real_parent->adopt_signal)
> +		do_notify_parent_adopted(p, father);
> +
>  	list_move_tail(&p->sibling, &p->real_parent->children);
>  
>  	/* If this is a threaded reparent there is no need to
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 4530fc6..40228e2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1474,6 +1474,43 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
>  	spin_unlock_irqrestore(&sighand->siglock, flags);
>  }
>  
> +/* Let init know that it has adopted a new child */
> +void do_notify_parent_adopted(struct task_struct *tsk, struct task_struct *father)
> +{
> +	struct siginfo info;
> +	unsigned long flags;
> +	struct task_struct *reaper;
> +	struct sighand_struct *sighand;
> +	int ret;
> +
> +	reaper = tsk->real_parent;
> +
> +	memset (&info, 0, sizeof info);
> +	info.si_signo = reaper->adopt_signal;
> +	/*
> +	 * set code to the same range as SIGCHLD so the right bits of
> +	 * siginfo_t get copied, to userspace this will appear as si_code=0
> +	 */
> +	info.si_code = __SI_CHLD;
> +	/*
> +	 * see comment in do_notify_parent() about the following 4 lines
> +	 */
> +	rcu_read_lock();
> +	info.si_pid = task_pid_nr_ns(tsk, reaper->nsproxy->pid_ns);
> +	info.si_status = task_pid_nr_ns(father, reaper->nsproxy->pid_ns);

Is there any guarantee that this second pid fits into the si_status
entry here?  Certainly they are not the same type right now:

                        pid_t _pid;             /* which child */
                        int _status;            /* exit code */

To do this 'right' we are probably forced to make a new entry in the
siginfo union for this type of info.
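
As a rough illustration of the assumption the patch is leaning on, the
following sketch checks at compile time that pid_t is no wider than the
int used for si_status, and at run time that a pid survives the round
trip.  pid_survives_status() is a hypothetical helper for illustration
only.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Refuse to build on a platform where pid_t is wider than int. */
_Static_assert(sizeof(pid_t) <= sizeof(int),
	       "pid_t does not fit in an int-sized si_status");

/*
 * Store a pid in an int, as the patch does with si_status, and report
 * whether converting it back yields the original value.
 */
bool pid_survives_status(pid_t pid)
{
	int status = (int)pid;

	return (pid_t)status == pid;
}
```

This holds where pid_t is a plain int, but it is a property of the
platform rather than something the siginfo layout promises.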

> +	rcu_read_unlock();
> +
> +	info.si_uid = tsk->uid;
> +
> +	info.si_utime = cputime_to_clock_t(tsk->utime);
> +	info.si_stime = cputime_to_clock_t(tsk->stime);
> +
> +	sighand = reaper->sighand;
> +	spin_lock_irqsave(&sighand->siglock, flags);
> +	__group_send_sig_info(reaper->adopt_signal, &info, reaper);
> +	spin_unlock_irqrestore(&sighand->siglock, flags);
> +}
> +
>  static inline int may_ptrace_stop(void)
>  {
>  	if (!likely(current->ptrace & PT_PTRACED))
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 31deba8..1720053 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1726,6 +1726,16 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			else
>  				current->timer_slack_ns = arg2;
>  			break;
> +		case PR_SET_ADOPTSIG:
> +			if (!valid_signal(arg2)) {
> +				error = -EINVAL;
> +				break;
> +			}
> +			current->adopt_signal = arg2;
> +			break;
> +		case PR_GET_ADOPTSIG:
> +			error = put_user(current->adopt_signal, (int __user *)arg2);
> +			break;
>  		default:
>  			error = -EINVAL;
>  			break;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 6cbec11..a2da3ab 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -365,6 +365,7 @@ void cap_bprm_apply_creds (struct linux_binprm *bprm, int unsafe)
>  			  current->cap_permitted)) {
>  		set_dumpable(current->mm, suid_dumpable);
>  		current->pdeath_signal = 0;
> +		current->adopt_signal = 0;
>  
>  		if (unsafe & ~LSM_UNSAFE_PTRACE_CAP) {
>  			if (!capable(CAP_SETUID)) {
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 75777cb..8f089c8 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2280,8 +2280,10 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
>  		spin_unlock_irq(&current->sighand->siglock);
>  	}
>  
> -	/* Always clear parent death signal on SID transitions. */
> +	/* Always clear parent death signal and adoption notification
> +	 * on SID transitions. */
>  	current->pdeath_signal = 0;
> +	current->adopt_signal = 0;
>  
>  	/* Check whether the new SID can inherit resource limits
>  	   from the old SID.  If not, reset all soft limits to

-apw



