[Acked] [berrange at redhat.com: [PATCH] Forbid invocation of kexec_load() outside initial PID namespace]

Andy Whitcroft apw at canonical.com
Tue Aug 7 17:03:40 UTC 2012


On Tue, Aug 07, 2012 at 11:04:37AM -0400, Stéphane Graber wrote:
> On 08/07/2012 11:01 AM, Serge E. Hallyn wrote:
> > (Hopefully my unsubscribed account can email kernel-team)
> > 
> > Hi,
> > 
> > this patch will probably not hit upstream, because the 'proper' fix is
> > user namespaces.  User namespaces however won't be ready until after
> > quantal.  So I'd like this patch to be applied in precise and quantal
> > if possible.
> > 
> > Problem:
> > 
> > Containers are granted CAP_SYS_BOOT.  The reboot path in the kernel checks
> > whether you are in the initial pidns, and, if not, sends a signal to your
> > parent indicating you were 'rebooted' or 'shut down'.  So there is no
> > danger of a container rebooting the host.
> > 
> > However, CAP_SYS_BOOT also authorized kexec, without the pidns check.
> > Therefore, containers are able to kexec a new kernel, which is obviously
> > a bad thing.
> > 
> > This patch prevents that by only allowing kexec from the initial pid
> > namespace.  It is nacked by Eric Biederman (but acked by me) because
> > he feels this should be stopped by having the container in a private
> > user namespace, with the kexec cap_sys_boot check targeted to the initial
> > user namespace.  As I said, that won't be doable during quantal timeframe.
> > 
> > thanks,
> > -serge
> 
> Sounds reasonable to include that one as we've been including Daniel
> Lezcano's reboot patch in 12.04, that's just making things consistent
> and fixing a pretty serious security hole in our usually fairly secure
> defaults for containers in 12.04 and 12.10.

What an utter utter mess namespaces are.

> > ----- Forwarded message from "Daniel P. Berrange" <berrange at redhat.com> -----
> > 
> > Date: Fri,  3 Aug 2012 11:53:04 +0100
> > From: "Daniel P. Berrange" <berrange at redhat.com>
> > To: linux-kernel at vger.kernel.org
> > Cc: containers at lists.linux-foundation.org,
> > 	"Daniel P. Berrange" <berrange at redhat.com>,
> > 	Serge Hallyn <serge.hallyn at canonical.com>,
> > 	Daniel Lezcano <daniel.lezcano at free.fr>,
> > 	Michael Kerrisk <mtk.manpages at gmail.com>,
> > 	"Eric W. Biederman" <ebiederm at xmission.com>,
> > 	Tejun Heo <tj at kernel.org>, Oleg Nesterov <oleg at redhat.com>
> > Subject: [PATCH] Forbid invocation of kexec_load() outside initial PID namespace
> > 
> > From: "Daniel P. Berrange" <berrange at redhat.com>
> > 
> > The following commit
> > 
> >     commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
> >     Author: Daniel Lezcano <daniel.lezcano at free.fr>
> >     Date:   Wed Mar 28 14:42:51 2012 -0700
> > 
> >     pidns: add reboot_pid_ns() to handle the reboot syscall
> > 
> > introduced custom handling of the reboot() syscall when invoked
> > from a non-initial PID namespace. The intent was that a process
> > in a container can be allowed to keep CAP_SYS_BOOT and execute
> > reboot() to shutdown/reboot just their private container, rather
> > than the host.
> > 
> > Unfortunately the kexec_load() syscall also relies on the
> > CAP_SYS_BOOT capability. So by allowing a container to keep
> > this capability to safely invoke reboot(), they mistakenly
> > also gain the ability to use kexec_load(). The solution is
> > to make kexec_load() return -EPERM if invoked from a PID
> > namespace that is not the initial namespace
> > 
> > Signed-off-by: Daniel P. Berrange <berrange at redhat.com>
> > Cc: Serge Hallyn <serge.hallyn at canonical.com>
> > Cc: Daniel Lezcano <daniel.lezcano at free.fr>
> > Cc: Michael Kerrisk <mtk.manpages at gmail.com>
> > Cc: "Eric W. Biederman" <ebiederm at xmission.com>
> > Cc: Tejun Heo <tj at kernel.org>
> > Cc: Oleg Nesterov <oleg at redhat.com>
> > ---
> >  kernel/kexec.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index 0668d58..b152bde 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -947,6 +947,11 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
> >  	if (!capable(CAP_SYS_BOOT))
> >  		return -EPERM;
> >  
> > +	/* Processes in containers must not be allowed to load a new
> > +	 * kernel, even if they have CAP_SYS_BOOT */
> > +	if (task_active_pid_ns(current) != &init_pid_ns)
> > +		return -EPERM;
> > +
> >  	/*
> >  	 * Verify we have a legal set of flags
> >  	 * This leaves us room for future extensions.
> > 

Patch looks like a reasonable tightening of the rules given our other
patches.

Acked-by: Andy Whitcroft <apw at canonical.com>

-apw




More information about the kernel-team mailing list