[RFC] Syntax Proposal for Seccomp Filters in Upstart

Fri Dec 14 23:34:38 UTC 2012

>> [...]
>> I'm thinking about the following syntax:
>>
>> seccomp filter
>>     : "seccomp-filter" WS [ '~' ] seccomp_rules;
>>
>> seccomp_rules
>>     : seccomp_rule ( ',' seccomp_rule )*;
>>
>> seccomp_rule
>>     : systemcall ( ':' policy )?;
>>
>> policy
>>     : "allow"
>>     | "errno" ( '(' errno ')' )?
>>     | "kill"
>>     | "trace"
>>     | "trap" ( '(' errno ')' )?
>>     ;
>>
>> errno
>>     : NUMBER
>>     | errno_identifier
>>     ;
> I think we should avoid allowing a literal here as it reduces portability if
> jobs get copied between different platforms. I appreciate that 'kill signal'
> also allows a numeric, but that is not documented and should probably be
> deprecated for the same reason.

Actually, only the errno and trap policies accept a number or an errno
identifier according to my syntax description. Do you mean you want to
avoid using e.g. EACCES and only allow an integer value? That seems less
portable to me... A pair of Makefile rules I'd like to add will
automatically generate a list of errno names and a perfect hash table
for them using gperf. I'm using the same approach for syscalls, inspired
by systemd.

> It's worth noting that systemd's syntax does not use commas to separate rules
> and if we adopt such a delimiter it would be the first Upstart stanza to permit
> a comma.

My mistake, I meant spaces. I copied the commas from the guardian
syntax; guardian is a command-line utility and therefore uses a
comma-separated list. As mentioned, I'd like to stay close to systemd's
syntax, and that implies a space-separated list.
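
To make the space-separated form concrete, here is a minimal sketch of
how one stanza line could be split into rules. The struct and function
names (seccomp_rule, parse_rule, parse_rules) and the buffer sizes are
hypothetical and not taken from Upstart or guardian; the only thing
borrowed from the proposal is the grammar itself, with spaces instead of
commas between rules.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory form of one rule; not an existing Upstart type. */
struct seccomp_rule {
    char syscall[32];
    char policy[16];  /* "allow", "errno", "kill", "trace", "trap"; empty if omitted */
    char arg[16];     /* e.g. "EACCES"; empty if omitted */
};

/* Copy len bytes into dst as a string; reject empty or oversized spans. */
static int copy_span(char *dst, size_t cap, const char *src, size_t len)
{
    if (len == 0 || len >= cap)
        return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}

static int is_policy(const char *p)
{
    static const char *const known[] = { "allow", "errno", "kill", "trace", "trap" };
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++)
        if (strcmp(p, known[i]) == 0)
            return 1;
    return 0;
}

/* Parse one token: "open", "open:kill" or "open:errno(EACCES)". */
static int parse_rule(const char *token, struct seccomp_rule *rule)
{
    assert(token != NULL && rule != NULL);
    memset(rule, 0, sizeof *rule);

    const char *colon = strchr(token, ':');
    if (colon == NULL)
        return copy_span(rule->syscall, sizeof rule->syscall, token, strlen(token));
    if (copy_span(rule->syscall, sizeof rule->syscall, token, (size_t)(colon - token)) < 0)
        return -1;

    const char *policy = colon + 1;
    const char *open = strchr(policy, '(');
    size_t plen = open ? (size_t)(open - policy) : strlen(policy);
    if (copy_span(rule->policy, sizeof rule->policy, policy, plen) < 0 ||
        !is_policy(rule->policy))
        return -1;
    if (open == NULL)
        return 0;

    /* Only "errno" and "trap" take an argument in the proposed grammar. */
    if (strcmp(rule->policy, "errno") != 0 && strcmp(rule->policy, "trap") != 0)
        return -1;
    const char *close = strchr(open, ')');
    if (close == NULL || close[1] != '\0')
        return -1;
    return copy_span(rule->arg, sizeof rule->arg, open + 1, (size_t)(close - open - 1));
}

/* Split a stanza argument on spaces/tabs. Returns the rule count or -1. */
static int parse_rules(const char *line, int *inverted,
                       struct seccomp_rule *rules, size_t max_rules)
{
    size_t count = 0;

    line += strspn(line, " \t");
    *inverted = (*line == '~');
    if (*inverted)
        line++;

    for (;;) {
        char token[64];
        line += strspn(line, " \t");
        if (*line == '\0')
            break;
        size_t len = strcspn(line, " \t");
        if (count == max_rules || len >= sizeof token)
            return -1;
        memcpy(token, line, len);
        token[len] = '\0';
        if (parse_rule(token, &rules[count]) < 0)
            return -1;
        count++;
        line += len;
    }
    return (int)count;
}
```

With this sketch, "~ setuid:allow open:errno(EACCES) ptrace" yields three
rules with the "~" flag set, and a token with an unbalanced parenthesis is
rejected.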

> I wonder if it would make sense for the seccomp handling code itself to be put
> into a library since, as I understand it if all the code was in Upstart, that
> would necessitate putting out new Upstart releases just to add support for a new
> Linux system call?

Hmm yes, systemd will have the same problem. Systemd's problem is even
bigger in that case: it uses a bitfield for all system calls it is
aware of and either kills or allows these system calls. Any unknown
system call will be killed implicitly, and that cannot be prevented
unless a newer systemd is installed. That's not what I would do: a
ruleset starting with "~" shall always allow unknown syscalls, just
as the syntax suggests (a catch-all "allow" policy). Of course, besides
a syscall name, an integer could be supported as well for the syscall
part, so that there's at least a workaround.

But do note that such a new syscall would first have to actually be
*used* by an (updated) application before it would make sense to
update upstart to enforce a policy on that new syscall...

> If so, could guardian be changed in this respect? Or is
> libseccomp a better fit (it's also already in the Debian+Ubuntu archives).

I wrote guardian to get accustomed to seccomp mode 2 and to try out
some ideas I had. Combined with strace, it is great for testing policy
rules for different applications.
I did not intend Upstart to use guardian itself, though; I only intend
to reuse (most of) the "install_seccomp_filter" code.

Generating a BPF filter is so trivial that I didn't bother to actually
look at libseccomp for this purpose. I have looked at it recently, but
it didn't offer any features I missed. The only good reason would be
that it can be updated separately from upstart...
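
For reference, a minimal sketch of what generating such a filter looks
like. This is not guardian's "install_seccomp_filter" code; the names
(struct rule, build_filter, default_action, errno_action) are made up for
illustration, and it is an assumption on my part that a ruleset without
"~" would default to "kill". Only the "~" case (catch-all "allow") comes
from the proposal. The sketch only builds the program; installing it
would go through prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) with a
struct sock_fprog pointing at the array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical resolved rule: syscall number plus SECCOMP_RET_* action. */
struct rule {
    int nr;
    unsigned int action;
};

/* Assumption: "~" means catch-all allow, anything else catch-all kill. */
static unsigned int default_action(int inverted)
{
    return inverted ? SECCOMP_RET_ALLOW : SECCOMP_RET_KILL;
}

/* The errno value travels in the low 16 bits of the return value. */
static unsigned int errno_action(int err)
{
    return SECCOMP_RET_ERRNO | ((unsigned int)err & SECCOMP_RET_DATA);
}

/* Fill prog[] with a classic BPF program; returns the instruction count,
 * or 0 if prog[] is too small. Layout: 4 header instructions, 2 per rule,
 * 1 trailing default. */
static size_t build_filter(struct sock_filter *prog, size_t cap,
                           unsigned int arch,
                           const struct rule *rules, size_t n,
                           unsigned int fallback)
{
    size_t i = 0;

    assert(prog != NULL);
    if (cap < 4 + 2 * n + 1)
        return 0;

    /* Refuse to interpret syscall numbers of a foreign architecture. */
    prog[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                             offsetof(struct seccomp_data, arch));
    prog[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0);
    prog[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);

    /* Load the syscall number once, then compare it against each rule. */
    prog[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                             offsetof(struct seccomp_data, nr));
    for (size_t r = 0; r < n; r++) {
        /* On a match fall through to the RET, otherwise skip over it. */
        prog[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                                 (unsigned int)rules[r].nr, 0, 1);
        prog[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, rules[r].action);
    }

    /* Anything not listed, including syscalls unknown at build time. */
    prog[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, fallback);
    return i;
}
```

Because the last instruction is the catch-all, a syscall that did not
exist when the filter was built simply falls through to it, which is the
behaviour described above for rulesets starting with "~".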

> Other thoughts:
>
> - What happens in a scenario where multiple system calls have been specified,
> but only a subset are available on the platform the job is running on?

At compile time, a list of syscalls and a list of errno names for the
target platform are generated, so everything that is available at that
time for that platform should then be available. Listing a syscall
unknown to Upstart should trigger a warning and be ignored.
Syscall numbers that don't actually have a syscall connected to them
could be inserted in the seccomp ruleset, but such a rule will never be
triggered.

This is how I generate these lists at compile time for guardian (I'd
like to do something similar for Upstart):
----
errno-list.txt:
	$(CC) -E -dM -include errno.h -xc /dev/null | $(AWK) '/^#define[ \t]+E[^ \t]+[ \t]+/ { print $$2; }' > $@ || rm $@

syscall-list.txt:
	$(CC) -E -dM -include sys/syscall.h -xc /dev/null | $(AWK) '/^#define[ \t]+__NR_[^ ]+[ \t]+/ { sub(/__NR_/, "", $$2); print $$2; }' > $@ || rm $@

errno-from-name.gperf: errno-list.txt
	$(AWK) 'BEGIN{ print "struct errno_name { const char* name; int id; };"; print "%null-strings"; print "%%";} { printf "%s, %s\n", $$1, $$1 }' < $< > $@

syscall-from-name.gperf: syscall-list.txt
	$(AWK) 'BEGIN{ print "struct syscall_name { const char* name; int id; };"; print "%null-strings"; print "%%";} { printf "%s, __NR_%s\n", $$1, $$1 }' < $< > $@

errno-from-name.h: errno-from-name.gperf
	$(GPERF) -L ANSI-C -t -N lookup_errno -H errno_syscall_name -C -E < $< > $@

syscall-from-name.h: syscall-from-name.gperf
	$(GPERF) -L ANSI-C -t -N lookup_syscall -H hash_syscall_name -C -E < $< > $@

----
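
On the consuming side, the generated headers would be used roughly as
sketched below. To keep the sketch self-contained, lookup_errno() here is
a hand-written stand-in with three entries and a linear search, not gperf
output; the real function would come from errno-from-name.h.
resolve_errno() is a hypothetical helper that accepts either form the
grammar allows (NUMBER or errno identifier), and the upper bound of 4095
is my assumption about the largest errno value worth accepting.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the struct emitted into errno-from-name.gperf. */
struct errno_name {
    const char *name;
    int id;
};

/* Stand-in for the gperf-generated lookup_errno(); illustrative only. */
static const struct errno_name *lookup_errno(const char *str, size_t len)
{
    static const struct errno_name table[] = {
        { "EACCES", EACCES },
        { "ENOSYS", ENOSYS },
        { "EPERM",  EPERM  },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strlen(table[i].name) == len && memcmp(table[i].name, str, len) == 0)
            return &table[i];
    return NULL;
}

/* Resolve "EACCES" or "13" to an errno value; returns -1 if unusable. */
static int resolve_errno(const char *text)
{
    assert(text != NULL);

    const struct errno_name *entry = lookup_errno(text, strlen(text));
    if (entry != NULL)
        return entry->id;

    /* Fall back to a plain decimal number. */
    char *end;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || value <= 0 || value > 4095)
        return -1;
    return (int)value;
}
```

The syscall side would look the same with lookup_syscall() and
struct syscall_name; a name that is missing from the table is the case
where a warning would be logged and the rule ignored.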
> Presumably, the only safe option would be to fail to start the job. If so, we'd
> need to find a way to notify the admin as to why the job is not starting (which
> could just be to document that when first developing/testing a new job using the
> seccomp-filter, ensure Upstart is in debug mode).
>
> - What if the syscall is known, but cannot be filtered on by the currently
> running kernel? (running back-level kernel with newer libc / seccomp libs, or
> running in a chroot environment)

I'm not sure I understand your question correctly, but a syscall is
invoked via a syscall number (__NR_<syscall>). Every known syscall has
been assigned a number. (For x86, the syscall number is placed in
%eax and then "int 0x80" is issued; that is how the syscall
interface works...) If that number is unknown to the kernel, the
kernel will react to that in some way, with or without seccomp (I
guess by setting errno to ENOSYS and returning -1). Either way, the
seccomp BPF rules will just compare the syscall number to the numbers
in the ruleset and act accordingly... Does that answer your question,
or could you rephrase it...?
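
A quick way to check the guess about unknown numbers is to issue a
syscall with a number the kernel has no handler for. The sketch below
does that through the libc syscall() wrapper; probe_syscall() and the
number 100000 are arbitrary choices for illustration. I would expect -1
with errno set to ENOSYS on an unfiltered kernel, but a sandbox or an
already installed seccomp filter may report something else, so the
sketch records errno rather than assuming it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke syscall number nr without arguments and capture errno.
 * Returns the wrapper's return value (-1 on failure). */
static long probe_syscall(long nr, int *saved_errno)
{
    assert(saved_errno != NULL);

    errno = 0;
    long rc = syscall(nr);
    *saved_errno = errno;
    return rc;
}
```

Calling probe_syscall(100000, &err) and printing err with strerror()
shows how the running kernel reacts; a seccomp rule for that same number
would match before the kernel ever gets to that point.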

> - What would this do?
>
>  seccomp-filter ~ setuid:allow

setuid would be explicitly allowed, and all other calls implicitly.
(And for completeness, NO_NEW_PRIVS is possibly set depending on the
process-user and a no-new-privs stanza.)
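
A minimal sketch of that decision, under my own assumptions: the helper
names are hypothetical, "root" is treated as "has CAP_SYS_ADMIN", and the
rule is that a non-root job always gets NO_NEW_PRIVS (the kernel refuses
to install a filter otherwise) while for root it follows the stanza.

```c
#include <assert.h>
#include <sys/prctl.h>
#include <sys/types.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38  /* value from linux/prctl.h, for older headers */
#endif

/* Decide whether NO_NEW_PRIVS should be set before installing the filter.
 * stanza_enabled is the (hypothetical) boolean from a no-new-privs stanza. */
static int want_no_new_privs(uid_t uid, int stanza_enabled)
{
    return uid != 0 || stanza_enabled != 0;
}

/* Apply the decision in the child, before PR_SET_SECCOMP.
 * Returns 0 on success or when nothing had to be done, -1 on error. */
static int apply_no_new_privs(uid_t uid, int stanza_enabled)
{
    assert(stanza_enabled == 0 || stanza_enabled == 1);

    if (!want_no_new_privs(uid, stanza_enabled))
        return 0;
    /* Irreversible for this process and inherited by its children. */
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}
```

Since the flag cannot be cleared again, it would have to be set in the
forked child just before the filter is installed, never in Upstart
itself.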

BTW, I noticed my syntax is missing a way to control setting
NO_NEW_PRIVS (when running a job as root). But how should I define/parse
a boolean value for a no-new-privs stanza? Looking quickly through
parse_job.c, I couldn't find an existing example...

Kind regards,
David Gaarenstroom