[apparmor] [PATCH] parser - more regex unittests and fixes (was Re: [PATCH] [parsers] allow for nested alternations expressions)

Fri Dec 6 02:01:06 UTC 2013

On Thu, Nov 07, 2013 at 11:35:53AM -0800, John Johansen wrote:
> Good question. I'm not really sure. I know I don't want to support all of it
> but I would like to have more than we have. I think the set available in
> aare globbing should be smaller than the set we make available in the
> pcre syntax.
> 
> eg.
> @{var} is the variable expansion in aare, but for the pcre syntax I was
> considering using \@{var}

This is alright.

However, I think it'd feel more natural to use a named-group syntax to
look up the variables by name. (Not that I like the variety of named-group
syntaxes available -- it just feels like the languages already support a
named lookup of some sort that we can leverage.)
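
To make the \@{var} option concrete, here is a rough Python sketch of
what expanding it inside a pcre-style pattern might look like. The
variable table, the expand_vars name, and the choice to expand into a
non-capturing alternation are all my assumptions for illustration; none
of this is what the parser does today.

```python
import re

# Hypothetical variable table, standing in for "@{SRV} = /srv /var/srv".
VARIABLES = {"SRV": ["/srv", "/var/srv"]}

# Matches the proposed \@{name} reference inside a pcre-style pattern.
VAR_REF = re.compile(r"\\@\{(\w+)\}")

def expand_vars(pattern, variables=VARIABLES):
    # Replace each \@{name} with a non-capturing alternation of its values.
    def repl(match):
        values = variables[match.group(1)]  # KeyError for unknown names
        return "(?:" + "|".join(values) + ")"
    return VAR_REF.sub(repl, pattern)

expanded = expand_vars(r"\@{SRV}/www/.*")
print(expanded)
```

The sketch assumes the values are plain literals; values containing
regex metacharacters would need escaping first.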

> I know on the pcre side I want positive and negative lookaheads and
> back references (though it would not use the confusing \# syntax of
> pcre, but might use \g#). I'm not sure it makes sense to expose these to
> aare.

I'm not a fan of \g#. It isn't documented at all on either of these pages
from one of the best sources of multiple-language regex info:
http://www.regular-expressions.info/backref.html
http://www.regular-expressions.info/backref2.html
(I went there to kick-start my memory on which engine used \g# rather than
\#)

> I think it would be nice to support some of the posix character classes
> and maybe \d \D.

Yes.
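
One caveat on \d and \D: engines disagree on whether they mean [0-9] or
any Unicode decimal digit, so we would need to say which one we mean.
Python's re module is a handy illustration of the difference (this only
shows Python's behaviour, not a proposal for ours):

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# With str patterns, \d follows the Unicode decimal-digit category.
unicode_hit = re.fullmatch(r"\d", arabic_three) is not None

# re.ASCII restricts \d to [0-9].
ascii_hit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

print(unicode_hit, ascii_hit)
```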

> The big ones I want are a way to escape into pcre syntax and back to
> aare, and permission embedding, which saves a fair bit of duplication and
> extra state creation (and then removal) on the backend.
> Eg.
> for mount instead of having to provide 5 rules
> part1 <perm>
> part1\0part2 <perm>
> part1\0part2\0part3 <perm>
> part1\0part2\0part3\0part4 <perm>
> part1\0part2\0part3\0part4\0part5 <perm>
> 
> we could get away with encoding a single rule
> part1\<perm>\0part2\<perm>\0part3\<perm>\0part4\<perm>\0part5\<perm>

This seems plausible enough. Would it only make sense for mount?
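
To check that I'm reading the proposal correctly, here is a small
Python sketch of the equivalence between the two forms. The function
names are mine, and I'm assuming every part carries the same
permission; the real encoding might differ.

```python
def expand_embedded(parts, perm, sep="\\0"):
    # The long form: one (rule, perm) pair per prefix of the parts.
    rules = []
    for i in range(1, len(parts) + 1):
        rules.append((sep.join(parts[:i]), perm))
    return rules

def embed(parts, perm, sep="\\0"):
    # The single-rule form: every part is followed by \<perm>.
    return sep.join(f"{p}\\<{perm}>" for p in parts)

parts = ["part1", "part2", "part3"]
print(embed(parts, "perm"))
for rule, perm in expand_embedded(parts, "perm"):
    print(rule, f"<{perm}>")
```

With five parts, expand_embedded produces the five mount rules quoted
above and embed produces the single rule.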

> I think there are 2 questions to answer, what set should we provide
> for the pcre style syntax, and what subset for aare?
> 
> Below are some notes I have from the last time I was looking at it
> (not that they will really clear things up any)
> 
> ---
> 
> \@{variable}  variable reference
> \^	?start regex
> \$	?end regex (return to globbing)

Can we (ab)use \A and \Z here instead? Those match beginning and end of
strings, so they'd presumably never appear in a regex.
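
A rough Python sketch of how a rule could be split into globbing and
regex segments if \A and \Z were the delimiters. The split_modes name
and the segment labels are hypothetical, and the sketch ignores the
question of an escaped backslash directly before the A or Z.

```python
import re

# Hypothetical delimiters: a literal \A opens a regex segment and a
# literal \Z closes it and returns to globbing.
SEGMENT = re.compile(r"\\A(.*?)\\Z", re.DOTALL)

def split_modes(rule):
    segments = []
    pos = 0
    for m in SEGMENT.finditer(rule):
        if m.start() > pos:
            segments.append(("glob", rule[pos:m.start()]))
        segments.append(("regex", m.group(1)))
        pos = m.end()
    if pos < len(rule):
        segments.append(("glob", rule[pos:]))
    return segments

print(split_modes(r"/home/*/\A[0-9]+\Z.log"))
```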

> \#{perm}    ?embedded perm

This seems too easy to overlook when reviewing profiles; where would it be
useful?

> \-	?logical set operation minus?
> \&	?logical set operation and?
> 
> 
> see man pcrepattern
> 
> \	general escape character
> ^	assert start of string
> $	assert end of string

When would start and end be useful?

> .	match any char including newline
> []	character class
> [^]	negative character class
> [x-y]  range
> [[:xxx:]]	POSIX named set
> [[:^xxx:]]	negative POSIX named set

Unicode character classes might also be a nice addition. Maybe.

> ()	subpattern
> (?)	extended meaning for subpattern
> |	alternation
> ?	0 or 1 match, greedy, equiv to {0,1}
> +	1 or more, greedy, equiv to {1,}
> *	0 or more, greedy, equiv to {0,}
> {n}	min/max qualifier exactly n
> {,n}	min/max qualifier up to n
> {n,m}	min/max qualifier at least n, no more than m, greedy
> {n,}	min/max qualifier n or more, greedy

Will we want to expose lazy and possessive quantifiers too?
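
For anyone who hasn't used them, here is the greedy versus lazy
difference in Python's re module. I've left possessive quantifiers out
because Python only gained them in 3.11. Note that laziness changes
which substring is captured, not whether a whole-string match
succeeds, so it may matter less for us than it does for a search tool.

```python
import re

path = "/a/b/c/"

# Greedy: .* grabs as much as it can before the final slash.
greedy = re.match(r"/a.*/", path).group(0)

# Lazy: .*? grabs as little as it can.
lazy = re.match(r"/a.*?/", path).group(0)

print(greedy)
print(lazy)

# On a whole-string match both variants accept the same path.
both_accept = bool(re.fullmatch(r"/a.*/", path)) and bool(re.fullmatch(r"/a.*?/", path))
print(both_accept)
```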

> 
> \a	alarm - hex 07
> \e	escape - hex 1B
> \f	formfeed - hex 0C
> \n	newline - hex 0A
> \r	carriage return - hex 0D
> \t	tab - hex 09

Maybe we can drop bell, escape, ff, backspace, tab, vertical tab, and so
forth. It's the year 2000, people don't put those in filenames any more. :)
I know they come practically free in the implementation, but it just feels
so crufty to spend documentation on these oddballs.

> \ddd	octal code
> \xhh	hex code
> 
> \cx	control-x where x is any ascii character
> 
> .	any character including newline
> \b	backspace
> \d	decimal digit  [0-9]
> \D	not decimal digit [^0-9]
> \h	horizontal whitespace character
> \H	not horizontal whitespace character
> \N	not a newline
> \s	white space character
> \S	not a white space character
> \v	vertical whitespace character
> \V	not a vertical whitespace character

I'm pretty retrogrouchy but these seem a bit much :) hehehe

> \w	a "word" character
> \W	not a "word" character
> \l	lower case
> \L	
> \u
> \U	upper case
> \p	property
> \P	not Property
> \R	Unicode newline sequence

I love the Unicode here, though this does open a potential rat's nest:

How do we want patterns to be written?

- UTF-8 only?
- UTF-16? BE? LE?
- UCS-2? BE? LE?
- UCS-4? BE? LE?
- UTF-32? BE? LE?
- Do we want to normalize:
  - Filenames?
  - Regular Expressions?

Do we want to internalize it all to codepoints to allow a UTF-16 pattern
to match a UTF-8 filename? (I presume the kernel effectively rules out
16-bit and 32-bit encodings of pathnames from the very start, since "/"
cannot be expressed in those encodings without a 0x00 byte.)
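
The 0x00 point is easy to confirm in Python; this just encodes "/" in
each scheme and checks for a NUL byte:

```python
NUL = b"\x00"
CODECS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Map each codec to whether its encoding of "/" contains a NUL byte.
has_nul = {codec: NUL in "/".encode(codec) for codec in CODECS}

for codec, flag in has_nul.items():
    print(codec, "/".encode(codec), flag)
```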

The tables for upper-case and lower-case Unicode code points might be more
kernel memory than we want to monopolize.

We can't assume all filenames are valid Unicode; should we provide some
mechanism to require Unicode or some other encoding scheme? Since we've
been providing bytestream-oriented interfaces up until now, it's been
easy enough to ignore. But if we're going to provide more features like
this, some way of enforcing an encoding feels like a natural next step.
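
A small Python sketch of the two problems, assuming UTF-8 is the
encoding we would want to enforce. The is_valid_utf8 helper is
hypothetical; it only illustrates the kind of check I have in mind.

```python
import unicodedata

def is_valid_utf8(name: bytes) -> bool:
    # True if the raw filename bytes decode as strict UTF-8.
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("caf\u00e9".encode("utf-8")))  # valid UTF-8 bytes
print(is_valid_utf8(b"caf\xe9"))                   # Latin-1 byte, not UTF-8

# The same visible name has different bytes depending on normalization.
nfc = unicodedata.normalize("NFC", "cafe\u0301")
nfd = unicodedata.normalize("NFD", "caf\u00e9")
print(nfc.encode("utf-8"), nfd.encode("utf-8"))
```

A bytewise pattern written against one normalization form would not
match a filename stored in the other.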

> 
> (?= )	look ahead assertion
> (?! )	negative look ahead assertion
> (?<= )	look behind assertion
> (?<! )	negative look behind assertion
> (?(conditional)yes-pattern)
> (?(conditional)yes-pattern|no-pattern)

Sounds good.
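
As an example of where a negative lookahead would earn its keep, here
is a Python sketch of "everything under a home directory except .ssh".
The rule itself is hypothetical and uses Python's re semantics, which
may not be exactly what we end up with.

```python
import re

# Hypothetical rule: anything under /home/<user>/ except the .ssh
# directory and its contents.
RULE = re.compile(r"/home/[^/]+/(?!\.ssh(?:/|$)).*")

def allowed(path):
    return RULE.fullmatch(path) is not None

for p in ("/home/alice/notes.txt", "/home/alice/.ssh/id_rsa", "/home/alice/.sshrc"):
    print(p, allowed(p))
```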

> ({ } )	callout to fn
> 

Interesting; I didn't know you had this in your plans for world
domination. :) My encoding tests would fit naturally here.

> 
> \p and \P   reserved
> 
> 
> NOTE: \n can NOT be used as a back reference
> 
> \gn	back reference by number
> \g{n}	back reference by number
> \g{-n}	relative back reference by number
> \k<name>	 back reference by name
> \k'name'	 back reference by name
> \g{name}	 back reference by name
> \k{name}	 back reference by name (.Net)
> (?P=name)	 back reference by name (Python)

I think I missed the grouping that assigns the names.

It seems odd to offer five different syntaxes to refer to captures by
name. It's pleasant and kind to policy authors, but doubtless more than
one person will have to check the manpage to find out whether the
syntaxes differ in any way. Supporting both <> and '' is polite, though.
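
For reference, this is how the defining and referring halves pair up in
Python, which supports only the (?P<name>...) and (?P=name) spellings;
pcre additionally accepts (?<name>...) and (?'name'...) for the
defining side. The pattern below is a made-up example.

```python
import re

# (?P<dir>...) assigns the name; (?P=dir) refers back to what it captured.
SAME_DIR_TWICE = re.compile(r"/(?P<dir>[^/]+)/(?P=dir)/")

hit = SAME_DIR_TWICE.fullmatch("/usr/usr/")
miss = SAME_DIR_TWICE.fullmatch("/usr/lib/")

print(hit.group("dir") if hit else None)
print(miss)
```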

Thanks