[apparmor] [PATCH] parser - more regex unittests and fixes (was Re: [PATCH] [parsers] allow for nested alternations expressions)
Seth Arnold
seth.arnold at canonical.com
Fri Dec 6 02:01:06 UTC 2013
On Thu, Nov 07, 2013 at 11:35:53AM -0800, John Johansen wrote:
> Good question. I'm not really sure. I know I don't want to support all of it
> but I would like to have more than we have. I think the set available in
> aare globbing should be smaller than the set we make available in the
> pcre syntax.
>
> eg.
> @{var} is the variable expansion in aare, but for the pcre syntax I was
> considering using \@{var}
This is alright.
However, I think it'd feel more natural to use a named-group syntax to
look up the variables by name. (Not that I like the variety of named-group
syntaxes available -- it just feels like the languages already support a
named lookup of some sort that we can leverage.)
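To make the variable-reference idea concrete, here is a minimal sketch in Python (not the parser's own code) of expanding the proposed \@{var} form before the regex is compiled. The variable table, the helper name expand_variables, and the choice to expand a multi-valued variable into a non-capturing alternation are all assumptions for illustration. The reference itself is picked out with a named group, which is the kind of named lookup the regex languages already offer.

```python
import re

# Hypothetical variable table; names and values are illustrative only.
VARIABLES = {"HOME": ["/home/[^/]+", "/root"], "PROC": ["/proc"]}

# Matches a literal backslash, then @{name}; the name is a named group.
_VAR_REF = re.compile(r"\\@\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)\}")


def expand_variables(pattern, variables=VARIABLES):
    """Replace each \\@{name} with a non-capturing alternation of its values."""
    def substitute(match):
        values = variables[match.group("name")]  # KeyError on unknown names
        return "(?:" + "|".join(values) + ")"
    return _VAR_REF.sub(substitute, pattern)


expanded = expand_variables(r"\@{HOME}/\.ssh/.*")
```

With the table above, `expanded` is `(?:/home/[^/]+|/root)/\.ssh/.*`.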
> I know on the pcre side I want positive and negative lookaheads and
> back references (though it would not use the confusing \# syntax of
> pcre, but might use \g#). I'm not sure it makes sense to expose these to
> aare.
I'm not a fan of \g#. It isn't documented at all on either of these pages
from one of the best sources of multiple-language regex info:
http://www.regular-expressions.info/backref.html
http://www.regular-expressions.info/backref2.html
(I went there to kick-start my memory on which engine used \g# rather than
\#)
> I think it would be nice to support some of the posix character classes
> and maybe \d \D.
Yes.
> The big ones I want are a way to escape into pcre syntax and back to aare,
> and to accept permission embedding, which saves a fair bit of duplication
> and extra state creation (and then removal) on the backend.
> Eg.
> for mount instead of having to provide 5 rules
> part1 <perm>
> part1\0part2 <perm>
> part1\0part2\0part3 <perm>
> part1\0part2\0part3\0part4 <perm>
> part1\0part2\0part3\0part4\0part5 <perm>
>
> we could get away with encoding a single rule
> part1\<perm>\0part2\<perm>\0part3\<perm>\0part4\<perm>\0part5\<perm>
This seems plausible enough. Would it only make sense for mount?
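As a sanity check on the equivalence John describes, here is a small Python sketch that lists the per-prefix rules a single embedded-permission rule would stand for. The function name expand_embedded and the (pattern, perm) tuple representation are assumptions made for illustration; the real parser's data structures are not shown in this thread.

```python
def expand_embedded(parts, perm):
    """Return the per-prefix rules that one embedded-permission rule stands for.

    Each prefix of the part list, joined with a literal backslash-zero
    separator, gets the same permission.
    """
    return [("\\0".join(parts[:i]), perm) for i in range(1, len(parts) + 1)]


rules = expand_embedded(["part1", "part2", "part3", "part4", "part5"], "<perm>")
for pattern, perm in rules:
    print(pattern, perm)
```

This prints the five rules from the quoted example, from `part1 <perm>` up to the full five-part pattern.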
> I think there are 2 questions to answer, what set should we provide
> for the pcre style syntax, and what subset for aare?
>
> Below are some notes I have from the last time I was looking at it
> (not that they will really clear things up any)
>
> ---
>
> \@{variable} variable reference
> \^ ?start regex
> \$ ?end regex (return to globbing)
Can we (ab)use \A and \Z here instead? Those match beginning and end of
strings, so they'd presumably never appear in a regex.
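For reference, a short Python sketch of how \A and \Z behave as anchors today. It only shows the existing regex meaning, not the proposed "enter/leave regex" meaning. One caveat: Python's \Z matches only at the very end of the string, while in PCRE \Z also matches before a final newline and \z is the strict end-of-subject anchor.

```python
import re

text = "first\nsecond"

# With MULTILINE, ^ and $ also match at internal line breaks...
caret_hits = re.findall(r"^\w+$", text, flags=re.MULTILINE)

# ...while \A and \Z stay pinned to the ends of the whole string,
# so nothing matches here because \w+ cannot cross the newline.
anchored_hits = re.findall(r"\A\w+\Z", text, flags=re.MULTILINE)

print(caret_hits)     # ['first', 'second']
print(anchored_hits)  # []
```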
> \#{perm} ?embedded perm
This seems too easy to overlook when reviewing profiles; where would it be
useful?
> \- ?logical set operation minus?
> \& ?logical set operation and?
>
>
> see man pcrepattern
>
> \ general escape character
> ^ assert start of string
> $ assert end of string
When would start and end be useful?
> . match any char including newline
> [] character class
> [^] negative character class
> [x-y] range
> [[:xxx:]] POSIX named set
> [[:^xxx:]] negative POSIX named set
Unicode character classes might also be a nice addition. Maybe.
> () subpattern
> (?) extended meaning for subpattern
> | alternation
> ? 0 or 1 match, greedy, equiv to {0,1}
> + 1 or more, greedy, equiv to {1,}
> * 0 or more, greedy, equiv to {0,}
> {n} min/max qualifier exactly n
> {,n} min/max qualifier up to n
> {n,m} min/max qualifier at least n, no more than m, greedy
> {n,} min/max qualifier n or more, greedy
Will we want to expose lazy and possessive quantifiers too?
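To illustrate what is at stake with the lazy variants, here is a small Python sketch comparing a greedy and a lazy quantifier. It uses Python's re only because it is handy; possessive quantifiers are left out because older Python versions do not accept them. The sample paths are made up.

```python
import re

GREEDY = r"/.*/"
LAZY = r"/.*?/"

path = "/a/b/c"

# On a prefix match the two differ in how much they consume.
greedy_prefix = re.match(GREEDY, path).group(0)  # "/a/b/"
lazy_prefix = re.match(LAZY, path).group(0)      # "/a/"

# On a whole-string match they accept exactly the same inputs.
samples = ["/a/", "/a/b/", "/a/b/c", "a/b/", "//"]
same_decisions = all(
    bool(re.fullmatch(GREEDY, s)) == bool(re.fullmatch(LAZY, s))
    for s in samples
)
```

Greedy and lazy quantifiers differ only in which substring a partial match picks, so if rules are always matched against the whole pathname, the distinction may not buy anything.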
>
> \a alarm - hex 07
> \e escape - hex 1B
> \f formfeed - hex 0C
> \n newline - hex 0A
> \r carriage return - hex 0D
> \t tab - hex 09
Maybe we can drop bell, escape, ff, backspace, tab, vertical tab, and so
forth. It's the year 2000, people don't put those in filenames any more. :)
I know they come practically free in the implementation, but it just feels
so crufty to spend documentation on these oddballs.
> \ddd octal code
> \xhh hex code
>
> \cx control-x where x is any ascii character
>
> . any character including newline
> \b backspace
> \d decimal digit [0-9]
> \D not decimal digit [^0-9]
> \h horizontal whitespace character
> \H not horizontal whitespace character
> \N not a newline
> \s white space character
> \S not a white space character
> \v vertical whitespace character
> \V not a vertical whitespace character
I'm pretty retrogrouchy but these seem a bit much :) hehehe
> \w a "word" character
> \W not a "word" character
> \l lower case
> \L
> \u
> \U upper case
> \p property
> \P not Property
> \R Unicode newline sequence
I love the Unicode here, though this does open a potential rat's nest:
How do we want patterns to be written?
- UTF-8 only?
- UTF-16? BE? LE?
- UCS-2? BE? LE?
- UCS-4? BE? LE?
- UTF-32? BE? LE?
- Do we want to normalize:
  - Filenames?
  - Regular expressions?
Do we want to internalize it all to codepoints to allow a UTF-16 pattern
to match a UTF-8 filename? (At least I presume the kernel would forbid
UCS-2, UCS-4, UTF-16, and UTF-32 pathnames from the very start, what with
"/" being impossible to express without a 0x00 byte in these encodings.)
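The point about "/" is easy to check. This Python sketch encodes "/" in several encodings and records whether the result contains a NUL byte; the explicit LE/BE codec names are used so that no byte-order mark is added.

```python
CODECS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Map each codec name to whether its encoding of "/" contains a 0x00 byte.
has_nul = {codec: b"\x00" in "/".encode(codec) for codec in CODECS}

for codec in CODECS:
    print(codec, "/".encode(codec), has_nul[codec])
```

Only the UTF-8 encoding is NUL-free, and a NUL byte terminates a C pathname string.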
The tables for upper-case and lower-case Unicode code points might be more
kernel memory than we want to monopolize.
We can't assume all filenames are valid unicode; should we provide some
mechanism to require Unicode or other encoding schemes? Since we've
been providing bytestream-oriented interfaces up until now, it's been
easy enough to ignore. But if we're going to provide more features like
this, some encoding-enforcing feels like a natural next step.
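A rough Python sketch of the two checks discussed above: whether a filename's bytes are valid UTF-8 at all, and whether two spellings of the same text only compare equal after normalization. The helper name is_valid_utf8 and the sample names are assumptions for illustration, and nothing here reflects how the kernel side would do it.

```python
import unicodedata


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same visible text, precomposed (NFC) and decomposed (NFD).
composed = "caf\u00e9"
decomposed = unicodedata.normalize("NFD", composed)

# Byte-for-byte the two filenames differ...
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# ...and only compare equal once both are normalized the same way.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

A Latin-1 encoded name such as `b"caf\xe9"` fails the first check, which is the kind of filename a bytestream-oriented matcher has never had to care about.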
>
> (?= ) look ahead assertion
> (?! ) negative look ahead assertion
> (?<= ) look behind assertion
> (?<! ) negative look behind assertion
> (?(conditional)yes-pattern)
> (?(conditional)yes-pattern|no-pattern)
Sounds good.
> ({ } ) callout to fn
>
Interesting; I didn't know you had this in your plans for world
domination. :) My encoding tests would fit naturally here.
>
> \p and \P reserved
>
>
> NOTE: \n can NOT be used as a back reference
>
> \gn back reference by number
> \g{n} back reference by number
> \g{-n} relative back reference by number
> \k<name> back reference by name
> \k'name' back reference by name
> \g{name} back reference by name
> \k{name} back reference by name (.Net)
> (?P=name) back reference by name (Python)
I think I missed the grouping that assigns the names.
It seems odd to offer five different syntaxes to refer to captures by
name; it's very pleasant and kind for authoring policy, but doubtless more
than one person will have to check the manpage to see whether the
syntaxes differ at all. Supporting both <> and '' is polite, though.
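For what it's worth, here is how one of the listed families pairs a naming group with its back reference, sketched in Python. This is the (?P<name>...) / (?P=name) pair from the list above; the pattern and sample paths are made up, and PCRE's other spellings are not exercised here.

```python
import re

# (?P<dir>...) assigns the name; (?P=dir) refers back to it.
named = re.compile(r"(?P<dir>[^/]+)/(?P=dir)")

# The same pattern with a plain numbered group and back reference.
numbered = re.compile(r"([^/]+)/\1")

repeated = named.fullmatch("lib/lib")
different = named.fullmatch("lib/bin")
```

`repeated` is a match object whose `dir` group is `lib`, and `different` is None because the second component does not repeat the first.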
Thanks