Replacing the dfa

John Johansen john.johansen at canonical.com
Wed Jun 16 08:59:03 BST 2010


Over the last few weeks I have been looking into improving loaded policy
size and creation time.  This has led me to believe that we need to
replace how AA does its pattern matching.

AA currently uses a dfa to do its pattern matching.  This allows for a fast
linear match against path names, and it can also provide a unique label for
each match via its accept states.
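
To make that concrete, here is a toy, standalone sketch (not AppArmor code;
the machine and the permission label are made up) of matching against a dfa
stored as a full state x byte table, with the label read off the final
state:

  /* Toy sketch, not AppArmor code: a dfa kept as a full state x byte
   * table.  Matching is one table lookup per byte of the path, and the
   * (made up) accept label is read from the final state. */
  #include <stdio.h>

  #define NSTATES 3

  /* Hypothetical 3-state machine: '/' from state 0 moves to state 1,
   * 'a' from state 1 moves to accepting state 2, and every cell left
   * zero falls back to state 0. */
  static unsigned char dfa[NSTATES][256];
  static unsigned accept_label[NSTATES] = { 0, 0, 0x4 };  /* made up label */

  static unsigned match(const char *path)
  {
      unsigned state = 0;

      for (; *path; path++)
          state = dfa[state][(unsigned char)*path];  /* one step per byte */
      return accept_label[state];
  }

  int main(void)
  {
      dfa[0]['/'] = 1;
      dfa[1]['a'] = 2;
      printf("label 0x%x\n", match("/a"));           /* prints 0x4 */
      return 0;
  }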

The dfa, however, has some problems: it can be very large, it can take a
long time to create, and it currently can't handle embedded variables.

The size of a dfa is governed by the number of states and the language being
matched; in our case the input language is fixed at 8 bit bytes.  The number
of states is variable and depends not on the number of rules but on how
those rules interact, and it is subject to state explosion at certain points
of rule overlap.  A classic example is a pattern that has to remember which
of the last n bytes were a particular character, which needs on the order of
2^n states.  In its simplest form a dfa is a table with states as rows and
input characters as columns, with each cell containing the next state
transition.  This results in unacceptably large tables for all but the
simplest of dfas, so they are usually compressed.
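
As a rough illustration with made up numbers: an uncompressed table for a
10,000 state dfa over 8 bit input needs 10,000 states * 256 entries * 2
bytes per entry = 5,120,000 bytes, about 5 MB, before any compression.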

The AA dfa uses a simple comb compression which, on average, reduces the size
of the dfa by about 90%, while keeping each state traversal a constant time
operation.  Even with this reduction we are seeing some dfas larger than
1 MB in size.
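
For anyone not familiar with the scheme, a comb (row displacement) lookup
stays constant time because a byte the state does not own simply falls back
to the state's default transition.  The sketch below is an illustrative toy
with made up table names, not the actual AA table format:

  /* Illustrative comb-compressed dfa step: rows are packed into shared
   * nxt[]/check[] arrays at per-state offsets (base[]), with a per-state
   * default transition (def[]) for every byte the state does not own. */
  #include <stdio.h>
  #include <string.h>

  #define TSIZE 512                   /* packed transition array size (toy) */

  static unsigned short base[2] = { 0, 1 };
  static unsigned short def[2]  = { 0, 1 };
  static unsigned short nxt[TSIZE];
  static unsigned short check[TSIZE];

  static unsigned step(unsigned state, unsigned char c)
  {
      unsigned pos = base[state] + c;

      if (check[pos] == state)        /* this state really owns the slot */
          return nxt[pos];
      return def[state];              /* fall back, still one step per byte */
  }

  int main(void)
  {
      memset(check, 0xff, sizeof(check));       /* mark every slot unowned */

      /* state 0: 'x' -> 1, default stays in 0; state 1: 'y' -> 0, default 1 */
      nxt[0 + 'x'] = 1; check[0 + 'x'] = 0;
      nxt[1 + 'y'] = 0; check[1 + 'y'] = 1;

      unsigned state = 0;
      for (const char *p = "axb"; *p; p++)
          state = step(state, (unsigned char)*p);
      printf("final state %u\n", state);        /* prints 1 */
      return 0;
  }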

Creating the dfa is time and memory consuming because of state explosion,
redundant states, and compression.  State explosion is a property of the dfa,
so there is not much that can be done about it as long as we use dfas.  The
redundant states occur because generating a dfa from regular expressions does
not usually produce the minimal form of the dfa.  This problem is exacerbated
because the redundant states are themselves subject to state explosion.

To reduce the number of redundant states, AA applies a regular expression
tree rewriting stage, which produces a new expression tree that builds an
equivalent dfa closer to the optimal dfa for the expression.  AA then
applies a dfa minimization step to remove the remaining redundant states.
Finally, the compression phase is applied.
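
For reference, the minimization step is conceptually simple even though it
is expensive on large machines: start with the states partitioned by accept
label and keep splitting blocks whose members disagree on which block their
transitions land in.  Below is a toy, standalone sketch of that refinement
(not the AA implementation, which has to be much more efficient than this
quadratic version):

  #include <stdio.h>
  #include <string.h>

  #define NSTATES 6
  #define SIGMA   2     /* toy 2 symbol alphabet, real dfas use 256 */

  /* Toy dfa: states 1 and 2 are equivalent, as are states 3 and 4, so
   * minimization should fold the 6 states down to 4 blocks. */
  static int delta[NSTATES][SIGMA] = {
      {1, 2}, {3, 2}, {4, 2}, {3, 5}, {3, 5}, {5, 5}
  };
  static int label[NSTATES] = { 0, 0, 0, 1, 1, 0 };  /* accept label per state */

  static int part[NSTATES];     /* current block id of each state */

  /* states a and b stay together only if their transitions agree block-wise */
  static int same_block(int a, int b)
  {
      for (int c = 0; c < SIGMA; c++)
          if (part[delta[a][c]] != part[delta[b][c]])
              return 0;
      return 1;
  }

  int main(void)
  {
      /* initial partition: group states by accept label */
      for (int s = 0; s < NSTATES; s++)
          part[s] = label[s];

      /* naive Moore refinement: split blocks until nothing changes */
      int changed = 1;
      while (changed) {
          int newpart[NSTATES];
          int nblocks = 0;

          for (int s = 0; s < NSTATES; s++)
              newpart[s] = -1;
          for (int s = 0; s < NSTATES; s++) {
              if (newpart[s] != -1)
                  continue;
              newpart[s] = nblocks;
              for (int t = s + 1; t < NSTATES; t++)
                  if (newpart[t] == -1 && part[t] == part[s] && same_block(s, t))
                      newpart[t] = nblocks;
              nblocks++;
          }
          changed = memcmp(newpart, part, sizeof(part)) != 0;
          memcpy(part, newpart, sizeof(part));
      }

      for (int s = 0; s < NSTATES; s++)
          printf("state %d -> block %d\n", s, part[s]);
      return 0;
  }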

So to summarize: AA is doing a lot of work, some of it redundant, and which
part of the compiler pipeline affects a policy most depends on what
expressions its profiles contain.


So now to the solutions:
- switch to an NFA, which is space efficient but not time efficient, nor
  does it provide the tight bound on run time memory use that we want in
  the kernel.
- switch to a hybrid NFA/DFA, which uses NFA nodes as choke points within
  a DFA to keep exponential state expansion in check.  How many choke
  points are inserted, and where, can be controlled, so run time memory
  use can be bounded (see the sketch after this list).
- switch to an extended finite automaton (xfa); these are like dfas with
  a small finite scratch memory that is used to track positions instead
  of encoding them through state explosion.
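
To make the trade-off concrete, the sketch below (a toy, not a proposed
implementation) walks a machine that can have more than one state active at
a time, which is what an nfa, or the nfa portion of a hybrid, has to do at
run time.  A pure dfa keeps exactly one active state; a pure nfa may
activate many; the point of the hybrid's choke points is to bound how large
the active set can grow, and with it the run time memory a match can use.

  #include <stdio.h>

  #define NSTATES 3

  /* Toy machine over 'a'/'b' (any other byte is treated as 'b') accepting
   * strings that contain "ab".  Each cell is a bitmask of next states, so
   * more than one state can be active at once. */
  static unsigned trans[NSTATES][2] = {
      { (1u << 0) | (1u << 1), 1u << 0 },  /* 0: stay; on 'a' also try 1   */
      { 0,                     1u << 2 },  /* 1: a 'b' here completes "ab" */
      { 1u << 2,               1u << 2 },  /* 2: accepting, sticky         */
  };

  static int match(const char *s)
  {
      unsigned active = 1u << 0;              /* start: only state 0 active */

      for (; *s; s++) {
          unsigned next = 0;
          int sym = (*s == 'a') ? 0 : 1;

          for (int q = 0; q < NSTATES; q++)   /* cost grows with |active|  */
              if (active & (1u << q))
                  next |= trans[q][sym];
          active = next;
          if (!active)
              return 0;                       /* no state alive, reject    */
      }
      return (active & (1u << 2)) != 0;
  }

  int main(void)
  {
      printf("%d %d\n", match("bbab"), match("bbba"));   /* prints 1 0 */
      return 0;
  }

The inner loop over the active set is exactly the run time cost a pure dfa
avoids by paying for it with extra states at compile time; limiting where
the set is allowed to grow is what makes the hybrid's memory use
controllable.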

xfas would be my first choice except that they are encumbered by patents,
which basically leaves the hybrid nfa/dfa as the best choice.

The hybrid should allow us to improve both compile time and memory use by
reducing state explosion.  If done properly, this should reduce the
compiled size of all but the simplest profiles.

Compression will still be needed, and I suggest we look at an improved
compression scheme that takes better advantage of similarities between dfa
states.  I would like to avoid a d^2fa style compression, as the traversal
for a single input character is not bounded and can require multiple state
transitions.  A delta style compression, where next state entries are
stored relative to the current state, should allow for better state
compression than we can achieve now.
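
One possible shape of the delta idea, purely as a sketch and not a worked
out format: if most next state numbers are close to the current state
number they can be stored as small signed offsets, which are highly
repetitive and pack well, while decoding a transition stays a single
bounded step per input byte, unlike the chained default transitions of
d^2fa.  The toy below keeps the offsets in an uncompressed table just to
show the lookup:

  /* Purely illustrative: each transition stores (next - current) as a
   * small signed value; following a transition is one load and one add. */
  #include <stdio.h>

  #define NSTATES 4

  /* Toy machine: a chain 0 -> 1 -> 2 -> 3 on 'a', anything else resets
   * to state 0, so every delta is either +1 or -state. */
  static signed char delta[NSTATES][256];

  static unsigned step(unsigned state, unsigned char c)
  {
      return (unsigned)((int)state + delta[state][c]);
  }

  int main(void)
  {
      for (int s = 0; s < NSTATES; s++) {
          for (int c = 0; c < 256; c++)
              delta[s][c] = (signed char)-s;        /* reset to state 0  */
          if (s < NSTATES - 1)
              delta[s]['a'] = 1;                    /* advance the chain */
      }

      unsigned state = 0;
      for (const char *p = "aaxa"; *p; p++)
          state = step(state, (unsigned char)*p);
      printf("final state %u\n", state);            /* 'x' resets, so 1 */
      return 0;
  }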


As always feedback is welcome
john


