[MERGE] is_ignored improvements...

Thu May 18 18:18:08 BST 2006

On Fri, May 19, 2006 at 00:14:08 +1000, Robert Collins wrote:
> Heres my alternative patch, finished up:
> 
> I currently prefer this over the original patch you prepared because of
> the preservation of API. I dont think many users will have more than 100
> regexs, and it should perform just fine.
> 
> What do you think? Does this make the future improvements you are
> working on harder? Is it a problem to take this variation?

I certainly isn't a problem. I can combine the patches later when/as needed.

However I have some comments, below in the patch. The most important is
probably that you should use the fact, that fnmatch.translate will never
generate capturing (nor non-capturing) groups, and simplify the combiner.

Also I think you should add a blackbox test that actually creates
a 1000-lines long .bzrignore and checks some files matched by patterns
towards the end are ignored and proper pattern is reported for them. I had
such test there simply creating patterns '*.0' through '*.999' and matching
a few files like 'foo.333' and 'bar.987' or somesuch. This will test that the
regexp engine will actually handle the patterns as split and report correct
group matches.

> === added file 'bzrlib/tests/workingtree_implementations/test_is_ignored.py'
> --- /dev/null	
> +++ bzrlib/tests/workingtree_implementations/test_is_ignored.py	
[...]
> +
> +import bzrlib
> +import bzrlib.branch
> +from bzrlib.branch import Branch
> +import bzrlib.bzrdir as bzrdir
> +from bzrlib.bzrdir import BzrDir
> +import bzrlib.errors as errors
> +from bzrlib.errors import NotBranchError, NotVersionedError
> +from bzrlib.osutils import basename
> +from bzrlib.tests import TestSkipped
> +from bzrlib.tests.workingtree_implementations import TestCaseWithWorkingTree
> +from bzrlib.trace import mutter
> +from bzrlib.transport import get_transport
> +import bzrlib.workingtree as workingtree
> +from bzrlib.workingtree import (TreeEntry, TreeDirectory, TreeFile, TreeLink,
> +                                WorkingTree)

There are many imports here, but they don't seem to be actually used
anywhere.

[...]
> === modified file 'NEWS'
> --- NEWS	
> +++ NEWS	
> @@ -34,6 +34,12 @@
>      * Fix shadowed definition of TestLocationConfig that caused some 
>        tests not to run.  (#32587, Erik B??gfors, Michael Ellerman, 
>        Martin Pool)
> +
> +  INTERNALS:
> +
> +    * Combine the ignore rules into a single regex rather than looping over
> +      them to avoid N^2 behaviour in operations like status. (Jan Hudec,
> +      Robert Collins).

Just a very very picky note: It's actually still N^2, because only fixed
number of patterns is compiled together. I actually fear it is actually N^2
even if all patterns are compiled together, because I don't think it creates
a full DFA. It's of course still a big improvement.

> === modified file 'bzrlib/tests/test_workingtree.py'
> --- bzrlib/tests/test_workingtree.py	
> +++ bzrlib/tests/test_workingtree.py	
> @@ -17,6 +17,7 @@
>  
>  from cStringIO import StringIO
>  import os
> +import _sre

I don't see this used anywhere below (I am not saying I didn't have such bugs
in my patch as well).

[...]
> === modified file 'bzrlib/workingtree.py'
> --- bzrlib/workingtree.py	
> +++ bzrlib/workingtree.py	
[...]
> @@ -966,6 +966,62 @@
>                  subp = appendpath(path, subf)
>                  yield subp
>  
> +    def _translate_ignore_rule(self, rule):
> +        """Translate a single ignore rule to a regex.
> +
> +        There are three sorts of ignore rules:
> +        root only - regex is the rule itself without the leading './'. These
> +        are identified by a leading './'.
> +        full path - regex is the rule itself and is identified by the 
> +        presenve of a '/' in the path.
> +        basename only rule - regex is a rule that ignores everything up
> +        to the last / in the string before applying the supplied rule.
> +        These are the default case.
> +
> +        :return: The translated regex.
> +        """
> +        if rule[:2] in ('./', '.\\'):
> +            # rootdir rule
> +            result = fnmatch.translate(rule[2:])
> +        elif '/' in rule or '\\' in rule:
> +            # path prefix 
> +            result = fnmatch.translate(rule)
> +        else:
> +            # default rule style.
> +            result = "(?:.*/)?(?!.*/)" + fnmatch.translate(rule)
> +        assert result[-1] == '$', "fnmatch.translate did not add the expected $"
> +        return "(" + result + ")"
> +
> +    def _combine_ignore_rules(self, rules):
> +        """Combine a list of ignore rules into a single regex object.
> +
> +        Each individual rule is combined with | to form a big regex, which then
> +        has $ added to it to form something like ()|()|()$. The group index for
> +        each subregex's outermost group is placed in a dictionary mapping back 
> +        to the rule. This allows quick identification of the matching rule that
> +        triggered a match.
> +        :return: a list of the compiled regex and the matching-group index 
> +        dictionaries. We return a list because python complains if you try to 
> +        combine more than 100 regexes.
> +        """
> +        result = []
> +        groups = {}
> +        next_group = 0
> +        translated_rules = []
> +        for rule in rules:
> +            translated_rule = self._translate_ignore_rule(rule)
> +            compiled_rule = re.compile(translated_rule)
> +            groups[next_group] = rule
> +            next_group += compiled_rule.groups
> +            translated_rules.append(translated_rule)
> +            if next_group == 99:
> +                result.append((re.compile("|".join(translated_rules)), groups))
> +                groups = {}
> +                next_group = 0
> +                translated_rules = []
> +        if len(translated_rules):
> +            result.append((re.compile("|".join(translated_rules)), groups))
> +        return result

This is hell lot needlessly complicated. fnmatch.translate will never create
a capturing group, so you can just take a slice, which would be quite a bit
simpler and faster. You also don't need the compile to check whether there
are groups. You can assert that the pattern does not contain '(' not preceded
by r'\' if you don't trust that.

>      def ignored_files(self):
>          """Yield list of PATH, IGNORE_PATTERN"""
[...]

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/d803c86f/attachment.pgp