[MERGE] is_ignored improvements...

John A Meinel john at arbash-meinel.com
Sat May 20 15:35:28 BST 2006


Jan Hudec wrote:
> On Fri, May 19, 2006 at 15:51:40 +1000, Martin Pool wrote:
>> On 18 May 2006, Robert Collins <robertc at robertcollins.net> wrote:
>>> On Thu, 2006-05-18 at 14:34 +0200, Jan Hudec wrote:
>>>
>>>>> The code is very complex compared to the current code, and we end up
>>>>> with two apis rather than one. I'm working on a variation derived from
>>>>> your work, which will combine the regexes but only need the single API.
>>>>> I'll post that up shortly, and I'd appreciate your review on this, as I
>>>>> know you put a lot of thought into the is_ignored work in your branch.
>>>> The two interfaces are there because python has arbitrary limit of 100
>>>> match groups in a pattern. So if I wanted to tell which pattern matched
>>>> (the user interface displays that), I had to do the match bunch at a
>>>> time rather than all at once. But in many places just a boolean is
>>>> wanted, so all patterns can be matched at once, without captures.
>> And even aside from the limit, matching them without capture is somewhat
>> faster, so in the long term having separate interfaces may be better.
>> But for now Robert's patch looks reasonably simple and I'm +1 on merging
>> it, if it's OK with Jan.
> 
> It's OK with me. I'll keep my version around and combine them when I finish
> the properties stuff. Then I'll combine with Roberts patch and cut down the
> call nesting in my version a bit.
> 
> Note, that the whole pattern is not that faster than matching in reasonably
> large chunks, because the regexp engine is backtracking. What would actually
> help is merging common prefixes (so the matcher can skip a lot of patterns at
> once if the prefix fails).

Well, I think the #1 pattern format is *.foo, where we are just looking
for some sort of extension. And looking in all directories, etc. I
almost wonder if we wouldn't be better off with some sort of translation
that changes all of the *.foo into 'path.endswith()' calls.

I realize we would still end up with some sort of M*N number of endswith
calls, but it seems like each call should be much less expensive.

In my minimal amount of testing, this was very much not the case,
though. This is what I changed:

class Endswith(object):

        __slots__ = ['end', 'pattern']

        def __init__(self, pattern):
                assert pattern.startswith('.*\\.')
                self.pattern = pattern
                self.end = pattern[3:]

        def match(self, s):
                if s.endswith(self.end):
                        return self.pattern
                return None


endsWith = [Endswith(p) for p in patlist]

So it would seem that the regex implementation is fairly good, even if
it isn't perfect.

...

>> So the ?! says "don't match anything with .*/ after this point", which
>> is OK but I wonder if it'd be simpler or faster to just translate '*' to [^/]*.
>> Anyhow, could just add a comment about that until someone has time to
>> measure it.
> 
> Yes, it would be. And it indeed is what my new converter does. But we can't
> tell fnmatch to do it. So I would not do it for this version. Of course we
> could bring in the Replacer and reimplement fnmatch.translate in it, but
> I don't think it's worth it now.
> 

I also tried this:
compPrefix = [re.compile('.*\\.(?:' +
        '|'.join(['(?:%s$)' % i for i in range(0, max)])
        + ')' )]
(Factoring out the '.*\.' prefix), and I found that it doubles the
performance:

# For foo.19
$ python ,time-matches.py
NoMatch: 0.572
NoMatchSplit: 0.568
MatchSplit: 0.740
Separate: 2.792
Prefix: 0.292
PrefixSplit: 0.288
Endswith: 3.509

So I think there is stuff worth looking into.

John
=:->

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ,time-matches.py
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20060520/a66ca0d6/attachment.diff 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 249 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060520/a66ca0d6/attachment.pgp 


More information about the bazaar mailing list