[MERGE] is_ignored improvements...
Jan Hudec
bulb at ucw.cz
Sat May 20 16:35:32 BST 2006
On Sat, May 20, 2006 at 09:35:28 -0500, John A Meinel wrote:
> Jan Hudec wrote:
> > On Fri, May 19, 2006 at 15:51:40 +1000, Martin Pool wrote:
> >> On 18 May 2006, Robert Collins <robertc at robertcollins.net> wrote:
> >>> On Thu, 2006-05-18 at 14:34 +0200, Jan Hudec wrote:
> >>>
> >>>>> The code is very complex compared to the current code, and we end up
> >>>>> with two apis rather than one. I'm working on a variation derived from
> >>>>> your work, which will combine the regexes but only need the single API.
> >>>>> I'll post that up shortly, and I'd appreciate your review on this, as I
> >>>>> know you put a lot of thought into the is_ignored work in your branch.
> >>>> The two interfaces are there because python has an arbitrary limit of
> >>>> 100 match groups in a pattern. So if I wanted to tell which pattern
> >>>> matched (the user interface displays that), I had to do the matching a
> >>>> bunch at a time rather than all at once. But in many places just a
> >>>> boolean is wanted, so all the patterns can be matched at once, without
> >>>> captures.
> >> And even aside from the limit, matching them without capture is somewhat
> >> faster, so in the long term having separate interfaces may be better.
> >> But for now Robert's patch looks reasonably simple and I'm +1 on merging
> >> it, if it's OK with Jan.
> >
> > It's OK with me. I'll keep my version around and combine them when I finish
> > the properties stuff. Then I'll combine with Robert's patch and cut down the
> > call nesting in my version a bit.
> >
> > Note that matching everything as one big pattern is not that much faster
> > than matching in reasonably large chunks, because the regexp engine is
> > backtracking. What would actually help is merging common prefixes (so the
> > matcher can skip a lot of patterns at once if the prefix fails).
>
> Well, I think the #1 pattern format is *.foo, where we are just looking
> for some sort of extension. And looking in all directories, etc. I
> almost wonder if we wouldn't be better off with some sort of translation
> that changes all of the *.foo into 'path.endswith()' calls.
You can try to time it, but I doubt it would win. I'd rather strip the *.
from those patterns, convert the rest to regexps, or them together and
wrap the result in r'.*\.(?:%s)'. That would make a fourth case in the
converting switch.
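Something like this illustrates the idea (only a sketch, not the actual
converter; the helper name is made up and it only handles plain '*.ext'
globs, which is the common case):

import re

def combine_extension_patterns(globs):
    # Hypothetical helper, not the real converter: collapse plain '*.ext'
    # globs into one regexp that shares a single '.*\.' prefix, so the
    # common prefix is only scanned once instead of once per pattern.
    exts = [re.escape(g[2:]) for g in globs if g.startswith('*.')]
    return re.compile(r'.*\.(?:%s)$' % '|'.join(exts))

matcher = combine_extension_patterns(['*.pyc', '*.o', '*.tmp'])
print(bool(matcher.match('foo.pyc')))  # True
print(bool(matcher.match('foo.c')))    # False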
> I realize we would still end up with some sort of M*N number of endswith
> calls, but it seems like each call should be much less expensive.
I don't think an endswith call will be enough faster than matching one
branch of the regexp to make up for looping in Python rather than in C.
Robert already suggested combining the relative patterns and prefixing them
with '(?:.*/)?(?!.*/)' as a group rather than prefixing each individually.
So I'd recommend taking a similar approach for the *.<something> patterns.
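Roughly like this (again only a sketch of the grouping, not the code from
Robert's patch):

import fnmatch
import re

def combine_relative_patterns(globs):
    # Illustrative only: translate each basename glob, or them together,
    # and apply the '(?:.*/)?(?!.*/)' prefix once to the whole group, so
    # one failed prefix check skips every pattern in the group.
    translated = ['(?:%s)' % fnmatch.translate(g) for g in globs]
    return re.compile(r'(?:.*/)?(?!.*/)(?:%s)' % '|'.join(translated))

matcher = combine_relative_patterns(['*.o', 'core'])
print(bool(matcher.match('src/lib/foo.o')))  # True
print(bool(matcher.match('src/core')))       # True
print(bool(matcher.match('src/foo.c')))      # False

The fnmatch-translated pieces still turn '*' into '.*', but the lookahead
keeps the match from crossing a '/' anyway.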
> In my minimal amount of testing, this was very much not the case,
> though. This is what I changed:
>
> class Endswith(object):
>
>     __slots__ = ['end', 'pattern']
>
>     def __init__(self, pattern):
>         assert pattern.startswith('.*\\.')
>         self.pattern = pattern
>         self.end = pattern[3:]
>
>     def match(self, s):
>         if s.endswith(self.end):
>             return self.pattern
>         return None
>
>
> endsWith = [Endswith(p) for p in patlist]
>
> So it would seem that the regex implementation is fairly good, even if
> it isn't perfect.
>
> ...
>
> >> So the ?! says "don't match anything with .*/ after this point", which
> >> is OK but I wonder if it'd be simpler or faster to just translate '*' to [^/]*.
> >> Anyhow, could just add a comment about that until someone has time to
> >> measure it.
> >
> > Yes, it would be. And it indeed is what my new converter does. But we can't
> > tell fnmatch to do it. So I would not do it for this version. Of course we
> > could bring in the Replacer and reimplement fnmatch.translate in it, but
> > I don't think it's worth it now.
> >
>
> I also tried this:
> compPrefix = [re.compile('.*\\.(?:' +
>     '|'.join(['(?:%s$)' % i for i in range(0, max)])
>     + ')' )]
> (factoring out the '.*\.' prefix), and I found that it roughly doubles the
> performance:
>
> # For foo.19
> $ python ,time-matches.py
> NoMatch: 0.572
> NoMatchSplit: 0.568
> MatchSplit: 0.740
> Separate: 2.792
> Prefix: 0.292
> PrefixSplit: 0.288
> Endswith: 3.509
>
> So I think there is stuff worth looking into.
Yes. Seems to make quite a difference.
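For reference, something along these lines should reproduce the shape of
that comparison (only a rough sketch, not John's actual ,time-matches.py,
and the pattern count of 100 is a guess):

import re
import timeit

# N '*.<number>' patterns matched three ways: one compiled regexp per
# pattern, one alternation with the '.*\.' prefix factored out, and a
# plain Python loop over endswith().
N = 100
separate = [re.compile(r'.*\.%d$' % i) for i in range(N)]
prefixed = re.compile(r'.*\.(?:%s)$' % '|'.join(str(i) for i in range(N)))
suffixes = ['.%d' % i for i in range(N)]

def match_separate(name='foo.19'):
    return any(p.match(name) for p in separate)

def match_prefixed(name='foo.19'):
    return prefixed.match(name) is not None

def match_endswith(name='foo.19'):
    return any(name.endswith(s) for s in suffixes)

for fn in (match_separate, match_prefixed, match_endswith):
    print(fn.__name__, timeit.timeit(fn, number=10000))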
--
Jan 'Bulb' Hudec <bulb at ucw.cz>