[MERGE] is_ignored improvements...

Thu May 18 23:47:05 BST 2006

On Thu, 2006-05-18 at 20:40 +0200, Jan Hudec wrote:
> On Thu, May 18, 2006 at 11:21:17 -0500, John Arbash Meinel wrote:
> > Did you try performance testing with and without group matching? I would
> > have guessed that it was faster to match without the groups.
> > The reason I originally preferred Jan's double api, is because *most* of
> > the callers of is_ignored() didn't actually care about what pattern was
> > matched. So why spend the time if you are just throwing the information
> > away.
> > 
> > Remember, status also uses 'is_ignored', and that is actually where *I*
> > ran into the major slowdown. (add, status, ignored, info all use it)
> > 
> > If you do grep "\<is_ignored\>", you get:
> > add.py:158:                    ignore_glob = tree.is_ignored(subp)
> > builtins.py:1425:            pat = tree.is_ignored(path)
> > info.py:203:        if working.is_ignored(path):
> > tree.py:242:        if new_tree.is_ignored(filename):
> > workingtree.py:672:        elif self.is_ignored(filename):
> > workingtree.py:710:                elif self.is_ignored(fp):
> > workingtree.py:889:            if not self.is_ignored(subp):
> > workingtree.py:973:            pat = self.is_ignored(subp)
> > workingtree.py:994:    def is_ignored(self, filename):
> > workingtree.py:1160:                if self.is_ignored(f):
> > 
> > Out of these requests 3 use the return value, and 6 just check for
> > true/false.
> 
> ... so I just tried a simple, not at all representative test. I took patterns
>   ['.*\\.%i' % i for i in range(0, 999) and matched them:
> 1. Non-capturing groups, all at once.
> 2. Non-capturing groups, 50 at a time, until match is found.
> 3. Capturing groups, 50 at a time, until match is found.
> 4. No groups, each separately, until match is found.
> against a pattern corresponding to 789th group ('foo.789'). And the times for
> 100000 runs on my machine (Athlon64 3000 (2.2GHz)) were about:
> 1. 12.5
> 2. 13.7
> 3. 18.9
> 4. 70.3
> 
> So the groups slows it down by about a half, but compared to matching one
> pattern at a time it's not /that/ significant.
> 
> Now I have to say, that when I doubled the pattern legnth and matched against
> 'foo.1789', the times doubled even for the single pattern. That means that
> the pattern compiler does not compile to a DFA. So it will be N^2 in all
> cases until python gets a better regex compiler.

Right. For 100,000 runs with 1000 ignore rules, I think 20 seconds is
not a concern : must users will have < 100 rules (drops to 2 seconds
overhead), and 100,000 files is 10 times the linux kernel source size,
or 5* if you do a build first. So a kernel tree might take 0.4 seconds
in the is_ignored routine - I think that is acceptable.

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 191 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060519/3fd29065/attachment.pgp