[patch] improved ignore pattern matching (#57637)

Tue Nov 28 12:41:43 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kent Gibson wrote:
>
>
> John Arbash Meinel wrote:
>>> My memory may be faulty, but I seem to recall it making a
>>> difference when you have ~50 groups. It has been a long time
>>> since I benchmarked it, though.
>>>
> And I've not tested that particular sour spot, so your memory may
> be correct.
>>> Actually, right now bzr uses matching groups, because it uses
>>> the match to determine what original pattern matched. Used by
>>> 'bzr ignored', and a few other places. We talked about doing it
>>>  differently. Because I think it did make a difference if you
>>> match 100 patterns all at once, rather than matching 50
>>> patterns, and then another 50 patterns.
> I found little difference. OTOH merging similar patterns, as you
> suggest below, made a huge difference.
>>> There are probably other more important optimizations, though.
>>> Like maybe being able to split out all of the patterns that
>>> match all of the path, versus the ones that only match a little
>>> bit. So that you can combine the common prefixes.
>>>
>>> .*/(pat1|pat2|pat3)
>>>
>>> versus (.*/pat1|.*/pat2|.*/pat3)
>>>
> The patch does that.  It splits patterns into 3 categories -
> extension, basename and full path. It also matches in that order,
> on the assumption that more general patterns are more likely to
> match.
>>> Though again, it isn't 100% clear how the actual regex engine
>>> translates things, as to which one it will actually find faster
>>> to process. Because I know I found: (pat1$|pat2$|pat3$) faster
>>> than: (pat1|pat2|pat3)$
>>>
> Interesting.  I'll have to try that. The patch uses the second
> form, well sort of - for extensions it would actually be something
> like (?:.*\.(pat1)|(pat2)|(pat3))$.
>
I got the pattern wrong (so my memory is definitely faulty) - it
should be '.*\.(?:(pat1)|(pat2)|(pat3))$'

Anyway, I did a little benchmarking using "bzr selftest --bench
ignore" and twiddling with the following lines in glob.py:

            grouped_rules = ['(%s)' % translator(pat) for pat in
patterns[:99]]
            joined_rule = '%s(?:%s)$' % (prefix, '|'.join(grouped_rules))

To test named versus unnamed grouping I changed the first line to

            grouped_rules = ['(?:%s)' % translator(pat) for pat in
patterns[:99]]

This will shag the regex to glob mapping, but works ok for
benchmarking since it uses the non-matching case.
The result was no observable difference.  So the Python documentation
seems to be right - for Python 2.4 anyway.

To test placing the $ per pattern I changed the lines to

            grouped_rules = ['(%s$)' % translator(pat) for pat in
patterns[:99]]
            joined_rule = '%s(?:%s)' % (prefix, '|'.join(grouped_rules))

The result was about a 10% slowdown.
So I think I'll leave it the way it was.

As always YMMV.

Cheers,
Kent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFbC6HgoxTFTi1P8QRAsJnAJ9Bq/C6CB+GIZY3lpsheAbEmuTdrACg4Pg6
bMpqpe+8+x1kh8H12dLL1Z0=
=oulh
-----END PGP SIGNATURE-----