[MERGE] is_ignored improvements...

Fri May 19 01:17:08 BST 2006

Jan Hudec wrote:
> On Thu, May 18, 2006 at 14:14:02 -0500, John Arbash Meinel wrote:
>> Jan Hudec wrote:
>>> On Thu, May 18, 2006 at 11:21:17 -0500, John Arbash Meinel wrote:
>> ...
>>
>>>> Out of these requests 3 use the return value, and 6 just check for
>>>> true/false.
>>> ... so I just tried a simple, not at all representative test. I took patterns
>>>   ['.*\\.%i' % i for i in range(0, 999) and matched them:
>>> 1. Non-capturing groups, all at once.
>>> 2. Non-capturing groups, 50 at a time, until match is found.
>>> 3. Capturing groups, 50 at a time, until match is found.
>>> 4. No groups, each separately, until match is found.
>>> against a pattern corresponding to 789th group ('foo.789'). And the times for
>>> 100000 runs on my machine (Athlon64 3000 (2.2GHz)) were about:
>>> 1. 12.5
>>> 2. 13.7
>>> 3. 18.9
>>> 4. 70.3
>>>
>>> So the groups slows it down by about a half, but compared to matching one
>>> pattern at a time it's not /that/ significant.
>>>
>>> Now I have to say, that when I doubled the pattern legnth and matched against
>>> 'foo.1789', the times doubled even for the single pattern. That means that
>>> the pattern compiler does not compile to a DFA. So it will be N^2 in all
>>> cases until python gets a better regex compiler.
>>>
>> I'm a little curious about what syntax you used for the final
>> compilation. Because I tried:
>>>>> patterns = ['.*\\.%i' % i for i in range(0,1999)]
>>>>> all = re.compile('|'.join(patterns))
>>>>> all.match('foo.789').group()
>> 'foo.7'
>>>>> all = re.compile('(' + '|'.join(patterns) + ')$')
>>>>> all.match('foo.789').group()
>> 'foo.789'
>>
>> So without using the (patterns)$ form, it was actually matching the
>> wrong pattern. (it was matching *.7 instead of *.789).
> 
> I used '|'.join(['(%s$)' % p for p in patterns]) (and the non-capturing
> equivalent). Attaching the actual script.
> 
>> And I'm not positive, but I thought a DFA was not possible when you had
>> a pattern with '^' or '$'.  At least that's what I remember reading from
>> the tla source code. I'm not really a regex expert.
> 
> - For ^ it certainly is -- DFA actually does a match, not a search, so need
>   to tweak it by prepending loops to start points without ^ to do a search.
> - For $ it's possible -- and quite simple -- as well. Just include
>   a transition for end-of-string or include a flag telling you whether the
>   accept state is end-of-string only or not.
> 
>> Anyway, just to say, because we need to anchor at the beginning and end
>> of the pattern I don't think it is possible to create a DFA.
> 
> These are perfectly doable with a DFA. There are other things that are hard
> or impossible though. One that is impossible are backreferences ('(.*)(?=1)$'
> is not a regular grammar). And for some others I am not sure whether they are
> hard or actually impossible. One of them are actually the capture groups,
> since you may need to record multiple starts and ends in parallel and then
> choose the right match afterwards.
> 

Well, I did some timing myself, and I can say having the match early
makes a world of difference. Attached is what I got.

Its pretty obvious that separate is 1 order of magnitude worse than any
of the others. It is also obvious that searching for '.19' is more than
an order of magnitude faster than 900 or 1900. But that 2k rules isn't a
lot more expensive than 1k rules.

So it would seem that we would benefit a lot by putting commonly matched
patterns at the beginning.

However, I have to yield to Robert's mention that 2k patterns and 100k
files is very uncommon.
However, I did have a source tree with > 100, because it was a big
project, and all the build projects stayed in the source tree. (lots of
.o, >50 executables)

The number which is actually fairly representative is the foo.yyy with
100 patterns. Because if you think of the kernel, you are likely to
easily have 100 patterns (bzr has 53 built-in).
And if there are 100k files that are all source (thus they *don't*
match), you end up with 3.34 seconds for the MatchSplit vs 2.4 for
NoMatchSplit and 15.07 for Separate. So it should be 0.334 seconds for a
kernel sized tree. (versuse 1.5s just spent in the is_ignored function)

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: match-time.png
Type: image/png
Size: 20774 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/f2b75439/attachment.png 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ,timing.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 8999 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/f2b75439/attachment.ods 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unmatched-times.png
Type: image/png
Size: 17690 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/f2b75439/attachment-0001.png 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/f2b75439/attachment.pgp