[MERGE] is_ignored improvements...

Thu May 18 21:45:33 BST 2006

On Thu, May 18, 2006 at 14:14:02 -0500, John Arbash Meinel wrote:
> Jan Hudec wrote:
> > On Thu, May 18, 2006 at 11:21:17 -0500, John Arbash Meinel wrote:
> 
> ...
> 
> >> Out of these requests 3 use the return value, and 6 just check for
> >> true/false.
> > 
> > ... so I just tried a simple, not at all representative test. I took patterns
> >   ['.*\\.%i' % i for i in range(0, 999) and matched them:
> > 1. Non-capturing groups, all at once.
> > 2. Non-capturing groups, 50 at a time, until match is found.
> > 3. Capturing groups, 50 at a time, until match is found.
> > 4. No groups, each separately, until match is found.
> > against a pattern corresponding to 789th group ('foo.789'). And the times for
> > 100000 runs on my machine (Athlon64 3000 (2.2GHz)) were about:
> > 1. 12.5
> > 2. 13.7
> > 3. 18.9
> > 4. 70.3
> > 
> > So the groups slows it down by about a half, but compared to matching one
> > pattern at a time it's not /that/ significant.
> > 
> > Now I have to say, that when I doubled the pattern legnth and matched against
> > 'foo.1789', the times doubled even for the single pattern. That means that
> > the pattern compiler does not compile to a DFA. So it will be N^2 in all
> > cases until python gets a better regex compiler.
> > 
> 
> I'm a little curious about what syntax you used for the final
> compilation. Because I tried:
> >>> patterns = ['.*\\.%i' % i for i in range(0,1999)]
> >>> all = re.compile('|'.join(patterns))
> >>> all.match('foo.789').group()
> 'foo.7'
> >>> all = re.compile('(' + '|'.join(patterns) + ')$')
> >>> all.match('foo.789').group()
> 'foo.789'
> 
> So without using the (patterns)$ form, it was actually matching the
> wrong pattern. (it was matching *.7 instead of *.789).

I used '|'.join(['(%s$)' % p for p in patterns]) (and the non-capturing
equivalent). Attaching the actual script.

> And I'm not positive, but I thought a DFA was not possible when you had
> a pattern with '^' or '$'.  At least that's what I remember reading from
> the tla source code. I'm not really a regex expert.

- For ^ it certainly is -- DFA actually does a match, not a search, so need
  to tweak it by prepending loops to start points without ^ to do a search.
- For $ it's possible -- and quite simple -- as well. Just include
  a transition for end-of-string or include a flag telling you whether the
  accept state is end-of-string only or not.

> Anyway, just to say, because we need to anchor at the beginning and end
> of the pattern I don't think it is possible to create a DFA.

These are perfectly doable with a DFA. There are other things that are hard
or impossible though. One that is impossible are backreferences ('(.*)(?=1)$'
is not a regular grammar). And for some others I am not sure whether they are
hard or actually impossible. One of them are actually the capture groups,
since you may need to record multiple starts and ends in parallel and then
choose the right match afterwards.

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ,time-matches.py
Type: text/x-python
Size: 1158 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/90a28ed8/attachment.py 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060518/90a28ed8/attachment.pgp