[Fwd: Re: 'bzr status' stats each file multiple times]

Tue Dec 6 22:33:02 GMT 2005

Forwarding this to the list.

-------- Original Message --------
Subject: Re: 'bzr status' stats each file multiple times
Date: Mon, 05 Dec 2005 09:16:50 -0500
From: Jean-François Veillette <jean_francois_veillette at yahoo.ca>
To: John A Meinel <john at arbash-meinel.com>
References: <43912EAB.9040207 at arbash-meinel.com> 
<200512041414.50543.michael at ellerman.id.au> 
<43935077.7060900 at arbash-meinel.com> 
<200512041446.01870.michael at ellerman.id.au> 
<43935F93.6060100 at arbash-meinel.com> <20051205073106.GB27529 at djinn>

John,
First, I have no clue about Python; I'm currently mainly a Java
programmer.  I'm on the bazaar-ng mailing list because I'm using bzr for
one of my projects and I'm curious about the development going on behind
it.  I'm posting just to you since I'm a bit shy about entering the
developer discussion, as I have no clue about Python or the bzr
implementation.

The cache problem:
- it seems that we do not want the cache to be available at large
(concurrency problems, re-entrancy, stale cache).
- the cache is accessed from many different parts of the system during a
single root call.
Solution:
Why not create the cache in the root method (the one asked to get
status) and pass it to sub-functions as an argument?
It could be called a 'context' and be a dictionary (hash table).
Depending on their needs, functions put/get information in the 'context';
one entry could be the hash cache.  That way, the hash cache is only
good for a single root call, doesn't go stale (and if it does, null
it out in the context), and doesn't need a timeout (since once the root
call is finished, the 'context' is discarded).
Each function in the call tree needed to process the status
would expect a context as a parameter.
On a larger note, you could extend the concept and have all functions
(where it makes sense) expect a context; if a function needs it and
is not given one (null was passed), it would create it and pass it to
the functions it calls.
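
A minimal Python sketch of the 'context' idea, for illustration only.
Every name here (status, find_modified, cached_hash, compute_hash, the
'hash_cache' key) is hypothetical and is not taken from the bzr source;
the SHA-1 of the file contents merely stands in for whatever expensive
per-file work the real code does.

```python
import hashlib


def compute_hash(path):
    """Expensive step: read the file and hash its contents."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def cached_hash(path, context):
    """Return the hash of path, computing it at most once per context."""
    cache = context.setdefault("hash_cache", {})
    if path not in cache:
        cache[path] = compute_hash(path)
    return cache[path]


def find_modified(paths, recorded, context=None):
    """Sub-function: accepts a context, or creates one if none is given."""
    if context is None:
        context = {}
    return [p for p in paths if cached_hash(p, context) != recorded.get(p)]


def status(paths, recorded):
    """Root call: the context lives only for the duration of this call."""
    context = {}
    modified = find_modified(paths, recorded, context)
    # This second pass reuses the hashes already stored in the context.
    unchanged = [p for p in paths
                 if cached_hash(p, context) == recorded.get(p)]
    return modified, unchanged
```

Because the dictionary is created inside status() and dropped when it
returns, nothing needs to be invalidated or timed out between calls.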

A happy bzr user!

- jfv

On 05-12-05, at 02:31, Jan Hudec wrote:

> On Sun, Dec 04, 2005 at 15:28:51 -0600, John A Meinel wrote:
>> Actually, the thing that seems to take a really long time for me is the
>> unknowns check. I have a rather large .bzrignore, because this project
>> likes to build inside the tree. It's a rather large project with 1600
>> source files, and about 50 output executables.
>> So I think bzr is trying to match each file it finds against all entries
>> in .bzrignore.
>> Since a lot of them are absolute paths, I'm thinking bzr should use a
>> dictionary for the absolute paths, then it can just say "is path in
>> ignored_paths" rather than doing a fnmatchcase against each one. 50x500
>> files takes a while.
>
> Looking at the fnmatch module, it does the matching via regexps internally:
>
> def fnmatchcase(name, pat):
>     if not pat in _cache:
>         res = translate(pat)
>         _cache[pat] = re.compile(res)
>     return _cache[pat].match(name) is not None
>
>
> (plus some imports and a docstring, which are irrelevant here).
>
> While it caches the patterns, it:
> 1) Does the hash lookup each time through the loop.
> 2) Matches the patterns independently.
>
> Thus I'd suggest:
> 1) Convert all the patterns to regexes. Using fnmatch.translate is
>    possible, though I am not sure whether the '$' it appends is a problem
>    in the next step. A custom translator would also have the advantage of
>    allowing the syntax to be extended (though I'd rather see an option to
>    put regexps in the .bzrignore directly).
> 2) Join all the patterns with '|'.
> 3) Compile the one long pattern.
> 4) Match each filename just once against this pattern.
>
> I am not sure how well python optimizes the regular expression, but it
> will certainly do a better job than matching against all of them
> separately.
>
> -- 
> 						 Jan 'Bulb' Hudec <bulb at ucw.cz>
>
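
The two suggestions in the quoted messages (a hash lookup for literal
paths, and one combined regex for the wildcard patterns) can be sketched
in Python roughly as follows. This is only an illustration under stated
assumptions: build_ignore_matcher and is_ignored are hypothetical names,
not bzr functions, and the test for "is this a wildcard pattern" is a
simple guess based on the characters fnmatch treats specially. Wrapping
each translated pattern in its own group keeps the end-of-string anchor
that fnmatch.translate appends local to that alternative, which is one
way around the concern raised in step 1.

```python
import fnmatch
import re


def build_ignore_matcher(patterns):
    """Return a function that reports whether a path is ignored."""
    exact = set()   # literal paths: a single hash lookup per file
    globs = []      # patterns that contain wildcard characters
    for pat in patterns:
        if any(ch in pat for ch in "*?["):
            globs.append(pat)
        else:
            exact.add(pat)

    combined = None
    if globs:
        # Steps 1-3: translate each pattern, join with '|', compile once.
        combined = re.compile(
            "|".join("(?:%s)" % fnmatch.translate(p) for p in globs))

    def is_ignored(path):
        if path in exact:
            return True
        # Step 4: match each filename just once against the one pattern.
        return combined is not None and combined.match(path) is not None

    return is_ignored
```

A caller would build the matcher once per .bzrignore and then apply
is_ignored to every file found in the tree.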