'bzr status' stats each file multiple times

Sun Dec 4 20:45:58 GMT 2005

On Sun, 4 Dec 2005 14:24, John A Meinel wrote:
> Michael Ellerman wrote:
> > On Sun, 4 Dec 2005 13:41, John A Meinel wrote:
> >> Michael Ellerman wrote:
> >>> On Sun, 4 Dec 2005 08:43, John A Meinel wrote:
> >>>> The idea of calling hashcache.scan() early, is that you can stat in
> >>>> inode order. Which according to git, should be faster. And who would
> >>>> know better than kernel hackers. :)
> >>>
> >>> OK that sounds reasonable, except that we then re-stat everything. So
> >>> that's two stats for every file in the cache right there, sounds like a
> >>> false optimisation to me.
> >>>
> >>> If we're interested in being really fast, then we should do the entire
> >>> status operation in inode order - we can sort the output if we want to.
> >>> But step one should be to make sure status only stats each file exactly
> >>> once.
> >>>
> >>>> I think we should have a timeout value. So that if I stat'd the file
> >>>> within the last X seconds, I assume the file hasn't changed.
> >>>> We can even go one better, and do a double check saying, if the file's
> >>>> mtime is older than Y seconds, and I have stat'd the file within X
> >>>> seconds, don't stat again.
> >>>
> >>> Hmm, that sounds like a kludge to me - I think we can improve on the
> >>> current times before we need to resort to something like that.
> >>
> >> How is it a kludge? Isn't it exactly what you just requested. Stat
> >> everything 1 time?
> >
> > I think it's a kludge because you're potentially losing information, ie.
> > that a file has changed recently, in order to gain performance. Why 5
> > seconds? Why not 1, 10, 60, 120 ?
> >
> > What I was suggesting is that we should work on the higher level code,
> > eg. compare_trees() to make sure it only requires one stat per file - and
> > preferably in inode order.
>
> Well, there are lots of places that need information about a file (does
> file exist, get file size, is executable, etc). It sounds like you are
> saying that "compare_trees" should keep a cache (possibly as part of
> each file entry) of what it has statted.
>
> I'm saying we already have a location which has the stat results, along
> with the sha1 hash for each file (hashcache). Why cache the same
> information twice in two different locations?

No, I wasn't suggesting compare_trees() cache anything. What I was thinking was
that we should change the algorithm so there's exactly one loop in
compare_trees() and it just looks at each file as it comes across it,
preferably in the optimal order.
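
Roughly what I have in mind, as a sketch only: this is not the real
compare_trees(), the function names are made up for illustration, and it
assumes a Python with os.scandir() so that inode numbers come from the
directory listing rather than from a stat call of their own.

```python
import os


def iter_files(root):
    """Yield (inode, path) for every non-directory under root.

    Hypothetical helper: the inode number is taken from the directory
    entry, so on most Unix filesystems no stat is needed at this stage.
    """
    stack = [root]
    while stack:
        directory = stack.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    yield entry.inode(), entry.path


def stat_once_in_inode_order(root):
    """Return {relative path: stat result}, calling lstat once per file."""
    stats = {}
    # Sorting by inode number is the "optimal order" assumption here.
    for _inode, path in sorted(iter_files(root)):
        stats[os.path.relpath(path, root)] = os.lstat(path)
    return stats
```

Everything after that one pass (exists, size, executable bit) would read
from the returned dict instead of going back to the filesystem, and the
output can be sorted by path at the end if we want.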

But I think I see what you mean: there are likely to be other parts of the code
that cause a stat to occur, even when compare_trees() already has that
information.

I guess I just worry that putting in arbitrary "x seconds" timeouts is a bad
idea, and might come back and bite us one day.
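
To make the worry concrete, here is a toy version of the timeout idea. It
is only a sketch: TimeoutStatCache and max_age are names I've invented for
the example, and this is not how hashcache actually works.

```python
import os
import time


class TimeoutStatCache:
    """Toy cache: reuse a stat result if it is younger than max_age seconds."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries = {}       # path -> (time of stat, stat result)

    def stat(self, path):
        now = self._clock()
        cached = self._entries.get(path)
        if cached is not None and now - cached[0] < self.max_age:
            # Within the timeout we trust the old result, so any change
            # made to the file since the last stat goes unnoticed.
            return cached[1]
        result = os.stat(path)
        self._entries[path] = (now, result)
        return result
```

If a file is rewritten two seconds after it was last statted and max_age is
five, the cache still hands back the old size and mtime. That is the lost
information I mean, and whichever value of x we pick only moves the window.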

cheers

-- 
Michael Ellerman
IBM OzLabs

email: michael:ellerman.id.au
inmsg: mpe:jabber.org
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person