hash cache tuning

Martin Pool mbp at sourcefrog.net
Wed Aug 22 08:49:29 BST 2007


I did actually observe the behaviour Andrew refers to while working on
the pre-dirstate hashcache some time ago.  It looks like gutsy's kernel has
a fix to cap the resolution to the actual stored resolution of the
filesystem, but it would be foolhardy to rely on this.

If we want to know the resolution, I think the straightforward thing
is to do a statvfs (once per process!) and look at the filesystem
type.

> Now in general we cannot write a new file to the hashcache because we
> cannot be sure its not altered post creation without re-reading it
> outside the filesystems modification granularity.
>
> However I think this is bogus: writing to files in limbo while we are
> organising the tree just gives the user both pieces to take home.

I don't understand.

> So:
>  we stat files as we write them to limbo.
>  For the first file, we do an additional stat after making it but before
> writing content, which we use to determine if we have subsecond, second,
> or multisecond granularity.
>  At the end of the limbo operation, some fraction of the files are
> probably now older than the detected stat granularity; those we put into
> the stat cache/dirstate.

Also of course moving the file into place will change its ctime, and
that's part of the validator.

I think it's reasonable to say that modifying the tree while bzr is in
the middle of committing or building the tree is not supported.  In
particular if we think we've written a file but the user changed it
while we were writing, we may get things wrong.

So the approach I'd suggest is: immediately after putting the file in
its final location, stat it.  At the end of the operation, write the
hashcache containing only the files whose stat timestamps are older
than the exclusion window.
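
A minimal sketch of that approach, in Python.  This is not the bzr
implementation: the function names, the tuple layout, and the one-second
default window are all assumptions made for illustration.

```python
import os


def stat_fingerprint(path):
    """Stat a file right after it has been put in its final location.

    Returns (size, mtime, ctime, inode, device, mode).  This tuple layout
    is an assumption, not the actual hashcache format.
    """
    st = os.lstat(path)
    return (st.st_size, st.st_mtime, st.st_ctime,
            st.st_ino, st.st_dev, st.st_mode)


def cacheable_entries(recorded, now, window=1.0):
    """Pick the entries that are safe to write to the cache.

    `recorded` maps path -> fingerprint as returned by stat_fingerprint.
    An entry qualifies only if both its mtime and its ctime are older
    than `now - window`; anything newer falls inside the exclusion
    window and is left uncached.
    """
    cutoff = now - window
    return {
        path: fp
        for path, fp in recorded.items()
        if fp[1] < cutoff and fp[2] < cutoff
    }
```

In use, the tree builder would call stat_fingerprint on each file as it
lands, collect the results in a dictionary, and at the very end pass
that dictionary with time.time() to cacheable_entries before writing the
cache.  Files left out are simply re-read and re-hashed the next time
they are examined.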

This would mean that after a commit/build-tree, we will have cache
entries for everything except, on average, the files built in the last
500ms (on Linux).  The uncached files are therefore ones that we can
reasonably expect to re-read quickly.  (Writes into cache may be
faster than reads that need to go to disk, but there is still some
kind of proportionality.)

-- 
Martin


