lock free dirstate - prototype

Wed Sep 30 15:19:59 BST 2009

Robert Collins wrote:
> On Wed, 2009-09-30 at 06:24 +0100, Martin Pool wrote:
> 
> On the case of bounded garbage, it's complicated because we don't know
> if a hash-named file is
> a) garbage
> b) a pending concurrent update from 'bzr st' or similar
> 
> I guess we can just attempt to delete them all during a semantic update.
> 
> The problems I see are that it's not well bounded, and the stat cache
> writers will need to handle more cases.
> 
> In any event, I've done my main spike; I'm waiting on some folk to
> look at it more closely, and we can make concrete alterations from
> there on in.
> 
>> It also tends to make us do more work than we have to for operations
>> like
>>
>>   echo foo >>iso
>>   bzr st
>>   echo bar >>iso
>>   bzr st
>>
>> This particular case may have been squashed, but the hashcache
>> approach makes it likely we'll read and hash the whole file, and write
>> the whole dirstate, on each operation, quite unnecessarily.  To me
>> this is much more common than just touching a dirstate file. 
> 
> We only hash a file if the size is unchanged.
> 
> -Rob

I'm pretty sure that, the way the code is now, we hash the file at some
point whether or not the content has changed.
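
For reference, the rule Robert states (hash only when the size is
unchanged) could be sketched as below. This is a minimal illustration,
not the actual bzrlib code path; the function name and its arguments
are hypothetical.

```python
import hashlib
import os


def check_modified(path, cached_size, cached_sha1):
    """Return (modified, hashed) for a file against its cached entry.

    Hypothetical helper, not bzrlib API. A size mismatch proves the file
    changed, so the content is only read and hashed when sizes match.
    """
    if os.stat(path).st_size != cached_size:
        return True, False  # modified, and no hashing was needed
    with open(path, "rb") as f:
        sha1 = hashlib.sha1(f.read()).hexdigest()
    return sha1 != cached_sha1, True
```

With this rule, the 'echo foo >>iso' case never hashes, because
appending always changes the size.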

The #1 reason we update the hashcache during 'status' is that
'bzr update', 'bzr merge' and 'bzr commit' did *not*. We squashed at
least 'bzr commit'; I don't know about the rest.

The idea here is that those commands tend to change a *lot* of files
(especially in the 'initial commit' case.) As such, it isn't the 3-4
files that a user has modified that are slightly out of date, but the
100s of files that were updated that would be re-read repeatedly.

I believe Ian tried a heuristic of "if fewer than XX changes, don't
write". That is good, but IIRC he didn't quite distinguish the "this
file has been moved/added" type of change, which is critical, from the
"this hashcache entry has been updated" type, which is not.
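
To make that distinction concrete, here is a minimal sketch of such a
heuristic. The function name, the two counters and the threshold value
are all hypothetical; none of them come from bzrlib, and this is not
what Ian's patch actually did.

```python
# Hypothetical sketch; none of these names come from bzrlib.
WORTH_SAVING_LIMIT = 10  # arbitrary stand-in for the "XX" threshold


def should_write_dirstate(semantic_changes, hashcache_updates,
                          limit=WORTH_SAVING_LIMIT):
    """Decide whether the dirstate is worth rewriting.

    semantic_changes: number of entries moved/added/removed.
    hashcache_updates: number of entries whose cached stat/sha value
        was merely refreshed.
    """
    if semantic_changes > 0:
        # These changes must not be lost, so always write.
        return True
    # Cache refreshes are only an optimisation: skip the write unless
    # enough of them have accumulated to be worth the I/O.
    return hashcache_updates >= limit
```

Under these assumptions, a 'bzr st' that refreshes 3-4 entries skips
the write, while one following a large merge (100s of refreshed
entries) does write.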

I think Robert's general work is a good step forward. I don't know that
it is worth "landing as a development format" without addressing more
issues that we have with dirstate.

I haven't dug into it enough to really understand the rename dance he is
doing. And I especially haven't thought through the various edge cases yet.

I thought at one point he wanted an indirection file. So that we would
have a file which pointed at the 'current' dirstate file. But now it
seems that he is just renaming the actual dirstate content file around.

It does make me wonder if we would be better off renaming directories
rather than files, like we do for the OS locking. So you would
have:

.bzr/checkout/
 current/dirstate
 $MD5SUM.new/dirstate
 $MD5SUM.old/dirstate
...

or arguably:
 old.$MD5SUM/dirstate

as then all the 'old' files sort together rather than being randomly
distributed by md5sum.

We know from experience that filesystems are generally better about
being consistent about directory renames than file renames. (As long as
the directory has content.)
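
A rough sketch of how a writer might use that layout follows. It is
only an illustration under several assumptions: the function name is
hypothetical, the md5 is taken over the new dirstate content, the
'old.$MD5SUM' spelling is used, and nothing here handles locking,
concurrent writers, or the brief window between the two renames when
'current' does not exist.

```python
import hashlib
import os
import shutil


def publish_dirstate(checkout_dir, content):
    """Write content to $MD5SUM.new/dirstate, then rotate it to current/.

    Hypothetical helper, not part of bzrlib.
    """
    digest = hashlib.md5(content).hexdigest()
    new_dir = os.path.join(checkout_dir, digest + ".new")
    cur_dir = os.path.join(checkout_dir, "current")
    old_dir = os.path.join(checkout_dir, "old." + digest)

    # The directory has content before it is ever renamed.
    os.mkdir(new_dir)
    with open(os.path.join(new_dir, "dirstate"), "wb") as f:
        f.write(content)

    if os.path.isdir(cur_dir):
        # Drop any stale leftover with the same name, then retire current.
        shutil.rmtree(old_dir, ignore_errors=True)
        os.rename(cur_dir, old_dir)
    os.rename(new_dir, cur_dir)
    return os.path.join(cur_dir, "dirstate")
```

Cleaning up the accumulated old.* directories is left out; that is the
same bounded-garbage question raised above.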

On the flip side, it adds a level of indirection and one more system
call (mkdir plus open().write(), versus just open().write()).

I'm not very concerned about rename performance on Windows in real-world
cases, but it would be nice to do better for the test suite. (Taking
50ms to rename a directory is no big deal when the status takes 200ms,
but when you run the test suite, 50ms * 100,000 locks and unlocks
starts to be noticeable.)
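
Spelling out that arithmetic, using the rough figures above (both
numbers are ballpark estimates, not measurements):

```python
RENAME_COST_MS = 50      # rough cost of one directory rename on Windows
LOCK_CYCLES = 100_000    # rough number of lock/unlock cycles in a test run

total_seconds = RENAME_COST_MS * LOCK_CYCLES / 1000.0
total_minutes = total_seconds / 60

print("%.0f seconds, about %.0f minutes" % (total_seconds, total_minutes))
```

So the rename overhead alone would add well over an hour to a full
test run at those rates.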

John
=:->