[RFC] Strawman replacement local directory state

Tue Jun 13 17:43:51 BST 2006

Generally, I think you've covered a lot of ground here. And it sounds
pretty good.

Robert Collins wrote:
> I've put a strawman working tree state file format up at
> http://bazaar-vcs.org/Classes/WorkingTree.
> 
> Its got some interesting aspects, and some things that we might choose
> to address...
> 
> Its designed as a 'parse it fully every time, and write it fully too,
> but make it fast'. format.

What if you split it up by directory? I know the hg guys thought of that
as a net loss, because of the seek between files overhead, but if we are
including a hashcache which needs to be updated, it may mean that we can
keep the overall updates to a much smaller set.

> 
> It has the basis inventory, merge list and current inventory and
> hashcache built in.
> 
> This is IMO desirable because most of our operations that need the basis
> inventory actually want to to compare_trees on it. We can yield that
> *directly* out of the format I am proposing - the basis data is
> colocated with the active data, which means a streaming read of the
> state file should be able to produce very fast status/diff/commit
> results. The suggests that we want methods on working tree to expose the
> idea of 'difference from basis' to allow this sort of optimisation to
> hook in.

I can see that as a possibility. I don't want to end up with too much
stuff in WorkingTree, though. But this does seem reasonable.

> 
> On the down side, having the hashcache built in does mean that updates
> to it are more expensive - such as after a diff or status operation. On
> the other hand, as we can process this data as we go, we can be building
> the replacement state file *during* the diff or status operation.
> 
> On the down side again, this does require diff and status to take a
> working tree (only the working tree would be needed) write lock, because
> they would be updating a file that add/rename/remove would also affect. 

You could only take the lock during the short period of time that you
are actually modifying the file. But I guess if you want to modify as
you go, you may end up with it being invalidated underneath you.

> 
> On balance though, I think this should work very well.
> 
> Seeking thoughts and hole-finding.
> 
> Rob

'pending merges' is that for the whole tree, or per file? We've wanted
per-file 'pending-merges' for a while, since it can be used as a speed
boost for Commit.

'all fields on a newline' seems a little overdone.
I would rather see 'all blobs 1/newline' and then maybe '\0' termination
inbetween the blobs.

For the adler checksum + length, does that include the format length and
length of the checksum bytes? You end up with a recursive issue, since
modifying the length modifies the number, modifies the length, etc.
You could have a fixed size length, or just start counting the length
after the header.

Are you going to put keys in, or is it all positional data?
so the line becomes
	parents: id1 id2 id3\n
or is it just
	id1 id2 id3\n

I hesitate to have pure positional data, because recovery is very
difficult. I would like to see how much overhead they cause, since it
would be possible to make some gains there.

The reason to not prefer to only use a single separator ('\n') is so
that you can break everything up into chunks, and then break those
chunks up as appropriate.

I definitely prefer to use ascii/hexlified text as much as possible,
since it means opening it up in a text editor will have real meaning. I
would be curious about the performance differences, though.

We might investigate something that has a reasonable ascii header, and
the rest is all positional binary. Then you read in the entire file as a
string, and use 'struct.unpack()' to pull out information from it.

Does dirstate need to include entries for unknown files? It seems like
it should. And it seems like it should also include entries for ignored
files, so that we can quickly discard them. Though it might require
tracking some sort of 'hash of the ignore rules' so that we know to
check them again if the ignore rules change.

If you are including stuff like unknowns, it also stands to reason that
you don't really want to add 10 blank lines for each unversioned file.
Hence why I would recommend going with a \0 + \n formatted file.
Then you split on '\n' to get each file hunk, and for each file you
could split on '\0' to get the file information.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060613/fa7ed13b/attachment.pgp