[RFC] Strawman replacement local directory state
Robert Collins
robertc at robertcollins.net
Tue Jun 13 18:15:22 BST 2006
On Tue, 2006-06-13 at 11:43 -0500, John Arbash Meinel wrote:
> Generally, I think you've covered a lot of ground here. And it sounds
> pretty good.
Thanks. I think profiling with a sqlite db is probably worth doing at
some point. I'm hesitant about using it, though, because while you avoid
some overhead, you add other overhead :).
> Robert Collins wrote:
> > I've put a strawman working tree state file format up at
> > http://bazaar-vcs.org/Classes/WorkingTree.
> >
> > It's got some interesting aspects, and some things that we might choose
> > to address...
> >
> > It's designed as a 'parse it fully every time, and write it fully too,
> > but make it fast' format.
>
> What if you split it up by directory? I know the hg guys thought of that
> as a net loss, because of the seek between files overhead, but if we are
> including a hashcache which needs to be updated, it may mean that we can
> keep the overall updates to a much smaller set.
Well, it is structured by directory now; we could certainly test that.
> >
> > It has the basis inventory, merge list and current inventory and
> > hashcache built in.
> >
> > This is IMO desirable because most of our operations that need the basis
> > inventory actually want to run compare_trees on it. We can yield that
> > *directly* out of the format I am proposing - the basis data is
> > colocated with the active data, which means a streaming read of the
> > state file should be able to produce very fast status/diff/commit
> > results. This suggests that we want methods on working tree to expose
> > the idea of 'difference from basis' to allow this sort of optimisation
> > to hook in.
>
> I can see that as a possibility. I don't want to end up with too much
> stuff in WorkingTree, though. But this does seem reasonable.
I agree that we should be careful about API bloat here...
> >
> > On the down side, having the hashcache built in does mean that updates
> > to it are more expensive - such as after a diff or status operation. On
> > the other hand, as we can process this data as we go, we can be building
> > the replacement state file *during* the diff or status operation.
> >
> > On the down side again, this does require diff and status to take a
> > working tree (only the working tree would be needed) write lock, because
> > they would be updating a file that add/rename/remove would also affect.
>
> You could only take the lock during the short period of time that you
> are actually modifying the file. But I guess if you want to modify as
> you go, you may end up with it being invalidated underneath you.
You have to lock the entire time: otherwise someone can do a commit
while you are blocked on output for diff, and then when you flush at the
end it barfs.
Alternatively you can take a lock at the end and check that the checksum
in the header is unaltered... which is more complex but still safe enough.
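To make that concrete, here is a minimal Python sketch of the 'lock late, verify the checksum' idea: record the file's checksum when you start, build the new state in memory during diff/status, and only flush if the file on disk is still what you read. All the names here are hypothetical, invented for illustration - this is not bzr's actual API.

```python
import zlib

def read_checksum(path):
    """Checksum the current on-disk state file."""
    with open(path, "rb") as f:
        return zlib.adler32(f.read())

def flush_if_unchanged(path, original_checksum, new_content):
    """Write new_content only if nobody rewrote the file since we started.

    Returns False (and writes nothing) if the checksum no longer matches,
    e.g. because a commit happened while we were streaming diff output.
    """
    if read_checksum(path) != original_checksum:
        return False
    with open(path, "wb") as f:
        f.write(new_content)
    return True
```

The real thing would need the check-and-write to happen under the lock, of course; the sketch only shows the validation step.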
> >
> > On balance though, I think this should work very well.
> >
> > Seeking thoughts and hole-finding.
> >
> > Rob
>
> 'pending merges' is that for the whole tree, or per file? We've wanted
> per-file 'pending-merges' for a while, since it can be used as a speed
> boost for Commit.
The pending merges list is the list of pending tree-wide merges. The per-
file pending merges are listed inline in the format, so we get that
boost. It's not quite variadic enough to do arbitrary sub-graphs, but
that's not supported now, so I don't see it as an issue. We could possibly
just do 'null:' for all the files not being merged in a sub-tree merge,
but let's discuss that later: this does not need to be the final format.
> 'all fields on a newline' seems a little overdone.
> I would rather see 'all blobs 1/newline' and then maybe '\0' termination
> inbetween the blobs.
From what I've heard, Python is really optimised for line-based stuff.
We can profile this. I think in this regard we should choose for speed
though.
> For the adler checksum + length, does that include the format length and
> length of the checksum bytes? You end up with a recursive issue, since
> modifying the length modifies the number, modifies the length, etc.
> You could have a fixed size length, or just start counting the length
> after the header.
The checksum and length cover the bytes starting on the next line, so
the header itself isn't counted and there's no recursion :)
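A tiny Python sketch of that layout (the record contents are invented; only the header scheme matters): because the checksum and length describe only the bytes after the header line, writing the header can never change the values it records.

```python
import zlib

def serialize(records):
    """Serialize records, then prefix a header describing only the body.

    Header line: '<adler32> <length>\n', covering the bytes after it.
    """
    body = b"".join(field + b"\n" for field in records)
    header = b"%d %d\n" % (zlib.adler32(body), len(body))
    return header + body
```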
> Are you going to put keys in, or is it all positional data?
> so the line becomes
> parents: id1 id2 id3\n
> or is it just
> id1 id2 id3\n
>
> I hesitate to have pure positional data, because recovery is very
> difficult. I would like to see how much overhead they cause, since it
> would be possible to make some gains there.
Positional only.
> The reason to not prefer to only use a single separator ('\n') is so
> that you can break everything up into chunks, and then break those
> chunks up as appropriate.
>
> I definitely prefer to use ascii/hexlified text as much as possible,
> since it means opening it up in a text editor will have real meaning. I
> would be curious about the performance differences, though.
I don't want to support opening it in a text editor at all: we can write
debugging tools for this, and as it's in the innermost loop of *every*
common operation a user does to a tree, when choosing between speed and
human-readability, speed will win every time.
> We might investigate something that has a reasonable ascii header, and
> the rest is all positional binary. Then you read in the entire file as a
> string, and use 'struct.unpack()' to pull out information from it.
struct.unpack uses regexes, and the *vast* bulk of our data is strings.
That said, I'm open to experimentation, but you get into issues with
'where can NULL bytes appear' if you want to split on \0, or newlines if
you split on \n.
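For the record, the 'ascii header plus struct.unpack' shape John describes would look roughly like this sketch. The marker string and field layout (big-endian u32 checksum, u32 body length) are invented for illustration, not a proposed format.

```python
import struct
import zlib

# Fixed-width binary header after a one-line ASCII format marker.
HEADER = struct.Struct(">II")

def pack_state(body):
    """Marker line, then binary (checksum, length) header, then the body."""
    return b"#state-sketch-1\n" + HEADER.pack(zlib.adler32(body), len(body)) + body

def unpack_state(blob):
    """Split off the marker line, then unpack the fixed-width header."""
    marker, rest = blob.split(b"\n", 1)
    checksum, length = HEADER.unpack_from(rest)
    return marker, checksum, rest[HEADER.size:HEADER.size + length]
```

Note the body can contain NUL bytes and newlines freely here, since the header carries an explicit length - which is exactly the trade-off against split-on-delimiter parsing.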
> Does dirstate need to include entries for unknown files? It seems like
> it should. And it seems like it should also include entries for ignored
> files, so that we can quickly discard them. Though it might require
> tracking some sort of 'hash of the ignore rules' so that we know to
> check them again if the ignore rules change.
We could insert that inline into the file, sure.
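The 'hash of the ignore rules' idea is simple enough to sketch: store a digest of the rules alongside the cached ignored-file entries, and discard the cache when the digest changes. The helper below is hypothetical, not anything bzr actually shipped.

```python
import hashlib

def ignore_rules_hash(rules):
    """Digest the ignore rules so cached 'ignored' entries can be
    invalidated when the rules change."""
    h = hashlib.sha1()
    for rule in rules:
        h.update(rule.encode("utf-8"))
        h.update(b"\n")  # keep rule boundaries unambiguous
    return h.hexdigest()
```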
> If you are including stuff like unknowns, it also stands to reason that
> you don't really want to add 10 blank lines for each unversioned file.
> Hence why I would recommend going with a \0 + \n formatted file.
> Then you split on '\n' to get each file hunk, and for each file you
> could split on '\0' to get the file information.
Well, it depends what's faster - I don't think reasoning is particularly
good at predicting how Python will behave :)
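For anyone wanting to profile it, the '\0 within a record, \n between records' parse John suggests is a two-level split (the sample field layout is invented):

```python
def parse_entries(data):
    """Split once on newline for per-file hunks, then on NUL for fields."""
    return [line.split(b"\x00") for line in data.split(b"\n") if line]
```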
Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.