[plugin] dirstate experiments

Wed Jun 14 14:49:55 BST 2006

Robert Collins wrote:
> On Wed, 2006-06-14 at 01:38 -0500, John Arbash Meinel wrote:
>> After reading Robert's discussion of possible dirstate requirements, I
>> went ahead and implemented it in a few possible formats.
> 
> Nice  -  I'm *really* glad I did a writeup :). I shall be
> peeking/borrowing from your code I suspect - though I'm planning to
> start from the new tree format and work in from there - I may end up
> with a different feel to things.
> 
> Whats flat vs blob ?
> 
> Rob

I played around with the flat format some more. It turns out that using
'base64.encodestring' instead of 'sha.hexdigest()' saves quite a bit of
disk space.

I also tried packing the longs with struct.pack(), and then base64
encoding it. (One trick, base64.encodestring() always appends a newline).

It turns out that unpacking a set of longs is actually quite expensive.
If when reading in I read them in as strings, and don't try to convert
them back into longs, it saves about 100ms (220ms/350ms).

Now, one can argue that eventually they need to be turned back into
longs so that we can use them, but we may not need to do that for every
line in the file.

However, there is one other case. I can get the read time down to 185ms
if I just don't unpack the stat. And if I think about it, you really
don't need to.
The only thing we use the stat for is to make sure the file hasn't
changed. Which we can easily do by just packing the new stat, and
comparing to the old packed value.
It turns out that packing them barely even shows up in the timing, while
unpacking them takes a long time. Same thing for the sha hash.

So the current best format is my 'flat' format, which packs up the stat,
and base64 encodes the sha hash, uses '\0' to terminate the fields, and
then doesn't really have to unpack them on the other end.

With this format, we are actually almost as fast as hg, if not actually
faster. And we are using unicode strings, and using a format that while
not pretty in a text editor, is actually somewhat readable. (At least in
vim which handles null characters just fine).

It most certainly is not editable in that form, but it is readable.

Also, it turns out that the major overhead at this point is the giant
'text.decode()', plus the overhead of operating on unicode objects
rather than operating on str() objects.

If I take out the .decode('utf8') the time jumps from 180ms down to 80ms.
We may want to seriously consider not decoding at the time of reading
the file, and only decoding what we need to later. I found a major win
if we decode the whole file at once (versus decoding just the file_id
and relpath). But we don't really have to do that.

Especially under Linux, we can just access the paths as bytestreams,
rather than decoding them. Also, I'm not sure if we could just declare
that our file_ids are utf-8 strings, rather than Unicode objects.

Obviously on Windows our paths have to be unicode because the underlying
apis change. But under Linux, if we use bytestreams, we can see massive
dirstate improvements (faster than hg at reading the dirstate).

On a bzr.dev size tree, the time to build is 66ms, and the time to read
is 3ms.

When you are down in these sizes, I do believe the hg guys that the seek
time could seriously hurt you if you wanted to split up the manifest. We
may find it to be a net improvement, since modification of the manifest
is so much faster.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060614/5f694144/attachment.pgp