[plugin] dirstate experiments

Wed Jun 14 15:08:17 BST 2006

On Wed, 2006-06-14 at 08:49 -0500, John Arbash Meinel wrote:
> Robert Collins wrote:
> > On Wed, 2006-06-14 at 01:38 -0500, John Arbash Meinel wrote:
> >> After reading Robert's discussion of possible dirstate requirements, I
> >> went ahead and implemented it in a few possible formats.
> > 
> > Nice  -  I'm *really* glad I did a writeup :). I shall be
> > peeking/borrowing from your code I suspect - though I'm planning to
> > start from the new tree format and work in from there - I may end up
> > with a different feel to things.
> > 
> > Whats flat vs blob ?
> > 
> > Rob
> 
> I played around with the flat format some more. It turns out that using
> 'base64.encodestring' instead of 'sha.hexdigest()' saves quite a bit of
> disk space.
>
> I also tried packing the longs with struct.pack(), and then base64
> encoding it. (One trick, base64.encodestring() always appends a newline).
> 
> It turns out that unpacking a set of longs is actually quite expensive.
> If when reading in I read them in as strings, and don't try to convert
> them back into longs, it saves about 100ms (220ms/350ms).
> 
> Now, one can argue that eventually they need to be turned back into
> longs so that we can use them, but we may not need to do that for every
> line in the file.

I agree.

> However, there is one other case. I can get the read time down to 185ms
> if I just don't unpack the stat. And if I think about it, you really
> don't need to.
> The only thing we use the stat for is to make sure the file hasn't
> changed. Which we can easily do by just packing the new stat, and
> comparing to the old packed value.
> It turns out that packing them barely even shows up in the timing, while
> unpacking them takes a long time. Same thing for the sha hash.
> 
> So the current best format is my 'flat' format, which packs up the stat,
> and base64 encodes the sha hash, uses '\0' to terminate the fields, and
> then doesn't really have to unpack them on the other end.
> 
> With this format, we are actually almost as fast as hg, if not actually
> faster. And we are using unicode strings, and using a format that while
> not pretty in a text editor, is actually somewhat readable. (At least in
> vim which handles null characters just fine).

Well, there are two cases here. The stat cache is just a validator for
the fs, so unpacking it does not make sense. However, the sha1 is stored
in the inventory in commit, and code expects to get at it, so I think
having the sha be accessible is a good idea - I'd rather pay a little
bit more in size for a format that is directly usable in the inventory
logic, as streaming reads are fast - its seeks that are slow.

> It most certainly is not editable in that form, but it is readable.
> 
> Also, it turns out that the major overhead at this point is the giant
> 'text.decode()', plus the overhead of operating on unicode objects
> rather than operating on str() objects.

I was wondering when that would start to bite :(.

> If I take out the .decode('utf8') the time jumps from 180ms down to 80ms.
> We may want to seriously consider not decoding at the time of reading
> the file, and only decoding what we need to later. I found a major win
> if we decode the whole file at once (versus decoding just the file_id
> and relpath). But we don't really have to do that.
> 
> Especially under Linux, we can just access the paths as bytestreams,
> rather than decoding them. Also, I'm not sure if we could just declare
> that our file_ids are utf-8 strings, rather than Unicode objects.

This is worth thinking about in parallel. Another alternative in this
format would be UCS2, or whatever the native unicode string inside
common python builds is. Serialisation to/from that should be extremely
fast. (If only by using cStringIO.StringIO(string here).getvalue() :))
(is that unportable enough ? :))

> Obviously on Windows our paths have to be unicode because the underlying
> apis change. But under Linux, if we use bytestreams, we can see massive
> dirstate improvements (faster than hg at reading the dirstate).
> 
> On a bzr.dev size tree, the time to build is 66ms, and the time to read
> is 3ms.
> 
> When you are down in these sizes, I do believe the hg guys that the seek
> time could seriously hurt you if you wanted to split up the manifest. We
> may find it to be a net improvement, since modification of the manifest
> is so much faster.

Indeed.

Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 191 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060615/758df754/attachment.pgp