[plugin] dirstate experiments

John Arbash Meinel john at arbash-meinel.com
Wed Jun 14 20:34:23 BST 2006


Robert Collins wrote:

...

> 
> Well, there are two cases here. The stat cache is just a validator for
> the fs, so unpacking it does not make sense. However, the sha1 is stored
> in the inventory in commit, and code expects to get at it, so I think
> having the sha be accessible is a good idea - I'd rather pay a little
> bit more in size for a format that is directly usable in the inventory
> logic, as streaming reads are fast - its seeks that are slow.
> 
>> It most certainly is not editable in that form, but it is readable.
>>
>> Also, it turns out that the major overhead at this point is the giant
>> 'text.decode()', plus the overhead of operating on unicode objects
>> rather than operating on str() objects.
> 
> I was wondering when that would start to bite :(.


So I've hacked on the format some more. Now I'm able to include an
arbitrary number of parent entries. And now the delimiter is '\n' all
the way around.
Just instead of iterating and pulling things off as I need it, I split
the text, and use slicing to pull out the chunks.
With 2 parent blobs (which are the same size as the current blob) I have
the time down to 180ms for reading everything in.

I also worked out actually building up the lists of objects, and if I
fast-path the cases of having 1 or 2 parents, then the total cost
doesn't go up too much. I'm at 280ms for 2 parents.

It turns out that 'unicode.split' is 2x slower than 'string.split', so
it takes 120ms (lsprof) versus 68ms. And then you pay another 60ms to
decode('utf8') the whole file, for a total real cost of 360ms instead of
280ms.

At this point I'm running into the time it takes to do 60k additions,
and 20k list.appends().
So I don't think there is much more room for improvement.
I'm curious if we would be better off having a separate file for basis
information, just in case we ever have a use case for reading the
working inventory without wanting the basis stuff.

This is the breakdown based on number of parent entries:
num	str	unicode	file_size
0	100ms	150ms	3.1MB
1	177ms	268ms	5.7MB
2	275ms	352ms	5.9MB
3	370ms	470ms	6.1MB

(With >1 parent, I just have the extra parents have null: records, not
perfectly reasonable, but I needed something.)

So if we restrict ourselves to not caching any parents, we can get this
to be very fast. In comparison, hg's dirstate read of the same kernel
tree takes 120ms. Though their dirstate only records 'state'
(removed,added,needs merge), size, mtime, and path. They don't save the
kind, hash, file_id, or the extra stat bits.

If I restrict my list to just include what they include, my file size
drops to 1M, and it can be read in 70ms. (revno 42 explores an hg style
dirstate). So we do pay for our extra information. 110ms versus 70ms for
having file_ids. Though still faster than hg's struct packing.

And while I'm on it, adler32 checksum is really frickin' fast. On 6MB it
takes about 4ms to complete. Way faster than sha1.

I'm pretty happy with how everything has turned out. It ends up that I'm
using a format almost exactly like what Robert mentioned, and with all
our extra information, we can still read it really fast.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060614/50eb0324/attachment.pgp 


More information about the bazaar mailing list