[plugin] dirstate experiments

Thu Jun 15 17:50:55 BST 2006

John Arbash Meinel wrote:

...

> Here is the performance differences. The first pass for \0+\n doesn't
> use list comprehensions for the parsing of >0 parents.
> 
> 0, 1, and 2 parents all have fast-path code since they will be the
> common case. 3 parents uses a generic "iterate over the records and
> build it up", rather than just fast slicing.
> 
> parents	\n	\0+\n
> 0	106ms	120ms
> 1	185	228
> 2	275	315
> 3	370	450
> 

I've come up with an semi-evil hack to get around the problem of the
lack of list comprehensions, and calling str.split() all the time.
Basically, each record is just terminated with a '\n' field, which
itself is wrapped in '\0'.
So the real terminator is now '\0', and it just happens that every entry
has an unused field which is '\n'.

This gets me back to the original performance, while still allowing us
to see each entry on a line by itself.

parents	\n	\0+\n	\0+(\0\n\0)	\0+(\0\n\0) unicode
0	106	120	108		158
1	185	228	190		280
2	275	315	276		380
3	370	450	385		470

The file size is only slightly bigger (2 chars per record), but
considering the performance boost, it is probably worth it.
It is also nice that if you want to do more integrity checking, you
could just check that the field for every entry was '\n'.

So this is probably my preferred format. Using '\0' to separate the
fields, and having a '\n' between entries (that we just throw away).

I was wondering if we could do better with a byte-packed format, but the
basic problem is that we have variable length records. If we were
willing to say "paths must be <256 (utf8) characters, and ids <64, etc"
we could create a very sparse format that could be fixed size.

However, right now the average record length with 1 parent is around
200-300 characters. If we bloat everything to be fixed size, all of a
sudden we are at 2 paths + 1 file_id + 1 revision_id + misc overhead,
and with the above fixed lengths, that comes to 512+128+20 = 660
characters, or 12MB to store a kernel tree. This isn't horrible
horrible, since right now my format is at 5MB for the same tree. Though
for each new parent we add 340 characters.
We could chop a lot of that off if we switch to 128 character paths, and
shorter id lengths.
However, because of upgrading from Arch, we already have some file ids
which are 80 characters long. Arch directory ids contain the full
username plus datestamp, such as:
x_John_Arbash_Meinel_<john at arbash-meinel.com>_Thu_Mar_17_16:20:22_2005_27377.0

So my recommendation is to use \0 as a field delimiter and \n\0 as a
record delimiter.
It puts things nicely on lines, lets us do a single split + slicing,
while still giving integrity bounding.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060615/fd07ddbb/attachment.pgp