Effects of customizing search-key hash
John Arbash Meinel
john at arbash-meinel.com
Fri Feb 27 05:24:00 GMT 2009
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John Arbash Meinel wrote:
...
> 16-way Inventories: 21897 KiB 2% 10834 KiB 46% 15374
> 255-way Inventories: 31012 KiB 3% 15987 KiB 56% 12090
>
> 16/255 Inventories: 31601 KiB 3% 19090 KiB 60% 15197
> 63/255 Inventories: 18043 KiB 1% 11604 KiB 48% 17151
> 127/255 Inventories: 16763 KiB 1% 10909 KiB 46% 17712
> 127-way Inventories: 16647 KiB 1% 10769 KiB 46% 17627
>
> So 127 seems to be the 'sweet spot' for this tree. Also all of this is
> without compression. I still want to test this with compression. Though
> I noticed recently that the 'gc' format repositories weren't getting
> better compression than plain zlib compression for the chk pages. I
> think I know why, I just need to spend some time to fix it.
>
So I think I worked out an ordering for the chk pages. It seems to work
well, though you unfortunately need to walk the data.
Basically, start by walking inventories in newest-first (reverse
topological) order. Track what id_to_entry and
parent_id_basename_to_file_id records are found.
Walk those top level nodes in the same order. Note that the sha1 keys
don't have any obvious ordering, but we infer it from the first-time we
see the key in the sorted inventory walking.
As you walk the top level nodes, keep track of the search prefix for
each child record. Also track what child records have been requested for
transfer, and what ones have already been copied.
group the child requests by the search prefix, going in 'sorted' order.
(so search keys for 'a' are copied before search keys for 'b', etc.)
The good things about this are that we can stream the bytes across and
work out what is next as we go, rather than needing to look it all up
ahead of time. The bad is that you need the fulltexts, and that you have
to start at 'inventories' which aren't even *in* the chk_bytes
versionedfiles store.
I think the new get_stream() code will still be amenable to this,
because it is written as a 'stream of streams', which seems well suited
to handle it.
With those changes, the compression starts to get decent. For example:
Commits: 1043
Raw % Compressed % Objects
Revisions: 3990 KiB 0% 828 KiB 4% 1043
Inventories: 31012 KiB 3% 6554 KiB 35% 12090
Texts: 882272 KiB 96% 11301 KiB 60% 7226
Signatures: 0 KiB 0% 0 KiB 0% 0
Total: 917275 KiB 100% 18684 KiB 100% 20359
Which is down from about 15MB for Inventories if you use chk255 without
gc compression.
Another interesting bit, is that while gc+chk16 repositories have a
smaller Raw inventory (about 21MB), it ends up with a larger compressed
inventory (13.5MB).
The way I see it, is that the size of a *delta* is based on the depth of
the tree, and it doesn't matter how wide an internal node is. As a
case-in-point, consider our current inventory structure, which is
effectively 1 giant node, with no depth.
Raw % Compressed % Objects
Revisions: 3949 KiB 0% 1201 KiB 8% 1043
Inventories: 842115 KiB 48% 1840 KiB 12% 1043
Texts: 882272 KiB 51% 11174 KiB 78% 7226
Signatures: 0 KiB 0% 0 KiB 0% 0
Total: 1728337 KiB 100% 14216 KiB 100% 9312
This is a pack-1.9 repo, not gc or chk. But notice that the inventories
is still about 1/5th the size of the gc+chk255. I'm curious to see how
this changes over time when converting more of the total ancestry.
(Since at this depth, we probably don't have many fulltexts in the
ancestry.)
gc+chk repositories also have 'fulltexts' periodically, but each one is
a lot smaller.
I then spent a bit more time on it, tweaking some stuff based on how
'groupcompress' works. I basically changed it so that we don't try to
compress between layers, which are likely to not have overlap anyway.
And suddenly I was able to get a gc+chk repository's inventory to be
smaller than the original:
Commits: 1047
Raw % Compressed % Objects
Revisions: 3992 KiB 11% 828 KiB 36% 1047
Inventories: 31013 KiB 88% 1468 KiB 63% 12099
...
\o/
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkmnePAACgkQJdeBCYSNAAOWvQCfVVOF0sFDhybj1Hf+MWYG9NzB
mGUAn2I98b0Pz0Tiic7t3kHBzS7F6CgK
=5Xyt
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list