Effects of customizing search-key hash

Fri Feb 27 05:24:00 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
...

> 16-way  Inventories:    21897 KiB   2%     10834 KiB  46%    15374
> 255-way Inventories:    31012 KiB   3%     15987 KiB  56%    12090
> 
> 16/255  Inventories:    31601 KiB   3%     19090 KiB  60%    15197
> 63/255  Inventories:    18043 KiB   1%     11604 KiB  48%    17151
> 127/255 Inventories:    16763 KiB   1%     10909 KiB  46%    17712
> 127-way Inventories:    16647 KiB   1%     10769 KiB  46%    17627
> 
> So 127 seems to be the 'sweet spot' for this tree. Also all of this is
> without compression. I still want to test this with compression. Though
> I noticed recently that the 'gc' format repositories weren't getting
> better compression than plain zlib compression for the chk pages. I
> think I know why, I just need to spend some time to fix it.
> 

So I think I worked out an ordering for the chk pages. It seems to work
well, though you unfortunately need to walk the data.

Basically, start by walking inventories in newest-first (reverse
topological) order. Track what id_to_entry and
parent_id_basename_to_file_id records are found.

Walk those top level nodes in the same order. Note that the sha1 keys
don't have any obvious ordering, but we infer one from the first time we
see each key during the sorted inventory walk.

As you walk the top level nodes, keep track of the search prefix for
each child record. Also track which child records have been requested
for transfer, and which ones have already been copied.

Group the child requests by search prefix, going in 'sorted' order
(so search keys for 'a' are copied before search keys for 'b', etc.).
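
The walk described above can be sketched roughly as follows. This is
only an illustration under assumptions: order_chk_pages, get_children
and the toy 'pages' dict are hypothetical names, not bzrlib API, and the
sketch handles a single map, whereas the real walk would cover both the
id_to_entry and parent_id_basename_to_file_id roots.

```python
def order_chk_pages(root_keys, get_children):
    """Return page keys in transfer order, one tree layer at a time.

    root_keys: root page keys in the order the newest-first inventory
        walk first saw them (hypothetical input).
    get_children: callable mapping a page key to a list of
        (search_prefix, child_key) pairs; leaf pages return [].
    """
    seen = set()
    layer = []
    # Keep only the first sighting of each root, preserving walk order.
    for key in root_keys:
        if key not in seen:
            seen.add(key)
            layer.append(key)

    ordered = []
    while layer:
        ordered.extend(layer)
        # Collect child requests per search prefix, in first-seen order.
        by_prefix = {}
        for key in layer:
            for prefix, child in get_children(key):
                if child in seen:
                    continue  # already requested or copied
                seen.add(child)
                by_prefix.setdefault(prefix, []).append(child)
        # Next layer: all 'a' children before all 'b' children, etc.
        layer = [child
                 for prefix in sorted(by_prefix)
                 for child in by_prefix[prefix]]
    return ordered


# Toy data: two revisions that share the page under prefix 'b'.
pages = {
    'root2': [('a', 'A2'), ('b', 'B1')],
    'root1': [('a', 'A1'), ('b', 'B1')],
    'A2': [], 'A1': [], 'B1': [],
}
order = order_chk_pages(['root2', 'root1'], pages.__getitem__)
print(order)
```

Pages that share a search prefix end up adjacent in the output, which
is what gives the compressor similar neighbours to work with.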

The good thing about this is that we can stream the bytes across and
work out what is next as we go, rather than needing to look it all up
ahead of time. The bad part is that you need the fulltexts, and that you
have to start at 'inventories', which aren't even *in* the chk_bytes
versionedfiles store.

I think the new get_stream() code will still be amenable to this,
because it is written as a 'stream of streams', which seems well suited
to handle it.
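
To illustrate why a 'stream of streams' shape suits this, here is a
minimal generator sketch. It is not the actual get_stream() code;
layered_streams, get_children and the toy 'pages' dict are hypothetical.
Each substream yields one layer of pages, and the next layer is only
worked out while the current one is being consumed.

```python
def layered_streams(root_keys, get_children):
    """Yield (depth, substream) pairs, one substream per tree layer.

    Each substream must be fully consumed before the next pair is
    requested, because the next layer is discovered while streaming.
    """
    seen = set()
    layer = []
    for key in root_keys:
        if key not in seen:
            seen.add(key)
            layer.append(key)

    depth = 0
    while layer:
        by_prefix = {}

        def substream(keys=layer, by_prefix=by_prefix):
            for key in keys:
                # Note the children as a side effect of sending the page.
                for prefix, child in get_children(key):
                    if child not in seen:
                        seen.add(child)
                        by_prefix.setdefault(prefix, []).append(child)
                yield key

        yield depth, substream()
        # Build the next layer from what the substream just discovered.
        layer = [child
                 for prefix in sorted(by_prefix)
                 for child in by_prefix[prefix]]
        depth += 1


pages = {
    'root2': [('a', 'A2'), ('b', 'B1')],
    'root1': [('a', 'A1'), ('b', 'B1')],
    'A2': [], 'A1': [], 'B1': [],
}
result = [(depth, list(sub))
          for depth, sub in layered_streams(['root2', 'root1'],
                                            pages.__getitem__)]
print(result)
```

Nothing has to be looked up ahead of time: the outer generator only
needs the previous layer to have been sent before it can name the next.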

With those changes, the compression starts to get decent. For example:

Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       828 KiB   4%     1043
Inventories:    31012 KiB   3%      6554 KiB  35%    12090
Texts:         882272 KiB  96%     11301 KiB  60%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         917275 KiB 100%     18684 KiB 100%    20359

That is down from about 15MB for Inventories if you use chk255 without
gc compression.

Another interesting bit is that while gc+chk16 repositories have a
smaller Raw inventory (about 21MB), they end up with a larger compressed
inventory (13.5MB).

The way I see it, the size of a *delta* is based on the depth of the
tree, and it doesn't matter how wide an internal node is. As a
case in point, consider our current inventory structure, which is
effectively 1 giant node, with no depth.

                      Raw    %    Compressed    %  Objects
Revisions:       3949 KiB   0%      1201 KiB   8%     1043
Inventories:   842115 KiB  48%      1840 KiB  12%     1043
Texts:         882272 KiB  51%     11174 KiB  78%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        1728337 KiB 100%     14216 KiB 100%     9312

This is a pack-1.9 repo, not gc or chk. But notice that the inventories
are still only a bit over a quarter the size of the gc+chk255 ones
(1840 KiB vs 6554 KiB compressed). I'm curious to see how this changes
over time when converting more of the total ancestry. (Since at this
depth, we probably don't have many fulltexts in the ancestry.)

gc+chk repositories also have 'fulltexts' periodically, but each one is
a lot smaller.
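
The depth argument can be made a little more concrete with a
back-of-the-envelope sketch. It assumes an idealised, evenly filled
tree in which changing one entry rewrites exactly one page per level;
real chk maps split on page size as well, so treat the numbers as
illustrative only. tree_depth and the 10000-entry figure are made up
for the example.

```python
def tree_depth(num_entries, fanout):
    """Smallest depth d (at least 1) with fanout**d >= num_entries."""
    depth = 1
    capacity = fanout
    while capacity < num_entries:
        capacity *= fanout
        depth += 1
    return depth


NUM_ENTRIES = 10000  # hypothetical tree size

# Pages rewritten by a single-entry change == depth of the tree.
depths = {
    '16-way': tree_depth(NUM_ENTRIES, 16),
    '255-way': tree_depth(NUM_ENTRIES, 255),
    'one giant node': tree_depth(NUM_ENTRIES, NUM_ENTRIES),
}
for name, depth in depths.items():
    print(name, depth)
```

Under these assumptions a 16-way tree touches 4 pages per change, a
255-way tree 2, and the flat inventory only 1, each of which then has
to be stored as a delta (or occasionally a fulltext).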

I then spent a bit more time on it, tweaking some stuff based on how
'groupcompress' works. I basically changed it so that we don't try to
compress between layers, which are likely to not have overlap anyway.
And suddenly I was able to get a gc+chk repository's inventory to be
smaller than the original:

Commits: 1047
                      Raw    %    Compressed    %  Objects
Revisions:       3992 KiB  11%       828 KiB  36%     1047
Inventories:    31013 KiB  88%      1468 KiB  63%    12099
...
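
As a rough picture of what 'not compressing between layers' means,
the sketch below starts a fresh compressor for every layer, so matches
can only be found among pages of the same layer. zlib is used purely
as a stand-in for groupcompress here, and compress_by_layer and the
sample pages are hypothetical.

```python
import zlib


def compress_by_layer(layers):
    """Compress each layer's pages as an independent group.

    layers: list of lists of page bytes, root layer first.
    Returns one compressed blob per layer; page boundaries inside a
    blob are not recorded in this simplified sketch.
    """
    blobs = []
    for layer_pages in layers:
        # Fresh compressor state: no matches against other layers.
        compressor = zlib.compressobj()
        chunks = [compressor.compress(page) for page in layer_pages]
        chunks.append(compressor.flush())
        blobs.append(b''.join(chunks))
    return blobs


layers = [
    [b'root page for rev 2\n', b'root page for rev 1\n'],
    [b'leaf a for rev 2\n', b'leaf a for rev 1\n', b'leaf b\n'],
]
blobs = compress_by_layer(layers)
print([len(blob) for blob in blobs])
```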

\o/

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmnePAACgkQJdeBCYSNAAOWvQCfVVOF0sFDhybj1Hf+MWYG9NzB
mGUAn2I98b0Pz0Tiic7t3kHBzS7F6CgK
=5Xyt
-----END PGP SIGNATURE-----