Current details for split-inventory work

John Arbash Meinel john at arbash-meinel.com
Wed Dec 3 02:39:44 GMT 2008


Now that I've done some updates to the CHKMap.map/unmap functions to
help ensure canonical form, I have been able to complete a conversion of
bzr.dev into --development4 format. It took around 40min. I didn't
profile it, but I would guess most of the time was spent deserializing
each Inventory from the source.

Anyway, I thought some people might be interested in the results.

$ wbzr repository-details bzr.dev/
Commits: 20916
                      Raw    %    Compressed    %  Objects
Revisions:      11298 KiB   0%      8170 KiB   9%    20916
Inventories:  3097450 KiB  61%     38976 KiB  45%    20916
Texts:        1937574 KiB  38%     35677 KiB  41%    44248
Signatures:      3401 KiB   0%      3205 KiB   3%     9031
Total:        5049725 KiB 100%     86029 KiB 100%    95111


$ time wbzr repository-details d4
Commits: 20916
                      Raw    %    Compressed    %  Objects
Revisions:      11300 KiB   0%      8170 KiB   5%    20916
Inventories:   159899 KiB   7%     89580 KiB  65%   170075
Texts:        1937574 KiB  91%     35677 KiB  26%    44248
Signatures:      3401 KiB   0%      3205 KiB   2%     9031
Total:        2112176 KiB 100%    136633 KiB 100%   244270

Extra Info:           count    total  avg stddev  min  max
internal node refs    77927  1240750   15   13.2    2   35
internal p_id refs     4743    51586   10    9.1    2   23
inv depth             64955   197639    3    1.4    1    8
leaf node items       64955   376262    5    5.3    1   18
leaf p_id items        1534    17771   11   11.0    1   44
p_id depth             1534     5595    3    2.6    1    7


So the raw size of the chk inventory is very impressive, dropping from
3GB down to 160MB (roughly a 19x reduction). The compressed size,
however, more than doubles, from 39MB to 90MB.

Some other bits to extract from this info:

1) On average we have 2.1 files changed for every revision. (44k texts,
21k revisions.)

2) These 2.1 changed files trigger an average of 8.1 inventory pages to
change. (1 is the inv info, the other 7.1 are chk pages.)
This is also borne out by the average inv depth of 3 and the p_id depth
of 3. So the changes trigger 3 file_id=>entry pages to be updated, and 3
p_id,basename => file_id pages to be updated.
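
The per-revision averages in (1) and (2) can be re-derived from the two
tables. This is only a small arithmetic sketch; the variable names are
mine, and the numbers are copied from the repository-details output above.

```python
# Figures copied from the two `repository-details` tables above.
commits = 20916
text_objects = 44248            # "Texts" objects (same in both formats)
d4_inventory_objects = 170075   # "Inventories" objects in the d4 repository

# (1) new file texts per revision
texts_per_revision = text_objects / commits

# (2) inventory objects written per revision; one of them is the
# per-revision inv info, the rest are chk pages
pages_per_revision = d4_inventory_objects / commits
chk_pages_per_revision = pages_per_revision - 1

# Raw inventory size, pack-0.92 versus d4, in KiB
raw_reduction = 3097450 / 159899

print(f"texts per revision:     {texts_per_revision:.1f}")
print(f"inv pages per revision: {pages_per_revision:.1f}")
print(f"chk pages per revision: {chk_pages_per_revision:.1f}")
print(f"raw inventory shrinks by a factor of about {raw_reduction:.0f}")
```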

I'm not 100% sure about those details, though, as I haven't probed
deeply. I'll note that we do get all the way to a depth of 8, which
means a single change to that leaf must create 8 new pages.
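
To illustrate why depth translates directly into new pages, here is a toy
content-addressed tree. It is not the real CHKMap code or page format;
Store, write_path and the page contents are all made up for the example.
The only assumption carried over is that a page's key is a hash of its
content, so changing a leaf changes the key stored in its parent, and so
on up to the root.

```python
import hashlib


class Store:
    """Toy content-addressed page store (hypothetical, not bzrlib)."""

    def __init__(self):
        self.pages = {}

    def add(self, content):
        """Store a page; return (key, True if the page was not stored before)."""
        key = "sha1:" + hashlib.sha1(content).hexdigest()
        is_new = key not in self.pages
        self.pages[key] = content
        return key, is_new


def write_path(store, depth, leaf_content):
    """Write one leaf and the depth-1 internal nodes above it.

    Each internal node holds a reference to an unchanged sibling plus the
    key of the child below it. Returns (root_key, number_of_new_pages).
    """
    key, is_new = store.add(leaf_content)
    new_pages = int(is_new)
    for _ in range(depth - 1):
        node = b"internal\nunchanged-sibling-key\n" + key.encode("ascii") + b"\n"
        key, is_new = store.add(node)
        new_pages += int(is_new)
    return key, new_pages


store = Store()
old_root, initial_pages = write_path(store, 8, b"file-id-1 => entry v1\n")
new_root, pages_for_change = write_path(store, 8, b"file-id-1 => entry v2\n")
same_root, pages_for_noop = write_path(store, 8, b"file-id-1 => entry v2\n")

print(pages_for_change)  # one new page per level of the path
print(pages_for_noop)    # identical content maps to existing keys
```

Shallower trees shorten that path, which is why anything that reduces
depth also reduces the number of pages written per commit.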


Also, I did a quick hack to try to enable knit-delta compression. It
relies on the fact that we generally use "apply-delta", which means we
start from an existing key and modify its page to generate a new key.
Anyway, with the quick hack in place the "Compressed" size of the
inventory is cut in half. (1.8MB => 1.0MB.) If I hacked it a bit more to
make the actual inventory records multi-line, then I see another ~10%
reduction (948kB).

If that stayed consistent, it would drop the bzr.dev inventory pages
from 90MB to 45MB. That would change the inventory share of the
compressed total from 65% to 48.8%, which is actually very close to the
current pack-0.92 share of 45%. (It would be 45MB compressed, rather
than 39MB compressed.)
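
The 48.8% figure can be reproduced from the "Compressed" columns above.
This is a sketch of the arithmetic only; it assumes the halving seen in
the quick hack carries over unchanged to the full bzr.dev conversion,
which has not been measured.

```python
# "Compressed" column of the two tables above, in KiB.
pack_092 = {"revisions": 8170, "inventories": 38976,
            "texts": 35677, "signatures": 3205}
d4 = {"revisions": 8170, "inventories": 89580,
      "texts": 35677, "signatures": 3205}


def inventory_share(sizes):
    """Fraction of the compressed repository taken up by inventories."""
    return sizes["inventories"] / sum(sizes.values())


# Assumption: knit-delta compression halves the d4 inventory pages.
d4_with_deltas = dict(d4, inventories=d4["inventories"] / 2)

pack_share = inventory_share(pack_092)
d4_share = inventory_share(d4)
projected_share = inventory_share(d4_with_deltas)

print(f"pack-0.92 inventories:  {pack_share:.1%}")
print(f"d4 inventories today:   {d4_share:.1%}")
print(f"d4 with halved pages:   {projected_share:.1%}")
```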

I think we could still do better. I have the feeling we would get there
if we add in some of:

a) Hashed prefixes (giving us denser InternalNodes, and much shallower
trees)
b) A layout that splits the inventory records across multiple "lines",
to allow knit deltas to compress away the bits that don't change much
(like file_id and parent_id).
c) A delta compressor that can compress less-than-line (xdelta)
d) Multi-parent compression; this could even be mixed with (c)
e) Common-prefix extraction (this is a big win for the
parent_id,basename => file_id map, I'm not as sure about the
file_id=>entry maps).
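
As a rough illustration of (b), here is how much new text a line-based
delta has to carry for the same change under two layouts. The record
fields and their order are invented for the example and are not the real
inventory serialization, and difflib stands in for the knit delta code.

```python
import difflib


def delta_payload(old_lines, new_lines):
    """Bytes of new text a line-based delta must carry to turn old into new."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    total = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            total += sum(len(line) for line in new_lines[j1:j2])
    return total


# Hypothetical inventory record: only the revision and sha1 change.
fields_v1 = ["file-id-1234", "parent-dir-id", "README.txt", "rev-100", "sha1-aaaa"]
fields_v2 = ["file-id-1234", "parent-dir-id", "README.txt", "rev-101", "sha1-bbbb"]

# Layout 1: the whole record on a single line.
single_v1 = [" ".join(fields_v1) + "\n"]
single_v2 = [" ".join(fields_v2) + "\n"]

# Layout 2: one field per line.
multi_v1 = [field + "\n" for field in fields_v1]
multi_v2 = [field + "\n" for field in fields_v2]

single_cost = delta_payload(single_v1, single_v2)
multi_cost = delta_payload(multi_v1, multi_v2)

print(single_cost)  # the whole record is re-sent
print(multi_cost)   # only the two changed lines are re-sent
```

With the single-line layout any change re-sends file_id and parent_id
too, which is the part a sub-line compressor as in (c) would also
recover.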

John
=:->


