Current details for split-inventory work
John Arbash Meinel
john at arbash-meinel.com
Wed Dec 3 02:39:44 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Now that I've done some updates to the CHKMap.map/unmap functions to
help ensure canonical form, I have been able to complete a conversion of
bzr.dev into --development4 format. It took around 40min. I didn't
profile it, but I would guess most of the time was spent deserializing
each Inventory from the source.
Anyway, I thought some people might be interested in the results.
$ wbzr repository-details bzr.dev/
Commits: 20916
                      Raw     %  Compressed     %  Objects
Revisions:      11298 KiB    0%    8170 KiB    9%    20916
Inventories:  3097450 KiB   61%   38976 KiB   45%    20916
Texts:        1937574 KiB   38%   35677 KiB   41%    44248
Signatures:      3401 KiB    0%    3205 KiB    3%     9031
Total:        5049725 KiB  100%   86029 KiB  100%    95111
$ time wbzr repository-details d4
Commits: 20916
                      Raw     %  Compressed     %  Objects
Revisions:      11300 KiB    0%    8170 KiB    5%    20916
Inventories:   159899 KiB    7%   89580 KiB   65%   170075
Texts:        1937574 KiB   91%   35677 KiB   26%    44248
Signatures:      3401 KiB    0%    3205 KiB    2%     9031
Total:        2112176 KiB  100%  136633 KiB  100%   244270
Extra Info:           count    total  avg  stddev  min  max
internal node refs    77927  1240750   15    13.2    2   35
internal p_id refs     4743    51586   10     9.1    2   23
inv depth             64955   197639    3     1.4    1    8
leaf node items       64955   376262    5     5.3    1   18
leaf p_id items        1534    17771   11    11.0    1   44
p_id depth             1534     5595    3     2.6    1    7
So the raw size of the chk inventory is very impressive, dropping from
3GB down to 160MB.
Some other bits to extract from this info:
1) On average we have 2.1 files changed for every revision. (44k texts,
21k revisions.)
2) These 2.1 changed files trigger an average of 8.1 inventory pages to
change. (1 is the inv info, the other 7.1 are chk pages.)
This is also borne out by the average inv depth of 3 and the p_id depth
of 3. So the changes trigger 3 file_id=>entry pages to be updated, and 3
p_id,basename => file_id pages to be updated.
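As a quick check, the averages in (1) and (2) can be reproduced from the object counts in the d4 output above. This is only a sketch of the arithmetic; the variable names are illustrative and it assumes exactly one "inv info" record per revision, as described in (2).

```python
# Counts copied from the d4 repository-details output above.
revisions = 20916
text_objects = 44248
inventory_objects = 170075

# Average number of new texts (changed files) per revision.
texts_per_revision = text_objects / revisions

# Average number of inventory objects written per revision.
inv_pages_per_revision = inventory_objects / revisions

# Assumption: one of those objects is the inv info record, the rest are chk pages.
chk_pages_per_revision = inv_pages_per_revision - 1

print(f"texts per revision:     {texts_per_revision:.1f}")
print(f"inv pages per revision: {inv_pages_per_revision:.1f}")
print(f"chk pages per revision: {chk_pages_per_revision:.1f}")
```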
I'm not 100% sure about those details, though, as I haven't probed
deeply. I'll note that we do get all the way to a depth of 8, which
means a single change to that leaf must create 8 new pages.
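To illustrate why depth translates directly into new pages, here is a minimal sketch of a content-addressed path. It is not the CHKMap code: the page format, the `page_key` and `build_path` helpers, and the single-child internal nodes are all simplifying assumptions. The only property it relies on is that a page's key is a hash of its content, and that a parent's content includes its child's key.

```python
import hashlib


def page_key(content: bytes) -> str:
    """Key a page by the SHA-1 of its content (hypothetical key format)."""
    return "sha1:" + hashlib.sha1(content).hexdigest()


def build_path(leaf_content: bytes, depth: int, store: dict) -> str:
    """Store a leaf plus (depth - 1) internal pages above it; return the root key.

    Each internal page here just records its single child's key. Real internal
    nodes reference many children, but the effect on one path is the same.
    """
    key = page_key(leaf_content)
    store[key] = leaf_content
    for _ in range(depth - 1):
        internal = b"internal\n" + key.encode("ascii")
        key = page_key(internal)
        store[key] = internal
    return key


store = {}
old_root = build_path(b"file-id => old entry", 8, store)
pages_before = len(store)

# Changing the leaf changes its key, which changes its parent's content,
# which changes the parent's key, and so on up to the root.
new_root = build_path(b"file-id => new entry", 8, store)
new_pages = len(store) - pages_before

print(new_pages)
```

In a wider tree the siblings along the path would be shared between the old and new roots; only the pages on the path from the changed leaf to the root are rewritten.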
Also, I did a quick hack to try to enable knit-delta compression. It
relies on the fact that we generally use "apply-delta", which means we
have an existing key and we modify that page to generate a new key, so
the old page can serve as the delta basis.
Anyway, with the quick hack in place the "Compressed" size of the
inventory is cut almost in half (1.8MB => 1.0MB). If I hacked it a bit
more to make the actual inventory records multi-line, then I see another
~10% reduction (948kB).
If that stayed consistent, it would drop the bzr.dev pages from 90MB
to 45MB. That would change the inventory share from 65% to 48.8%, which
is actually very close to the current pack-0.92 share of 45%. (It would
be 45MB compressed, rather than 39MB compressed.)
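The 48.8% figure follows from halving the compressed inventory size in the d4 table while leaving everything else unchanged. A small sketch of that projection, assuming the halving really does carry over to the full bzr.dev conversion:

```python
# Compressed sizes in KiB, copied from the d4 repository-details output.
inv_compressed = 89580
total_compressed = 136633

# Assumption: the quick hack halves the compressed inventory size at full scale.
inv_projected = inv_compressed / 2
total_projected = total_compressed - inv_compressed + inv_projected

share_before = 100 * inv_compressed / total_compressed
share_after = 100 * inv_projected / total_projected

print(f"inventory share before: {share_before:.1f}%")
print(f"inventory share after:  {share_after:.1f}%")
```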
I think we could still do better, and I have the feeling that we would
get there if we add in some of:
a) Hashed prefixes (giving us denser InternalNodes, and much shallower
trees)
b) A layout that splits the inventory records across multiple "lines",
to allow knit deltas to compress away the bits that don't change much
(like file_id and parent_id).
c) A delta compressor that can compress less-than-line (xdelta).
d) Multi-parent compression; this could even be mixed with (c).
e) Common-prefix extraction (this is a big win for the
parent_id,basename => file_id map, I'm not as sure about the
file_id=>entry maps).
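To make (b) concrete, here is a rough sketch of why a line-based delta benefits from multi-line records. The field values and record layout are made up for illustration and are not the real inventory serialization; `delta_payload` is a hypothetical helper that counts how much new text a line-based delta would have to carry.

```python
import difflib


def delta_payload(old_lines, new_lines):
    """Characters of new text carried by a line-based delta (changed lines only)."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    payload = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            payload += sum(len(line) for line in new_lines[j1:j2])
    return payload


# Hypothetical entry: only the revision and sha1 change between versions.
fields_old = ["file-id-1234", "parent-id-5678", "README", "rev-100", "sha1-aaaa"]
fields_new = ["file-id-1234", "parent-id-5678", "README", "rev-101", "sha1-bbbb"]

# Single-line layout: any change forces the whole record to be re-sent.
single_old = [" ".join(fields_old) + "\n"]
single_new = [" ".join(fields_new) + "\n"]

# Multi-line layout: unchanged fields such as file_id and parent_id match.
multi_old = [field + "\n" for field in fields_old]
multi_new = [field + "\n" for field in fields_new]

single_cost = delta_payload(single_old, single_new)
multi_cost = delta_payload(multi_old, multi_new)

print(single_cost, multi_cost)
```

A sub-line compressor as in (c) would get a similar effect without changing the record layout, which is why the two options overlap.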
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkk18XAACgkQJdeBCYSNAAObUwCdF/8Pka2TTEzze1Exd0PUKfde
ud8AoJAypYlsjGlqY4gDS2LFj3WFzQWK
=/3dD
-----END PGP SIGNATURE-----