Split inventory prefix work
John Arbash Meinel
john at arbash-meinel.com
Mon Dec 22 21:11:40 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I have a branch of the brisbane work, which is able to extract the
common prefix for a page into a separate record. For example, imagine
storing the keys: "aaa", "aab", "aac". The old code would have:
aaa\0value\n
aab\0value\n
aac\0value\n
The prefix work would change that to:
aa\n
a\0value\n
b\0value\n
c\0value\n
In general, this makes the serialized form smaller, which means you can
fit more entries in a 4k page.
In actually implementing it, I found an slightly unexpected result. For
the first 1k revs in mysql, doing this for the Leaf nodes actually
causes the total inventory size to *increase*. Both for 'raw' and for
'compressed' size. Doing it for Internal nodes decreases the raw and a
little bit the compressed size.
My guess is that because we are fitting more entries in a Leaf node,
when one line changes, it has to write the neighbor entries out.
(Consider if you had a tree where all the leaves had a single entry,
then changing an entry would only write a single new line.) I would
expect this result to be a bit different when we introduce delta
compression, as it should remove some of the extra redundancy.
Here are some results:
Raw % Compressed % Objects
No prefix 17905 KiB 1% 9503 KiB 43% 20970
prefix internal 17102 KiB 1% 9392 KiB 43% 20957
prefix both 17258 KiB 1% 9444 KiB 43% 19678
So you can see that when including leaf nodes the number of objects
decreases, but the compressed and raw sizes actually increase.
Comparing the raw details:
Extra Info: count total avg stddev min max
internal node refs 11864 114461 9 8.1 2 27
internal p_id refs 1256 6840 5 5.7 2 24
inv depth 6005 35566 5 3.5 1 9
leaf node items 6005 34646 5 5.2 1 16
leaf p_id items 802 6566 8 9.1 1 33
p_id depth 802 7050 8 4.2 1 14
vs
Extra Info: count total avg stddev min max
internal node refs 10952 111724 10 7.6 2 27
internal p_id refs 1129 6087 5 5.7 2 24
inv depth 5991 34565 5 3.2 1 9
leaf node items 5991 37466 6 5.4 1 16
leaf p_id items 563 7951 14 13.0 1 52
p_id depth 563 4816 8 3.5 1 11
So the big win is in the parent_id,basename => file_id map. Where we go
from 800 leaf nodes to 563 leaf nodes. And a max depth of 14 to a max
depth of 11. (Note the average depth of 8 doesn't change, but we go from
33 nodes on a page up to 52, and avg 8=>14).
The actual "file_id=>entry" nodes don't change very much. 6005 leaf
nodes down to 5991 leaf nodes.
This is about what I expected, because I knew that the p_id map has a
lot more redundancy (the parent_id portion of the key is duplicated
across all of the files in that directory.)
Anyway, I'm not proposing it for merge yet. I think it is overall
beneficial, but it isn't a huge effect.
John
=:->
PS> The branch is at:
http://bzr.arbash-meinel.com/branches/bzr/brisbane/prefix
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAklQAowACgkQJdeBCYSNAAM5ggCdH/h5F5bn+W2agfpFh6cy8noZ
E0QAoL6h6skJJwNWwxTM0pV7HlVfSjt+
=F2Yh
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list