Initial results using 'hash trie'
John Arbash Meinel
john at arbash-meinel.com
Tue Jan 6 20:03:32 GMT 2009
John Arbash Meinel wrote:
> ...
>
>> Next I might try just bumping up the max leaf node size, and see what
>> happens. It will probably cause the raw size to go up, but I would guess
>> the compressed size wouldn't change much.
>
>> John
>> =:->
>
> This seems pretty promising, in my "1k revs" test.
>
> Commits: 1043
>                    Raw    %   Compressed    %   Objects
> Revisions:    3990 KiB   0%     1228 KiB   7%      1043
> Inventories: 46174 KiB   4%     3425 KiB  21%     11539
> Texts:      882272 KiB  94%    11174 KiB  70%      7226
> Signatures:      0 KiB   0%        0 KiB   0%         0
> Total:      932437 KiB 100%    15828 KiB 100%     19808
>
> Extra Info:           count    total    avg  stddev  min  max
>  internal node refs    4253    59712   14.0     3.4    8   16
>  internal p_id refs     236     2279    9.7     2.4    8   14
>  inv depth             5747    17239    3.0     0.0    1    3
>  leaf node items       5747   123550   21.5     4.5    4   39
>  leaf p_id items        260    56773  218.4   147.3    1  516
>  p_id depth             260      619    2.4     0.5    1    3
>
>
> The compressed size has gotten *better*, and as expected the 'raw' size
> went up (46k instead of 30k).
>
> Average depth for a 16-way fan out is 3, which seems good. (Previous
> 16-way fan out was closer to 4, which is also expected since 64k page
> holds 16 times more than the 4k page.)
>
> In looking at the index info, the .cix is now about 2 times the size of
> the .tix, but that is favorable when compared with the previous form's
> 5x. Number of inventory objects per commit is around 11, versus 12 with
> the 255-way fan out. So not really better there. I'm not sure how to
> explain that versus the improvement in the .cix. (It may just be too
> early for .cix to be properly measured.)
>
So I now have an updated measurement with 20k revisions of MySQL. It is
better than some of the other permutations, though I'm not sure if it is
"good enough" or not.
The code is available from:
http://bzr.arbash-meinel.com/branches/bzr/brisbane/hack
To go back and give the background (since it has been a while), these
are some of the parameters for this test.
1) I'm using the hex digits of the crc32 of each portion of the key.
So for file-ids, this boils down to '%08X' % (zlib.crc32(key[0]),).
This means that each byte of the hashed key can take 16 possible
values.
a) The other method I've experimented with is to use the raw bytes of
the crc32, with the one caveat that '\n' is remapped to another
character (in my case '_'). This gives 255 possible values for each
byte. (Both variants are sketched in code just after this list.)
2) I'm using a maximum LeafNode size of 64kB. The idea is that after a
split, we end up with 64kB/16 = 4kB nodes, rather than 4kB/16 = 256B
nodes (which really only hold a single entry).
3) When converting, I read the revision trees 100 at a time. I then keep
track of all 100 in a cache (plus 1 more which is the last revision
converted in the previous batch).
For every revision, I then check whether its parents are available in
the cache. If none are, I just use the last revision I converted
(this is the same as the original code that Robert wrote). However,
if one or more parents are in the cached set, I compute the delta
against every available parent, and use whichever delta is
'smallest'. (A sketch of this selection is included just after this
list.) From my testing, about 70% of the time this is the left-hand
parent, and 30% of the time it is the right-hand one. My guess is
that this is because of feature branches landing on 'trunk', with
trunk merged back into the feature branch before submission about
30% of the time. (I'm guessing here, but it seems to fit.)
This has a very high cache hit rate early on in the project, but it
falls off further out. I'm considering something like a rolling
200-revision cache, so you have the "previous 100" and the
"current 100".
In general, using the parent with the shortest delta meant the
conversion was both faster and smaller.
Just as some points of comparison, using the 'last converted
revision' is the best choice around 50% of the time. It is closer to
70% in the beginning, but becomes less and less true as time goes
on. Of the remainder, approx 15% have a single parent that is already
in the 'current' 100-revision cache, 30% have more than one parent
(and 70% of those have the smallest delta against the left-hand
parent), and around 5% have no direct parent in the cache and use
whatever the last converted revision was.
I did test this on the first 1k revisions, and with 2 rotating
100-entry caches, only 1 revision didn't have a parent directly
available. Even further, I get the same results if I rotate 2
50-entry caches. Now, this is early on in the history, where things
are simpler, but it does at least hint that rotating 2 caches gives
a better result. The 2x100 conversion was even measurably faster
(2m2s => 1m53s). Also of interest is that it caught more
"multiparent" nodes, where one of the extra parents would not have
been available otherwise.
4) I'm using a different format for the inventory text. Instead of an
entry being recorded as:
'file_id\0file: file_id\0parent_id\0name\0...'
I changed it to:
'file_id\0file: file_id\n'
'parent_id\n'
'name\n'
'revision\n'
'sha1\n'
'size\n'
'N\n' # executable Y/N
This helps because "revision, sha1, size" are all clustered together,
so in general the deltas only include those 3 lines. It is a bit of a
shame that executable comes at the end rather than right after
'name', but it doesn't affect the deltas, and it kept the parsing
consistent between directories and files. (A small serialization
sketch is included just after this list.)
Anyway, given that a file_id is approx 60 chars, a sha1 is 40 chars,
and a revision is approx 50 chars, the delta is around 100 bytes,
instead of replacing the whole record, which is around 300 bytes
(3 file_ids + name + revision + sha1 is about 280 bytes). That is
before gzip compression.
5) I do a 'has_key()' check on the sha1 before adding the lines to the
versioned file. I believe Robert wanted to avoid that, but it had 2
major benefits:
a) Dramatically reduced the 'redundant' data stored on disk that was
then removed during autopack. I mentioned this in other threads,
but as much as 80% of the data written to disk was being thrown
away at autopack time. While we seem to be primarily CPU bound, it
certainly seems like it would help to not write data to disk that
we will never use.
Further, the code in 'add_lines()' was already effectively doing a
'has_key()' check, because it wasn't adding the record to the index
if it already existed. So all I did was move that check up a bit,
and then use "random_id=True" to indicate that I already know the
key isn't present. The only downside is that we now compute the
sha1 of the text 2 times for texts which are not already present.
b) When giving a 'parent' list for these pages, it is very common to
have a different parent, because the basis+delta went through a
different ancestry, even though the final result is the same.
(This was especially true when we would use a purely arbitrary
basis). To work around this, I either need to check ahead of time,
or change the code to catch the error.
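To make (1) and (1a) concrete, here is roughly what the two search-key
mappings look like. This is a sketch of the idea, not a copy of the
code in the branch; in particular, joining multi-element keys with
'\0' is just for illustration:

  import struct, zlib

  def _crc32(bit):
      # Mask so we always get a stable unsigned value (zlib.crc32 can
      # return a signed int depending on the Python version).
      return zlib.crc32(bit) & 0xFFFFFFFF

  def search_key_16(key):
      # 16-way variant: 8 hex digits per key element, so each byte of
      # the mapped key has only 16 possible values.
      return '\x00'.join(['%08X' % _crc32(bit) for bit in key])

  def search_key_255(key):
      # 255-way variant: the 4 raw CRC bytes, with '\n' remapped to '_'
      # so the serialized pages stay line-oriented.
      return '\x00'.join([struct.pack('>L', _crc32(bit)).replace('\n', '_')
                          for bit in key])

The "use whichever cached parent gives the smallest delta" selection
from (3) is roughly the following. convert_to_chk_pages() is a
stand-in for whatever actually produces the new pages against a given
basis; it is not a real function:

  def choose_basis(rev, cached_invs, last_converted_id):
      # cached_invs maps revision_id -> already-converted inventory for
      # the previous and current batches (plus the last one converted).
      available = [p for p in rev.parent_ids if p in cached_invs]
      if not available:
          # No parent in the cache: fall back to the last revision we
          # converted, as the original conversion code did.
          available = [last_converted_id]
      best = None
      for parent_id in available:
          delta = convert_to_chk_pages(cached_invs[parent_id], rev)
          if best is None or len(delta) < len(best[1]):
              best = (parent_id, delta)
      return best

And for (4), the value stored under the file_id key looks roughly like
this (the attribute names are just the usual inventory entry fields;
treat it as illustrative rather than the exact serializer):

  def entry_to_bytes(entry):
      # One line per field; 'revision', 'sha1' and 'size' sit next to
      # each other, so a content change only touches those three lines.
      if entry.executable:
          exec_flag = 'Y'
      else:
          exec_flag = 'N'
      return ''.join(['file: %s\n' % entry.file_id,
                      '%s\n' % entry.parent_id,
                      '%s\n' % entry.name,
                      '%s\n' % entry.revision,
                      '%s\n' % entry.text_sha1,
                      '%d\n' % entry.text_size,
                      '%s\n' % exec_flag])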
16-way hash, 64k leaf pages, knit-delta compression, shortest delta parent
Commits: 20043
                    Raw    %   Compressed    %   Objects
Revisions:     41224 KiB   0%    16009 KiB   9%     20043
Inventories: 1369654 KiB  15%    62600 KiB  36%    213062
Texts:       7304002 KiB  83%    91186 KiB  53%     81202
Signatures:        0 KiB   0%        0 KiB   0%         0
Total:       8714880 KiB 100%   169796 KiB 100%    314307
Extra Info:           count      total    avg  stddev  min  max
 internal node refs    84280    1188144   14.1     3.4    8   16
 internal p_id refs     3825      45214   11.8     3.5    4   16
 inv depth            101496     304486    3.0     0.0    1    3
 leaf node items      101496    4244225   41.8    11.2    4   82
 leaf p_id items        3418     619223  181.2   148.1    1  537
 p_id depth             3418       9696    2.8     0.4    1    4
Some comments:
A) On average we have 10.6 objects being written for each commit. One
of those is the inventory record itself, leaving 9.6 pages split
between the p_id map and the file_id=>entry map.
B) On average, we have 5.06 file_id=>entry LeafNodes written per
commit, and 4.2 file_id=>entry InternalNodes written. Other tests
showed around 5 leaf nodes, so I have the strong feeling that the
average number of changes per commit is 5 files, which hints to me
that we pretty much never have the same leaf node containing multiple
changes. (One of the properties of hashing.)
However, the inventory depth is interesting. It is a solid '3', which
means we have a root node, a single layer of internal nodes, and then
the leaf nodes. Since we average only 4.2 internal nodes changed,
that means we *do* have a small number of collisions at the internal
layer. With no collisions, we would expect 6 internal nodes to change
per commit (the root always changes, but all changes collide there,
and the next layer would have 5 internal nodes mapping to the 5 leaf
nodes).
My best guess is that the 'average' is a skewed distribution. If you
have 50 commits that change only 1 leaf node, and then 2 commits that
change 100, you end up with an average of 4.8 leaf nodes changed per
commit. However, the initial 50 commits would average 2 internal
nodes changed, but the 2 with 100 leaves would probably have 17
internal nodes changed. And (2*50 + 17*2)/52 = 2.6 average internal
nodes changed.
While if you had 50 commits that changed 5 leaf nodes each, you would
obviously average 5 leaf nodes, but then you should also average 6
internal nodes.
Given that the actual average internal is 4.2 (versus the 6 we would
see with no skew), I would guess it is only slightly skewed.
C) Leaves are actually pretty full, with an average of 41 items each.
D) The parent_id,basename=>file_id map hardly ever changes. We have a
total of 3825 unique internal nodes and 3418 leaf nodes. That is an
average of only 0.36 total nodes per commit, or only 0.17 leaf pages
changed per commit.
Interestingly, it has an average depth of 2.8, but does hit a depth
of 4. My guess is that there are some directories that are pretty
full, so you have a lot of "basename" entries under one "parent_id"
prefix. At a 64kB page, each item should be around 130 bytes (60-byte
parent_id, 10-byte basename, 60-byte file_id); however, the parent_id
should be common to all records, so it should be pulled out by the
common-prefix code. So each record should only be 70-ish bytes, and
it takes 64kB/70 = 914 entries in the same directory to cause a
split.
As it happens, the average file-id is actually 67.7 bytes, and there
is a single directory with 625 entries, and another with 576 entries.
It doesn't seem like that would have to split, but apparently 600+
entries is enough for one reason or another. It is also possible that
we have a prefix collision on the other 2 bytes, which would put even
more items on that page than just what fits in the directory.
E) How does the hash radix trie compare with a more conventional "by
directory" layout? I don't have numbers (yet) for what things change
over time, but if you just look at a snapshot...
i) You have 370 directories, for an average of 21.8 entries per
directory, with a standard deviation of 62.3 (quite large).
ii) You have 49 directories with a single record, and a maximum of
625 entries in a single directory.
iii) You have an average depth of 3 directories. This is probably
heavily influenced by the 3 largest directories:
(555, 'mysql-test/r')
(576, 'mysql-test/t')
(625, 'storage/falcon')
The next largest is only 304.
I'm attaching a plot of the number of entries per directory. You
can see that there is a very long tail of directories that have
<10 entries, and then a small handful at the end that has a lot
of entries per dir.
iv) I computed a "neighbors" metric. The idea is that if we have a
directory 2 levels deep, with 10 items at each level, then
changing one of the items in the bottom directory writes out a
page with 10 items, and then writes out the parent directory,
which has 10 items, and its parent which has 10 items. So
changing a single record causes 30 records to be re-written.
And then all 10 items at that level cause 30 rewrites each, the 10 up
above cause 20 each, and the 10 at the top trigger 10 each.
So in the simple tree of 9 files, 1 subdir, going 3 deep, you
have 30*10 + 20*10 + 10*10 = 600 "neighbors".
This would be the raw number of entries written if every item
changed 1 time, so it is an approximation of the raw inventory
size. What it doesn't take into account is any sort of
'distribution' of changes (multiple records in the same dir
changing at the same time, or 1 item in a deep directory changing
more often than something in a different dir.)
Anyway, the MySQL tree has 8078 entries, and a total of 2,240,540
"neighbors". So 'on average' changing a single item will cause
277.4 inventory items to be written.
Compare that to a 16-way fan out with 64kB pages. Going by the
above, we average 41 items per leaf, a middle layer that points
to 14 leaf nodes, and then a root node that points to 16 nodes.
So changing 1 item changes 41+14+16 = 71 records, for a total
"neighbors" count of 573,538.
If we pushed it further, you would have a root with 16-way fan
out, a middle 1 with 16-way fan out, a middle 2 with 16-way fan
out, and then a leaf with an average of 2 items (just going off
of 8078 items). That would be 16+16+16+2 = 50 records updated per
change, total of 403,900 neighbors.
Looking at the actual figures, we have 20k revs, and a total of
5,432,369 raw items written between the leaves and internal
nodes. So we have an average of 271 records written per commit.
If we assume 5 files changing per commit, that gives 54.2
items written per file change, which is at least in the ballpark
of our estimated 71 records.
So ignoring collisions and some other details, we would expect the
raw inventory size for a by-directory layout to be about
277/71 = 3.9, or approx 4 times larger than the 16-way hash trie
with 64k leaf pages. (A small helper that computes this "neighbors"
metric is sketched just below.)
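For anyone who wants to redo the "neighbors" number on another tree,
the whole calculation is just this (dir_sizes maps a directory path to
the number of entries directly in it, counting subdirectories as
entries; this is only the back-of-the-envelope metric above, not
anything bzrlib computes):

  def neighbors(dir_sizes):
      # Changing one entry rewrites its own directory and every
      # directory above it, so each entry's cost is the sum of the
      # sizes of the directories along its path.
      total = 0
      for path, num_entries in dir_sizes.items():
          cost = 0
          d = path
          while True:
              cost += dir_sizes[d]
              if not d:
                  break
              if '/' in d:
                  d = d.rsplit('/', 1)[0]
              else:
                  d = ''
          total += num_entries * cost
      return total

  # The 3-deep, 10-entries-per-level example from (iv):
  #   neighbors({'': 10, 'a': 10, 'a/b': 10}) == 600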
As another point of comparison, here is a different conversion, using
the 255-way fan out and 4k leaf nodes, but still doing the minimum
delta, etc. I have more revisions here, so the absolute numbers aren't
comparable, but the percentages should be:
$ wbzr repository-details d4-255-min-delta-mysql/
Commits: 41400
                     Raw    %   Compressed    %   Objects
Revisions:      80734 KiB   0%    32332 KiB   9%     41400
Inventories:   909026 KiB   6%   141926 KiB  40%    505996
Texts:       12137708 KiB  92%   179455 KiB  50%    156286
Signatures:         0 KiB   0%        0 KiB   0%         0
Total:       13127469 KiB 100%   353714 KiB 100%    703682
Extra Info:           count      total    avg  stddev  min  max
 internal node refs   242991   15694658   64.6    86.5   11  255
 internal p_id refs     9747    1141924  117.2    97.5    2  242
 inv depth            200832     591338    2.9     0.2    1    3
 leaf node items      200832     332030    1.7     2.4    1   14
 leaf p_id items       11026      70411    6.4    10.2    1   58
 p_id depth            11026      35817    3.2     0.8    1    4
1) Notice that the raw inventory size is much smaller (I would guess
primarily because of the 4k leaves, which only hold an average of 1.7
items per leaf). However, the compressed size is not significantly
better: 40% of the total versus 36%. Then again, this may just be how
things go over time; the 64k pages were all the way down to 21% with
only 1k revisions, so they could certainly bloat up to 40% going from
20k revs to 40k revs.
2) The average depth is pretty much the same, but there are far more
internal entries written. The average number of leaf nodes written
per commit is also similar (5.06 versus 4.85).
If we assume 1 full root node, semi-full middle nodes, and then the
leaf nodes, then inverting the average we should have:
(1*255 + 4.8*N) / 5.8 = 64.6
which gives N of about 25, i.e. each middle node fans out to roughly
25 leaf nodes.
Looking at the 'neighbors' calculation, there should be 255 + 25 + 2
= 282 items written for a single changed entry, or 255 + 5*(25+2) =
390 items written for an average of 5 changed items.
If you look, there are 15.7M internal refs and 332k leaf items, so
16M/41.4k = 387 items written per commit, which is very close.
The 16-way should have 16 + 5*(16+41) = 301, and in actuality it has
271 per commit.
The other factor, of course, is that an item in an internal node is
about 50 bytes, while an item in the leaf node is about 300 bytes.
Now if we change to a 16-way fan out and 4k leaf pages, we should end
up with a depth of 4. So you would have a full root, a full middle-1
layer, a full middle-2 layer, and leaf pages with 2 items each.
That would give 16 + 5*(16 + 16 + 2) = 186 items per commit. (The
small helper sketched below captures this arithmetic.)
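Since I keep redoing that arithmetic, here it is in one place. This is
purely the back-of-the-envelope model from above, nothing measured:

  def records_per_commit(root_refs, middle_refs, leaf_items, changes=5):
      # The root is rewritten once per commit; each changed item then
      # rewrites one node per middle layer plus one leaf.
      return root_refs + changes * (sum(middle_refs) + leaf_items)

  # 16-way, 64k leaves:       records_per_commit(16, [16], 41)     -> 301
  # 255-way, 4k leaves:       records_per_commit(255, [25], 2)     -> 390
  # 16-way, 4k leaves (est.): records_per_commit(16, [16, 16], 2)  -> 186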
Anyway, I'll try a 16-way 4k page again, just to get some good data on it.
John
=:->
[Attachment: num_entries_per_dir.png (3870 bytes) - plot of the number
of entries per directory:
https://lists.ubuntu.com/archives/bazaar/attachments/20090106/e0ce4a57/attachment.png]