Initial results using 'hash trie'

John Arbash Meinel john at arbash-meinel.com
Tue Jan 6 20:03:32 GMT 2009



John Arbash Meinel wrote:
> ...
> 
>> Next I might try just bumping up the max leaf node size, and see what
>> happens. It will probably cause the raw size to go up, but I would guess
>> the compressed size wouldn't change much.
> 
>> John
>> =:->
> 
> This seems pretty promising, in my "1k revs" test.
> 
> Commits: 1043
>                       Raw    %    Compressed    %  Objects
> Revisions:       3990 KiB   0%      1228 KiB   7%     1043
> Inventories:    46174 KiB   4%      3425 KiB  21%    11539
> Texts:         882272 KiB  94%     11174 KiB  70%     7226
> Signatures:         0 KiB   0%         0 KiB   0%        0
> Total:         932437 KiB 100%     15828 KiB 100%    19808
> 
> Extra Info:           count    total    avg stddev  min  max
> internal node refs     4253    59712   14.0    3.4    8   16
> internal p_id refs      236     2279    9.7    2.4    8   14
> inv depth              5747    17239    3.0    0.0    1    3
> leaf node items        5747   123550   21.5    4.5    4   39
> leaf p_id items         260    56773  218.4  147.3    1  516
> p_id depth              260      619    2.4    0.5    1    3
> 
> 
> The compressed size has gotten *better*, and as expected the 'raw' size
> went up (46k instead of 30k).
> 
> Average depth for a 16-way fan out is 3, which seems good. (Previous
> 16-way fan out was closer to 4, which is also expected since 64k page
> holds 16 times more than the 4k page.)
> 
> In looking at the index info, the .cix is now about 2 times the size of
> the .tix, but that is favorable when compared with the previous form's
> 5x. Number of inventory objects per commit is around 11, versus 12 with
> the 255-way fan out. So not really better there. I'm not sure how to
> explain that versus the improvement in the .cix. (It may just be too
> early for .cix to be properly measured.)
> 

So I now have an updated measurement with 20k revisions of MySQL. It is
better than some of the other permutations, though I'm not sure if it is
"good enough" or not.

The code is available from:
  http://bzr.arbash-meinel.com/branches/bzr/brisbane/hack

To go back and give the background (since it has been a while), these
are some of the parameters for this test.

1) I'm using the hex digits of the crc32 of the relevant portion of the
   key. So for file-ids, this boils down to '%08X' % (zlib.crc32(key[0]),)
   This means that each byte of the hashed key can have 16 possible
   values.

   a) The other method I've experimented with is just to use the raw
      bytes of the crc32, with the one caveat that '\n' is remapped to
      another character (in my case '_'). This gives 255 possible values
      for each byte.
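
   To make this concrete, here is a rough sketch of the two mappings, in
   the Python 2 style of the codebase of the time. The function names and
   the 0xFFFFFFFF mask (added to keep the crc32 value unsigned) are mine,
   not the actual bzrlib helpers.

     import struct
     import zlib

     def search_key_16(key):
         # 16-way variant: the 8 uppercase hex digits of crc32(file_id).
         return '%08X' % (zlib.crc32(key[0]) & 0xFFFFFFFF,)

     def search_key_255(key):
         # 255-way variant: the raw 4 bytes of crc32(file_id), with '\n'
         # remapped to '_' so the serialized pages stay line-based.
         raw = struct.pack('>L', zlib.crc32(key[0]) & 0xFFFFFFFF)
         return raw.replace('\n', '_')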

2) I'm using a maximum LeafNode size of 64kB. The idea is that after a
   split, we end up with 64kB/16 = 4kB nodes, rather than 4kB/16 = 256B
   nodes (which really only hold a single entry).

3) When converting, I read the revision trees 100 at a time. I then keep
   track of all 100 in a cache (plus 1 more which is the last revision
   converted in the previous batch).

   For every revision, I then check whether its parents are available in
   the cache; if none are, I just use the last revision I converted
   (this is the same as the original code that Robert wrote). However,
   if one or more parents are in the cached set, I compute the delta
   against all available parents, and use whichever delta is 'smallest'
   (a sketch of this selection follows this point). From my testing,
   about 70% of the time this is the left-hand parent, and 30% of the
   time it is the right-hand one. My guess is this is because of feature
   branches landing on 'trunk', and then 30% of the time trunk is merged
   into a feature branch before it is submitted. (I'm guessing here, but
   it seems to fit.)

   This gives a very high cache hit rate early on in the project, but
   seems to be lacking further along. I'm considering something like a
   rolling cache of 200 revisions, so you have the "previous 100" and
   the "current 100".

   In general, using the parent with the shortest delta meant the
   conversion was both faster and smaller.

   Just as some points of comparison, using the 'last converted
   revision' is the best choice around 50% of the time. It is closer to
   70% in the beginning, but becomes less and less true as time goes
   on. Of the remainder, approx 15% have a single parent that is already
   in the 'current' 100-revision cache, 30% have more than one parent
   (and 70% of those have the smallest delta against the left-hand
   parent), and around 5% have no direct parent in the cache and use
   whatever the last converted revision was.

   I did test this on the first 1k revisions, and with 2 rotating
   100-entry caches only 1 revision didn't have a parent directly
   available. Even further, I get the same results if I rotate 2
   50-entry caches. Now, this is early in the history, where things are
   simpler, but it does at least hint that rotating 2 caches gives a
   better result. The 2x100 was even measurably faster at the conversion
   (2m2s => 1m53s). Also of interest is that it caught more
   "multi-parent" nodes, where one of the extra parents would not have
   been available otherwise.
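
   To make the selection policy concrete, here is a minimal sketch.
   choose_basis(), delta_size() and the cache layout are illustrative
   stand-ins, not the real converter code.

     def choose_basis(parent_ids, cache, last_converted, delta_size):
         # parent_ids:     parents of the revision being converted
         # cache:          revision_id -> already-converted inventory
         # last_converted: (revision_id, inventory) from the previous step
         # delta_size:     callable(inventory) -> cost of the delta
         available = [(pid, cache[pid]) for pid in parent_ids
                      if pid in cache]
         if not available:
             # No direct parent cached (~5% of revisions): fall back to
             # whatever revision was converted last.
             return last_converted
         # Otherwise delta against every cached parent and keep the
         # smallest (~70% of the time that is the left-hand parent).
         return min(available, key=lambda entry: delta_size(entry[1]))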

4) I'm using a different format for the inventory text. Instead of an
   entry being recorded as:
   'file_id\0file: file_id\0parent_id\0name\0...'

   I changed it to:
   'file_id\0file: file_id\n'
   'parent_id\n'
   'name\n'
   'revision\n'
   'sha1\n'
   'size\n'
   'N\n' # executable Y/N

   This helps, because "revision, sha1, size" are all clustered
   together, so in general the deltas only include those 3 lines. It is
   a bit of a shame that executable is at the end, rather than after
   'name', but it doesn't affect the deltas, and it made the parsing
   more consistent between directories and files.

   Anyway, given that a file_id is approx 60 chars, a sha1 is 40 chars,
   and a revision is approx 50 chars, this means the delta is around
   100 bytes, instead of replacing the whole record, which is around
   300 bytes (3 file_ids, name, revision, and sha1 come to about 280
   bytes). That is before gzip compression.
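
   As a rough sketch of that layout (the function is hypothetical, not
   the real serializer):

     def serialize_file_entry(file_id, parent_id, name, revision, sha1,
                              size, executable):
         # One field per line; 'revision', 'sha1' and 'size' end up on
         # adjacent lines, so a typical content change only touches
         # those three lines of the record.
         lines = [
             '%s\x00file: %s' % (file_id, file_id),
             parent_id,
             name,
             revision,
             sha1,
             str(size),
             'Y' if executable else 'N',   # executable flag comes last
         ]
         return '\n'.join(lines) + '\n'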

5) I do a 'has_key()' check on the sha1 before adding the lines to the
   versioned file. I believe Robert wanted to avoid that, but it had 2
   major benefits:

   a) Dramatically reduced the 'redundant' data stored on disk that was
      then removed during autopack. I mentioned this in other threads,
      but as much as 80% of the data written to disk was being thrown
      away at autopack time. While we seem to be primarily CPU bound, it
      certainly seems like it would help to not write data to disk that
      we will never use.
      Further, the code in 'add_lines()' was also effectively doing a
      'has_key()' check, because it wasn't adding the text to the index
      if it already existed. So all I did was move that check up a bit,
      and then use "random_id=True" to indicate that I already know the
      key isn't present. The only downside is that we compute the sha1
      of the text 2 times for texts which are not already present.

   b) When giving a 'parent' list for these pages, it is very common to
      end up with a different parent, because the basis+delta went
      through a different ancestry, even though the final result is the
      same. (This was especially true when we would use a purely
      arbitrary basis.) To work around this, I either need to check
      ahead of time, or change the code to catch the error.
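
   As a sketch of the check in (a), assuming a VersionedFiles-like
   object with get_parent_map() and add_lines() (roughly the shape of
   the bzrlib API; the wrapper itself is hypothetical):

     def add_text_if_missing(vf, key, parents, lines):
         # Skip texts that are already stored, so we do not write data
         # that autopack would only throw away again.
         if key in vf.get_parent_map([key]):
             return False
         # random_id=True says we already know the key is absent; the
         # cost is hashing a genuinely new text a second time.
         vf.add_lines(key, parents, lines, random_id=True)
         return True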



16-way hash, 64k leaf pages, knit-delta compression, shortest delta parent
Commits: 20043
                      Raw    %    Compressed    %  Objects
Revisions:      41224 KiB   0%     16009 KiB   9%    20043
Inventories:  1369654 KiB  15%     62600 KiB  36%   213062
Texts:        7304002 KiB  83%     91186 KiB  53%    81202
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        8714880 KiB 100%    169796 KiB 100%   314307

Extra Info:           count    total    avg stddev  min  max
internal node refs    84280  1188144   14.1    3.4    8   16
internal p_id refs     3825    45214   11.8    3.5    4   16
inv depth            101496   304486    3.0    0.0    1    3
leaf node items      101496  4244225   41.8   11.2    4   82
leaf p_id items        3418   619223  181.2  148.1    1  537
p_id depth             3418     9696    2.8    0.4    1    4


Some comments:

A) On average we have 10.6 objects being written for each commit. 1 is
   the inventory record, leaving 9.6 pages between the p_id map and the
   file_id=>entry map.

B) On average, we have 5.06 file_id=>entry LeafNodes written per
   commit, and 4.2 file_id=>entry InternalNodes written. Other tests
   showed around 5 leaf nodes, so I have the strong feeling that the
   average number of changes per commit is 5 files, which hints to me
   that we pretty much never have the same leaf node contain multiple
   changes (one of the properties of hashing).

   However, the inventory depth is interesting. It is a solid '3', which
   means we have a root node, a single layer of internal nodes, and then
   the leaf nodes. Since we average only 4.2 internal nodes changed,
   that means we *do* have a small number of collisions at the internal
   layer. With no collisions, we would expect 6 internal nodes to change
   per commit (the root always changes, but only counts once, and the
   next layer would have 5 internal nodes mapping to the 5 leaf nodes).

   My best guess is that the 'average' is a skewed distribution. If you
   have 50 commits that change only 1 leaf node, and then 2 commits that
   change 100, you end up with an average of 4.8 leaf nodes changed per
   commit. However, the initial 50 commits would average 2 internal
   nodes changed, but the 2 with 100 leaves would probably have 17
   internal nodes changed. And (2*50 + 17*2)/52 = 2.6 average internal
   nodes changed.

   While if you had 50 commits that changed 5 leaf nodes each, you would
   obviously average 5 leaf nodes, but then you should also average 6
   internal nodes.

   Given that the measured internal average is 4.2, I would guess it is
   only slightly skewed.
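
   Spelling out the arithmetic of that skew example:

     # Skewed: 50 commits touching 1 leaf, 2 commits touching 100 leaves.
     leaf_avg = (50 * 1 + 2 * 100) / 52.0      # ~4.8 leaves per commit
     internal_avg = (50 * 2 + 2 * 17) / 52.0   # ~2.6 internals per commit

     # Uniform: 50 commits touching 5 leaves each.
     uniform_leaf_avg = 5.0         # 5 leaf nodes per commit
     uniform_internal_avg = 6.0     # the root plus 5 internal nodes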

C) Leaves are actually pretty full, with an average of 41 items each.

D) For the parent_id,basename=>file_id map, it hardly ever changes. We
   have a total of 3825 unique internal nodes and 3418 leaf nodes. That
   is an average of only 0.36 total nodes per commit, or only 0.17 leaf
   pages that change per commit.
   Interestingly, it has an average depth of 2.8, but does hit a depth
   of 4. My guess is that there are some directories that are pretty
   full, so you have a lot of "basename" entries for a given "parent_id"
   prefix. At a 64kB page, each item should be around 130 bytes (60-byte
   parent_id, 10-byte basename, 60-byte file_id); however, the parent_id
   should be common to all records, so it should be pulled out by the
   common-prefix code, leaving each record at only 70-ish bytes. It then
   takes 64kB/70 = 914 entries in the same directory to cause a split
   (spelled out in the sketch after this point).
   It does happen that the average file-id is actually 67.7 bytes, and
   there is a single directory with 625 entries, and another with 576
   entries. It doesn't seem like they would have to split, but it seems
   that 600+ entries is enough for some reason or another. It would also
   be possible for us to have a prefix collision on the other 2 bytes,
   which would put even more items on that page than just what fits in
   the directory.
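
   The split-threshold arithmetic, under the size assumptions above:

     page_size = 64 * 1000          # 64kB leaf page, as used above
     raw_record = 60 + 10 + 60      # parent_id + basename + file_id
     stripped_record = 70           # parent_id shared via common prefix
     entries_to_split = page_size // stripped_record   # ~914 entries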

E) How does the hash radix trie compare with a more conventional "by
   directory" layout? I don't have numbers (yet) for how things change
   over time, but if you just look at a snapshot...

   i) You have 370 directories, for an average of 21.8 entries per
      directory, with a standard deviation of 62.3 (quite large).
   ii) You have 49 directories with a single record, and a maximum of
       625 entries in a single directory.
   iii) You have an average depth of 3 directories. This is probably
        heavily influenced by the 3 largest directories:
          (555, 'mysql-test/r')
          (576, 'mysql-test/t')
          (625, 'storage/falcon')
        The next largest is only 304.
        I'm attaching a plot of the number of entries per directory. You
        can see that there is a very long tail of directories that have
        <10 entries, and then a small handful at the end that have a lot
        of entries per dir.
   iv) I computed a "neighbors" metric (a small computation of it is
       sketched after this list). The idea is that if we have a
       directory 2 levels deep, with 10 items at each level, then
       changing one of the items in the bottom directory writes out a
       page with 10 items, and then writes out the parent directory,
       which has 10 items, and its parent, which has 10 items. So
       changing a single record causes 30 records to be re-written.
       All 10 items at the bottom level then cause 30 rewrites each, the
       10 one level up cause 20 each, and the 10 at the top trigger 10
       each. So in the simple tree of 9 files, 1 subdir, going 3 deep,
       you have 30*10 + 20*10 + 10*10 = 600 "neighbors".
        This would be the raw number of entries written if every item
        changed 1 time, so it is an approximation of the raw inventory
        size. What it doesn't take into account is any sort of
        'distribution' of changes (multiple records in the same dir
        changing at the same time, or 1 item in a deep directory changing
        more often than something in a different dir).

       Anyway, the MySQL tree has 8078 entries, and a total of 2,240,540
       "neighbors". So 'on average' changing a single item will cause
       277.4 inventory items to be written.

        Compare that to a 16-way fan out with 64kB pages. Going by the
        above, we average 41 items per leaf, a middle layer that points
        to 14 leaf nodes, and then a root node that points to 16 nodes.
        So changing 1 item changes 41+14+16 = 71 records, for a total
        "neighbors" count of 573,538.

       If we pushed it further, you would have a root with 16-way fan
       out, a middle 1 with 16-way fan out, a middle 2 with 16-way fan
       out, and then a leaf with an average of 2 items (just going off
       of 8078 items). That would be 16+16+16+2 = 50 records updated per
       change, total of 403,900 neighbors.

        Looking at the actual figures, we have 20k revs, and a total of
        5,432,369 raw items written between the leaves and internal
        nodes. So we have an average of 271 records written per commit.
        If we assume 5 files changing per commit, that gives 54.2 items
        written per file change, which is at least in the ballpark of
        our '71' records.

       So ignoring collisions and some other stuff, we would expect the
       raw-inventory size for a by-directory layout to be about
       277/71 = 3.9, or approx 4 times larger than the 16-way hash trie
       with 64k leaf pages.
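
   Here is a small, self-contained computation of the "neighbors" metric
   from (iv), run on the toy 3-level tree from the example; the
   dict-based representation is just for illustration.

     def total_neighbors(dir_entries, parent_of):
         # dir_entries: directory -> number of entries it contains
         # parent_of:   directory -> its parent directory (or None)
         # Changing one entry rewrites its own directory page plus every
         # ancestor directory page, so each entry costs the sum of those
         # page sizes.
         total = 0
         for d, count in dir_entries.items():
             cost = 0
             node = d
             while node is not None:
                 cost += dir_entries[node]
                 node = parent_of[node]
             total += count * cost
         return total

     # Three levels with 10 entries each (9 files plus 1 subdir).
     dirs = {'': 10, 'a': 10, 'a/b': 10}
     parents = {'': None, 'a': '', 'a/b': 'a'}
     print(total_neighbors(dirs, parents))   # 30*10 + 20*10 + 10*10 = 600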




As another point of comparison, here is a different conversion, using
the 255-way fan out and 4k leaf nodes, but still doing the minimum
delta, etc. I have more revisions here, so the absolute numbers aren't
comparable, but the percentages should be:

$ wbzr repository-details d4-255-min-delta-mysql/
Commits: 41400
                      Raw    %    Compressed    %  Objects
Revisions:      80734 KiB   0%     32332 KiB   9%    41400
Inventories:   909026 KiB   6%    141926 KiB  40%   505996
Texts:       12137708 KiB  92%    179455 KiB  50%   156286
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:       13127469 KiB 100%    353714 KiB 100%   703682

Extra Info:           count    total    avg stddev  min  max
internal node refs   242991 15694658   64.6   86.5   11  255
internal p_id refs     9747  1141924  117.2   97.5    2  242
inv depth            200832   591338    2.9    0.2    1    3
leaf node items      200832   332030    1.7    2.4    1   14
leaf p_id items       11026    70411    6.4   10.2    1   58
p_id depth            11026    35817    3.2    0.8    1    4


1) Notice that the raw inventory size is much smaller (I would guess
   primarily from the 4k leaves, which only hold an average of 1.7 items
   per leaf). However, the compressed size is not significantly better:
   40% of the total versus 36%. This may just be how things go over
   time. The 64k pages were all the way down to 21% with only 1k
   revisions, so they certainly could bloat up to 40% going from 20k
   revs to 40k revs.

2) The average depth is pretty much the same, but there are far more
   internal entries written. The average number of leaf nodes written is
   pretty much the same (5.06 versus 4.85).
   If we assume 1 full root node, then semi-full middle nodes, and then
   leaf nodes, inverting the average gives:
     (1*255 + 4.8*N) / 5.8 = 64.6
   so each middle node should fan out to about N = 25 leaf nodes.
   Looking at the 'neighbors' calculation, there should be 255 + 25 + 2
   = 282 items written for a single changed entry, or 255 + 5*(25+2) =
   390 items written for an average of 5 changed items.
   If you look, there are 15.7M internal refs and 332k leaf items, so
   16M/41.4k = 387 items written per commit, which is very close.

   The 16-way should have 16 + 5*(16+41) = 301, and in actuality it has
   271 per commit.
   The other factor, of course, is that an item in an internal node is
   about 50 bytes, while an item in a leaf node is about 300 bytes.

   Now if we change to 16-way fan out and 4k leaf pages, we should end
   up with a depth of 4. So you would have a full root, a full middle 1
   layer, a full middle 2 layer, and leaf pages with 2 items each.
   That would give 16 + 5*(16 + 16 + 2) = 186 items per commit.
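
   Collecting the back-of-the-envelope estimates from this point in one
   place:

     # 255-way, 4k leaves: full root (255 refs), ~25-ref middle nodes,
     # ~2-item leaves, assuming 5 changed files per commit.
     est_255_4k = 255 + 5 * (25 + 2)       # = 390; measured ~387

     # 16-way, 64k leaves: 16-ref root, ~16-ref middle, ~41-item leaves.
     est_16_64k = 16 + 5 * (16 + 41)       # = 301; measured ~271

     # 16-way, 4k leaves (depth 4): root + two middle layers + ~2-item
     # leaves.  Predicted only; not yet measured.
     est_16_4k = 16 + 5 * (16 + 16 + 2)    # = 186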


Anyway, I'll try a 16-way 4k page again, just to get some good data on it.

John
=:->

[Attachment: num_entries_per_dir.png (image/png, 3870 bytes) --
 https://lists.ubuntu.com/archives/bazaar/attachments/20090106/e0ce4a57/attachment.png]