[MERGE] Support empty keys when looking for common prefixes in CHKMap.

John Arbash Meinel john at arbash-meinel.com
Thu May 14 03:50:20 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jelmer Vernooij wrote:
> On Wed, May 13, 2009 at 05:05:32PM -0500, John Arbash Meinel wrote:
>> Jelmer Vernooij wrote:
>>> Trivial issue found by bzr-svn's file id map implementation based on
>>> CHKMap, which uses the entry ('',) that would otherwise cause "local
>>> variable pos not set" exceptions.
>> BB:approve
> 
>> Out of curiosity, are you using a hash function for the search key?
> 
> Not that I'm aware of.. I'm just using 1-tuples with the path of each
> file/directory. How would a hash function help exactly? More efficient
> in terms of disk space, better performance? I still need to iterate
> over all keys in a CHKMap, so I would have to store them elsewhere I
> guess if I used a hash function.
> 
> Cheers,
> 
> Jelmer

The CHK code can do that for you; just pass "search_key_func=XXX" when
you create the CHKMap.
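For illustration, a crc32-based search key function might look like this
(a hypothetical sketch; bzrlib ships its own helpers, whose exact output
format differs):

```python
import zlib

def crc32_search_key(key):
    """Sketch of the kind of function you could pass as search_key_func.

    Hash each element of the key tuple with crc32 so the trie fans out
    evenly.  Not bzrlib code, just the idea.
    """
    return '\x00'.join(
        '%08X' % (zlib.crc32(e.encode('utf-8')) & 0xFFFFFFFF) for e in key)

print(crc32_search_key(('path/to/foo',)))  # a deterministic 8-hex-digit key
```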

Basically, it improves the fan-out of CHK pages. Internally, CHK is a
trie, where each fan-out is based on a single byte of difference. So
imagine your keys are:

path/to/bar
path/to/fam
path/to/foo

If it can't fit all of that onto a single page, it splits at the common
prefix, and you get a page of:

path/to/?
 b => path/to/bar
 f => path/to/fam
      path/to/foo

That is, 1 internal node and 2 leaf nodes. Now, part of the problem is
what happens when you get:

path/to/fob
path/to/foo
path/to/fom
path/gar

If we assume that you can only fit 2 keys on a given page, this splits
into:

path/?
 t => path/to/fo?
        b => path/to/fob
        m => path/to/fom
        o => path/to/foo
 g => path/gar
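That split-at-common-prefix behaviour can be sketched in a few lines of
Python (a toy model with a hypothetical helper, not bzrlib's actual code):

```python
import os.path

def split_at_common_prefix(keys):
    """Group overflowing keys by the first character after their
    longest common prefix -- a toy model of a CHK trie page split."""
    prefix = os.path.commonprefix(keys)
    buckets = {}
    for k in keys:
        buckets.setdefault(k[len(prefix):len(prefix) + 1], []).append(k)
    return prefix, buckets

prefix, buckets = split_at_common_prefix(
    ['path/to/fob', 'path/to/foo', 'path/to/fom', 'path/gar'])
print(prefix)           # path/
print(sorted(buckets))  # ['g', 't']
```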


Now, it really depends on the data structures, but when we were
analyzing inventories, we had an average 'depth' of about 9. Switching
to a "hashed trie" means that we take

path/to/bar => adef
path/to/fam => xzep
path/to/foo => npta

And we get data that is much closer to statistically uniform (the hash
is crc32, which gives pretty good fan-out). So when consuming a single
byte, you pretty much always get maximum fan-out.
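A quick way to see the effect (illustrative only; assumes crc32 behaves
roughly uniformly on these made-up paths):

```python
import zlib

# 1000 keys that all share a long common prefix
paths = ['path/to/file%d' % i for i in range(1000)]

# raw keys: every key starts with the same byte, so fan-out is 1
raw_first = {p[0] for p in paths}
# hashed keys: the low byte of crc32 is spread across most of 0..255
hashed_first = {zlib.crc32(p.encode()) & 0xFF for p in paths}

print(len(raw_first))     # 1
print(len(hashed_first))  # typically well over 200 distinct values
```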

I think the problem with paths and hash tries is that if you have a
subdirectory with a *lot* of keys, and some aunts-and-uncles, you end up
with *lots* of splits that don't accomplish much, and then a final split
that actually gets something done.

very/deep/common/path/[a-z]
very/deep/common/aunt
very/deep/uncle
very/aunt

Say you can fit 20 items on a page. Having 'a-z' forces a split, because
that is 26 items, but it splits at the first common prefix, and ends up
being

very/?
  a => very/aunt
  d => very/deep/?
        c => very/deep/common/?
              a => very/deep/common/aunt
              p => very/deep/common/path/?
                     a => very/deep/common/path/a
                     b => very/deep/common/path/b
                     ...
        u => very/deep/uncle

Anyway, as I said, the max depth of an inventory was ~9, which quickly
dropped to 2 with a 255-way hash.


The other thing to consider is that a change in a leaf node affects
every "internal" node on the path leading to it.

So in the above case, if 'very/deep/common/path/b' gets a new value, it
has to change that leaf page, and 4 internal nodes.
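Counting the rewrites for that change is just arithmetic on the example
trie above:

```python
# internal nodes on the path down to 'very/deep/common/path/b'
ancestors = ['very/?', 'very/deep/?', 'very/deep/common/?',
             'very/deep/common/path/?']
# one change rewrites the leaf page plus every ancestor internal node
pages_rewritten = 1 + len(ancestors)
print(pages_rewritten)  # 5 (1 leaf + 4 internal nodes)
```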

Without hashing, internal node churn becomes a very real issue.
Depending on how the trie is laid out, and how much leaf churn you have,
this may or may not matter.

Hashing causes your depth to go way down, but your width to go up, and
you tend to lose coherency in your changes. So take a few scenarios:

1) A tree with 10,000 items, average unhashed depth of 5, average number
of changes 5 per revision. Changes are generally in the same directory.
  a) without hashing, changing a leaf node only affects the direct
parent entries in the trie. Because all changes happen in the same dir,
the affected internal nodes 'collide'. So you get <5 new leaf nodes, and
with a depth of 5, you get 5 new internal nodes.
  b) with 255-way hashing, and fitting 200 entries in a leaf (chk255big):
10,000/255 = 39 < 200, so you have a depth-2 tree. You end up with 1
internal node and 255 leaf nodes. On average, you end up with 1 new
internal node and 5 new leaf nodes per revision.

2) A tree with 100,000 items, average depth of 8, average 10 changes per
revision (again assuming dir coherency)
  a) without hashing, < 10 leaves, + 8 internal nodes
  b) with hashing, 100,000/255 = 392 > 200, so you get a depth-3 tree: 1
root node, 255 internal nodes, 255*255 = 65,025 leaf nodes. Hashing means
you rarely get collisions (255:1 against). So you have 10 leaf nodes
change, 10 internal nodes change, and 1 root node change. (21 on
average.)
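The depth numbers in both scenarios fall out of a simple model (a sketch
assuming uniform hashing and full fan-out at every level; the tree_depth
helper is hypothetical, not bzrlib code):

```python
def tree_depth(n_items, fan_out=255, leaf_capacity=200):
    """Estimated depth of a hashed trie, chk255big-style: keep splitting
    by fan_out until the remaining items fit on one leaf page."""
    depth = 1
    n = float(n_items)
    while n > leaf_capacity:
        n /= fan_out
        depth += 1
    return depth

print(tree_depth(10000))   # 2, as in scenario 1
print(tree_depth(100000))  # 3, as in scenario 2
```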

So it is very data dependent, especially if you have changes that are
'coherent'. Also, you need to look at how you access the data. When you
access it, do you:

 1) look up 1 file
 2) look up all files in a directory
 3) look up everything in the tree

If it is (3), hashing or not hashing doesn't matter. If it is (2),
hashing is likely to get you into trouble, as related entries are not
kept together. If it is (1), then hashing doesn't hurt.

Note that if it is (2), you can actually 'cheat' like we did, with
(parent_id, basename) as the key. In your case, you could use
('directory', 'name'), and for the root you would use ('', '').
The 'search_key_func' is applied to each part of a key independently
(so it maps ('xxx', 'xxx') => (hash('xxx'), hash('xxx')), not
hash(('xxx', 'xxx'))).
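To illustrate the per-element hashing (a hypothetical sketch; the
search_key function and paths are made up): entries in the same
directory keep the same first hash element, so they stay together in the
trie:

```python
import zlib

def search_key(key):
    # hash each element of the key tuple independently
    return tuple('%08X' % (zlib.crc32(e.encode()) & 0xFFFFFFFF)
                 for e in key)

a = search_key(('src/lib', 'foo.c'))
b = search_key(('src/lib', 'bar.c'))
# same directory => same first hash element, so the entries sort together
print(a[0] == b[0])  # True
```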


Also note another good feature of this: if the coherency of your data is
such that things change in the same directory, you end up with coherency
in the *hashed* data as well. Which means that

2c) with hash + (dirname, basename), you'll likely get a depth-4 tree.
As a general rule, adding an element to the key adds a level of depth to
the tree, just because the hash of each part is no longer as independent
as a single hash over the whole key.
So you have 1 root node, 255 layer-1 internal nodes, 65,025 layer-2
internal nodes, and a very sparse <16M leaf nodes.

  When 10 items change, <10 leaf nodes will change, <10 layer 2 nodes
will change, <<10 layer 1 nodes will change, 1 root node changes.

The question is how good the coherency is. I'm not sure, but it could
possibly be: 2 leaves change, 2 layer-2 nodes change, 1 layer-1 node
changes, 1 root changes, == 2+2+1+1 = 6 nodes change per revision. Which
is now *less* than the unhashed form.

Anyway, lots that *can* be tweaked. I mostly wanted to point out that if
you aren't passing a "search_key_func" it uses the keys directly in the
internal trie, and often that leads to *lots* of internal nodes getting
churned.

You should be able to tell whether the time is spent in
LeafNode.serialize() or InternalNode.serialize().

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoLhuwACgkQJdeBCYSNAAPPUACfbfbG8EU4a+yfR4DMZZE5UuXJ
/4cAn1RmgWK7hjGz6X06BK2+tRd4Lhc8
=nsW7
-----END PGP SIGNATURE-----



More information about the bazaar mailing list