B+tree discussions

Robert Collins robertc at robertcollins.net
Wed Jul 2 22:59:35 BST 2008


On Wed, 2008-07-02 at 16:40 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I've done some more tweaking. And here are some interesting results:

Cool.

I'm glad you did all this. I was about to start on some similar
things :).

I'll see the code shortly, but:


> 1) I added a value-level cache. So that when we read a leaf node, we
> cache all the keys in that leaf node. This dramatically improved the
> "iter_random_one" tests. Such that btree becomes *faster* than graph.

Do you eject lines from the value cache when the leaf is discarded? If
the answer is (no, the value cache is a lru itself) perhaps the value
cache should be on CombinedGraphIndex, not the individual index?

> With no cache, the 100 node leaf cache is about 50% efficient (it hits
> 50k times, and misses 54k times)

Note that the leaf cache probably wants tuning :). 

> With a key cache of 10,000 entries, it hits 50% of the time (50.4k hits
> vs 55k misses), but causes the leaf node cache to drop to 16k hits
> versus 39k misses.
> With a key cache of 100,000 entries, we hit 99% of the time (104k hits
> versus 1k misses), but the leaf node cache only misses on all 1067 checks.
> With a node cache of 500, 1,000, or 10,000 we hit 99% of the time, but
> are half the speed of a key cache.
> With a node cache of 200, we hit 77k versus 27k misses. It still costs
> us a lot.

I think that that has something to do with the work of finding the node.


> 2) The LRUCache has a bit of overhead. Enough that doing (1) slowed down
> things like search_from_revid.

I think it can be tuned further. Its certainly in the critical path.

> 3) The biggest penalty in graph.iter_ancestry() is that we search for
> revision ids in indexes where it is not present. Specifically, we have
> over 60,000 lookups that miss. Bloom only catches about 6,000 (10%) of that.
> Getting rid of the leaf cache in favor of the line cache hurts a lot
> here. Because we get to the leaf node, and have to read it again to
> realize that the key we want isn't there.
> I added the ability to cache that a revision_id isn't present (you add
> _not_present_object into the cache). However, for iter_ancestry() we
> don't actually revisit the same revisions multiple times. So it doesn't
> improve things a lot. It did hit somewhat for get_componets_positions().

I'd rather not cache negative hits - if we promote the line cache to the
combined index object then a negative hit on one index should always be
a ghost or a hit on another and then not hit the low level index again.

> 4) I added more instrumentation, to measure hit rate, miss rate, cached
> node hit rate, etc. It is actually pretty interesting to have, as it
> gives some good info as to what is really going on.

Yay!

> 5) My _multi_bisect_right does help. With no line cache, and only 100
> leaf node (and internal node) caches:
> 
> get_components_positions() 28.480s => 14.574s
> search_from_rev_id() 2.517s => 1.200s
> miss_torture() 25.094s => 10.960s
> 
> It is also important to test with the exact same repository. As I
> commit, the number of packs change, and the performance is impacted.

Yup :)

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080703/30f26461/attachment.pgp 


More information about the bazaar mailing list