Btree index and blooms
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 8 00:24:09 BST 2008
Well, I finally spent enough time to get the code written so that a
whole-index bloom is written out and loaded back. I tried a few ways to make
the blooms cheaper to use, but in the end they are always slower locally than
not having them.
The #1 issue is the time to convert keys into offsets and then check whether
the bits are set in the array. I have the feeling that this is mostly Python
overhead, and if we made blooms a C extension we *might* see a benefit.
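To make that overhead concrete, here is a rough Python 2-era sketch of the
per-key work (the class and method names are made up for illustration; this
is not the code in my branch): every lookup pays for a sha1 of the key, a
struct.unpack, and a handful of index/mask operations, all in pure Python.

    import struct
    from array import array
    from hashlib import sha1

    class GlobalBloomSketch(object):
        """Illustrative sketch only -- not the code in the branch."""

        def __init__(self, num_bytes=256, offsets_per_key=4):
            self._num_bits = num_bytes * 8
            self._offsets_per_key = offsets_per_key
            self._bits = array('B', [0] * num_bytes)

        def _key_offsets(self, key):
            # The hot path: one sha1 over the key plus a struct.unpack,
            # for every key looked up.
            digest = sha1('\x00'.join(key).encode('utf-8')).digest()
            words = struct.unpack('>5L', digest)
            return [w % self._num_bits for w in words[:self._offsets_per_key]]

        def add(self, key):
            for offset in self._key_offsets(key):
                self._bits[offset // 8] |= 1 << (offset % 8)

        def __contains__(self, key):
            # Each membership test is several Python-level index and mask
            # operations on top of _key_offsets().
            for offset in self._key_offsets(key):
                if not (self._bits[offset // 8] & (1 << (offset % 8))):
                    return False
            return True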
Things I tried:
1) Write a new function (_iter_sorted_entries) which allows you to sort only
once, at the CombinedIndex level.
Specifically, we know that we can iterate through the btree in sorted order. I
added a bit of logic to allow me to iterate present *and* missing keys in
sorted order. It also allows you to precompute the key => offset map and pass
it into each child index, rather than recomputing it each time. (A rough
sketch of the idea is below, after point c.)
a) the 'miss' test really doesn't benefit from any of this work. For #1, it
still has to recompute everything for each index, because it uses a
different set of keys each time, and I thought it would be cheating a bit
to move *all* of that code out of the profiled function.
And _iter_sorted was returning the missing keys, which means that you
now have to iterate everything, rather than just getting an empty
result.
b) moving the get_raw_offsets() call up did help the 'revision' test. I saw
the time spent drop from 700 ticks => 300 ticks. I suppose it is only ~50%
because you only recompute for nodes that miss the first index.
The big tradeoff, though, is that we now seem to be doing a lot more
"contains" lookups. They went from 38k => 132k calls, and 247 ticks =>
798 ticks. I'm a little confused by this, but basically any time saved by
precomputing seems to be eaten up by *using* the offsets more.
c) The global bloom does 'hit' a lot more than the small row-level ones.
Specifically, I went from 6.7k filtered nodes up to 31.7k filtered
nodes.
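To show what I mean by sharing the precomputed map, here is a rough sketch
(the function name, the child-index method signature, and compute_offsets()
are all made up for illustration; this is not the bzrlib API or the code in
my branch):

    # Illustration only: compute the key => raw offset map once, then let
    # every child index reuse it instead of rehashing the keys itself.
    def combined_iter_sorted_entries(child_indices, keys, compute_offsets):
        keys = sorted(keys)
        raw_offsets = dict((key, compute_offsets(key)) for key in keys)
        remaining = keys
        for index in child_indices:
            if not remaining:
                return
            missing = []
            # A child's _iter_sorted_entries() yields present *and* missing
            # keys in sorted order; missing ones roll over to the next index.
            for key, entry in index._iter_sorted_entries(remaining, raw_offsets):
                if entry is None:
                    missing.append(key)
                else:
                    yield key, entry
            remaining = missing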
2) I can't help but feel I'm missing something important. But at least for the
'miss' test, the sha1 + struct.unpack time costs more than all of the
bisecting and parsing. And Python 2.5 is supposed to have fast libraries
for both (and it maintains Struct objects in a 100-entry cache, so it
shouldn't be struct compile-time overhead). A standalone snippet for
eyeballing those relative costs is below.
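This is just an illustration with made-up keys and counts, not the code I
actually profiled:

    # Micro-benchmark sketch: per-key hashing cost vs a bisect into a
    # sorted key list. Purely illustrative.
    import bisect
    import struct
    import timeit
    from hashlib import sha1

    keys = ['key-%08d' % i for i in range(100000)]
    probe = 'key-00050000'

    def hash_cost():
        # per-key cost with a bloom: sha1 the key, unpack into ints
        return struct.unpack('>5L', sha1(probe.encode('ascii')).digest())

    def bisect_cost():
        # per-key cost without a bloom: bisect into the sorted key list
        return bisect.bisect_left(keys, probe)

    print(timeit.timeit(hash_cost, number=100000))
    print(timeit.timeit(bisect_cost, number=100000))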
3) Even with filtering out 31.7k keys that aren't present, we still ended up
with the same number of _read_nodes() calls. And as _read_nodes() is an
iterator, I think it counts the number of pages actually read. Which makes the
bloom stuff pure overhead. The whole point is to avoid reading leaf nodes for
keys that might be present but turn out not to be (a sketch of that intended
pattern is below). But if it doesn't actually filter well enough for that...
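For reference, the pattern the bloom is supposed to enable looks roughly like
this (illustrative names only; key_to_page() is a stand-in, not a real bzrlib
function):

    # Illustration only: the bloom is meant to drop keys *before* we
    # schedule the leaf page read, so pages whose keys are all absent
    # never get read at all.
    def leaf_pages_to_read(wanted_keys, key_to_page, bloom):
        pages = set()
        for key in wanted_keys:
            if key not in bloom:
                # definite miss -- no page read needed for this key
                continue
            pages.add(key_to_page(key))
        return pages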
4) I really must be doing something else wrong, because the total number of
misses actually *increased* from 100k => 200k even though I was getting 30k
filtered nodes. Maybe iter_entries() is being called with non-unique keys? (I
got rid of the set() call in favor of just sorted() at a higher level, but
maybe we really need the set(); see the trivial sketch below.)
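If duplicates really are the problem, the fix would just be to deduplicate
once at the top level (trivial sketch, hypothetical helper name):

    # Keep the single top-level sort, but restore the de-duplication
    # the old set() call was providing.
    def unique_sorted_keys(keys):
        return sorted(set(keys))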
Ultimately, I think I have a couple of really good patches, and then when I
enable the global bloom filter, things get worse.
So if you want, merge revno 40 or 41, but 42 seems to be a performance loss.
41 would allow us to *write* a global bloom, but doesn't use it. I do feel
like it might be a benefit for remote IO, but so far I don't actually see it
being beneficial.
5) I see calls to insort(), which means that there *are* nodes that are
filtered by the small filters but that miss the global filter. Not sure why,
as these are only 256-byte global filters.
John
=:->