Btree index and blooms
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 8 03:42:59 BST 2008
John Arbash Meinel wrote:
> Well, I finally spent enough time to get the code written to have a
> whole-index bloom written down and loaded back. I tried a few ways to make it
> cheaper to use the blooms, but in the end, they are always slower locally than
> not having them.
>
> The #1 issue is the time to convert Keys into offsets and then check in the
> array to see if the bits are set. I have the feeling that this is mostly
> python overhead, and if we made blooms a C extension we *might* see a benefit.
>
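To make the per-key cost concrete, here is a rough sketch of that conversion
in pure Python (not the actual bzr code; splitting the sha1 digest into five
32-bit probe offsets is an assumption for illustration):

    import struct
    from hashlib import sha1

    def bloom_may_contain(bloom_bytes, num_bits, key):
        # key is a tuple of byte strings, e.g. (b'file-id', b'revision-id')
        digest = sha1(b'\0'.join(key)).digest()
        # a 20-byte sha1 digest unpacks into five 32-bit words, one per probe
        for word in struct.unpack('>5L', digest):
            bit = word % num_bits
            if not bloom_bytes[bit >> 3] & (1 << (bit & 7)):
                return False    # definitely not present
        return True             # possibly present (false positives allowed)

Every lookup pays for the hash, the unpack, the modulo, and the bit tests,
which is exactly the per-key overhead being described above.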
> Things I tried:
...
> 2) I can't help but feel I'm missing something important. But at least for the
> 'miss' test, the sha1sum + struct.unpack time costs more than all the
> bisecting and parsing. And in 2.5, python is supposed to have fast libraries
> for both (and it maintains Struct objects in a 100-entry cache, so it shouldn't
> be struct compile-time overhead).
>
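One cheap thing to check (a generic sketch, not something from the bzr tree)
is whether even that format-cache lookup matters, by holding a precompiled
struct.Struct instead of passing a format string to struct.unpack each time:

    import struct
    from timeit import timeit

    digest = b'\x00' * 20
    five_words = struct.Struct('>5L')

    print(timeit(lambda: struct.unpack('>5L', digest)))
    print(timeit(lambda: five_words.unpack(digest)))

The '>5L' format is only an assumed example; the point is to see how much of
the unpack cost is format handling versus the copy itself.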
...
> 5) I see calls to insort(), which means that there *are* nodes that get
> filtered by the small filters but slip past the global filter. Not sure why, as
> these are only 256-byte global filters.
So... it turns out that I was using the wrong list: I was filtering the nodes,
but then forgetting to use the filtered list.
With that changed, I don't see blooms being as bad, and I see that the node
bloom doesn't get hit at all if there is a global bloom.
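For reference, the bug amounted to something like this (hypothetical names,
not the actual code):

    def filtered_lookup(keys, bloom_may_contain, real_lookup):
        # drop keys the bloom says are definitely absent
        candidates = [k for k in keys if bloom_may_contain(k)]
        # the bug was to keep iterating `keys` here instead of `candidates`,
        # so every key still paid the full bisect/parse cost
        return dict((k, real_lookup(k)) for k in candidates)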
What confuses me right now is how low my actual hit rate is. I have some
fairly large, and fairly empty, blooms:
<NodeBloom @ 0x98a172c: 524288 bits, 34.3% set,~0.47% FPR>
But the miss torture test doesn't seem to care about that:
bloom(317240,0) leaf_value(0,0) miss(1478725), internal(0,9),
I was a bit confused about the 1.5M miss count when the blooms are only
filtering 317k, especially given the <1% error rate I should have been getting.
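As a sanity check on the numbers (the probe count isn't stated here, so five
probes is an assumption that happens to reproduce the quoted figure):

    p = 0.343            # fraction of bits set, from the NodeBloom repr above
    k = 5                # assumed number of probes per key
    fpr = p ** k
    print('FPR ~ %.2f%%' % (fpr * 100))          # ~0.47%
    # with that FPR, nearly all of the 1,478,725 true misses should have been
    # caught; only about 7,000 would be expected to leak through
    print('expected leak-through: %d' % (1478725 * fpr))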
And then I realized that the small indexes don't have blooms (if you only have
a single leaf node, one read gives you everything anyway). So I changed the code
to only compute the key => offset mapping if it would actually be used in a bloom.
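A minimal sketch of that change (hypothetical class, not the actual patch):
the digest and offsets are only computed the first time a bloom actually asks
for them, so keys that never reach a bloom never pay the hashing cost.

    import struct
    from hashlib import sha1

    class LazyBloomOffsets(object):
        """Compute a key's bloom probe offsets only on first use."""

        def __init__(self, key):
            self.key = key
            self._offsets = None

        def offsets(self):
            if self._offsets is None:
                digest = sha1(b'\0'.join(self.key)).digest()
                self._offsets = struct.unpack('>5L', digest)
            return self._offsets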
Anyway, all of this made things *better*, but the common case is still worse
with blooms enabled than with them disabled.
I haven't tried it with --drop-caches yet, or over a remote connection, where
the latency effects might start to tip things in favor of blooms.
In summary, blooms don't win, though keep in mind that we have a compiled line
reader but no compiled bloom implementation. I'm also a bit curious about
switching to BloomMd5 instead of Sha1.
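On that last point: md5 digests are 16 bytes (four 32-bit words) against
sha1's 20, and md5 is usually a bit cheaper per call, so it is easy to get a
rough feel for the difference (a generic timing sketch, not a measurement of
the index code):

    from timeit import timeit

    setup = "from hashlib import md5, sha1; key = b'file-id\\x00rev-id'"
    print(timeit('sha1(key).digest()', setup=setup))
    print(timeit('md5(key).digest()', setup=setup))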
John
=:->