Btree index and blooms
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 8 15:38:06 BST 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Robert Collins wrote:
| On Mon, 2008-07-07 at 21:42 -0500, John Arbash Meinel wrote:
|> In summary, blooms don't win, though we have a compiled line reader,
|> and not a
|> compiled bloom functionality. I'm also a bit curious about switching
|> to
|> BloomMd5 instead of Sha1.
|
| I think we should go ahead without blooms for now, and continue
| investigating them - they are used successfully in other projects with
| similar key lookup needs.
Seems reasonable. I'm a bit curious how much of it to filter out.
|
| My hunch is that a C bit-test facility without be a good starting point.
|
| -Rob
'without'?
Also, you mentioned that lsprof shows 6+ seconds being spent in the
large 'sorted' call, which is a rather large amount of time.
At the moment, '_multi_bisect_right' expects everything to be in sorted
order, so that it can just iterate the two lists, and match things as it
goes. I could relax that constraint and instead bisect each key into the
target keys.
We could do a couple tricks here, like when we get to the next key,
remember where the previous key was, and just do a quick check to see if
it fits in the same place. And if it doesn't, we can restrict the bisect
domain (slightly) because we've already compared the key to other keys
where we are at.
If you have 10,000 keys being spread across 100 targets, it is likely
that a nearly sorted list is going to end up getting a lot of clumps of
keys.
I could even sort them into the target list at this time, and we could
have separate lookups from there, since we know it is a sorted list.
Regardless what we do, I think the biggest win in the multi-bisect is
processing each row with all keys, rather than starting at the top and
going to the bottom for each pass.
Something to think about at least.
I wonder about including the 'bloom_pages=XXX' key, and just having it
set to 0. But then that would expect that clients would properly
recognize the bloom format, which isn't guaranteed. (If we switch to md5
as the hash, etc.)
And certainly, I need to churn out some patches to bzr itself to bring
in the updates to the merge algorithms... I guess that is why I'm
getting paid to do it, instead of just scratching the most interesting itch.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkhze84ACgkQJdeBCYSNAAOAvgCgupYOBsEkLAeJsw5WoLeJzgvk
lDkAn3768/t+AgZ9YGoMSoySaJuRHVNo
=ZrTH
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list