[RFC] Shrinking blooms
John Arbash Meinel
john at arbash-meinel.com
Wed Jul 2 20:02:39 BST 2008
I was trying to work out a way to allow blooms to have different sizes
at different levels, so that a top-level bloom could have lots of bits
while an intermediate one could have fewer.
Here are my thoughts:
1) The raw offsets for all keys are fixed, regardless of the size of the
bloom. Specifically, the raw offsets are the 5 integers derived from the
sha1 sum of the string.
So in theory, you could create a Bloom that used a set() object for all
those integers. It would have an effective bit-space of 2^32.
Then when you want to generate a smaller bloom, you just take each raw
integer modulo the length of your bloom.
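To make that concrete, here is a rough sketch (how the real code splits
the digest into 5 integers is an assumption on my part; this version just
unpacks the 20-byte sha1 as five 32-bit words):

import struct
from hashlib import sha1

def raw_offsets(key):
    # Five raw 32-bit offsets from the sha1 of the key. Only an
    # illustration of the idea; the real byte layout may differ.
    return struct.unpack('>5I', sha1(key).digest())

# The "full resolution" bloom: a set of raw offsets, bit-space 2^32.
global_offsets = set()
for key in [b'file-id-1', b'file-id-2']:
    global_offsets.update(raw_offsets(key))

# Folding down to a concrete bloom of array_len bits via modulo.
array_len = 2048
small_bits = set(offset % array_len for offset in global_offsets)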
2) You could also implement this 2^32-bit bloom as a 2^32-bit string (a
512 MB string). Obviously, using 512MB during the build just for this
string is a bit wasteful. I'm not sure it is *more* wasteful than storing
5 integers for every key, though; for small numbers of keys it certainly is.
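As a rough check of that tradeoff (CPython memory overheads vary between
versions, so treat the numbers as indicative only):

import sys

# Roughly measure what a set of raw offsets costs per key, to compare
# against the flat 512MB string. The offsets here are synthetic.
offsets = set()
num_keys = 100000
for i in range(num_keys):
    offsets.update((i * 5 + j) * 2654435761 % 2**32 for j in range(5))

per_key = (sys.getsizeof(offsets)
           + sum(sys.getsizeof(x) for x in offsets)) / float(num_keys)
print('approx bytes per key: %.0f' % per_key)
# This lands in the low hundreds of bytes per key on CPython, so the set
# only stays cheaper than the 512MB string up to roughly 1-2 million keys.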
3) However, notice the "folding" effect. Specifically, if you use a
keyspace of 2048, you have 11 bits of info in your offsets. In my
implementation I only use powers of 2, so I can apply a bit mask:
array_offset = raw_offset & 0x7FF
(If you were using non-powers-of-2, you would just use mod instead:
array_offset = raw_offset % array_len)
Because I'm using a mask, there is a nice treat: you can chain masks
without any loss. Specifically,
for any integer I: ((I & 0x7FF) & 0xFF) = (I & 0xFF)
This means that as long as you use power-of-two blooms, you can take a
large bloom and map it down to a smaller one just by ORing the bits
together. So to fold a 2048-bit bloom down to 1024 bits, you can do:
small_array = large_array[:1024] | large_array[1024:]
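For instance, a sketch of the fold, assuming the bloom is held as a plain
list of 0/1 ints rather than whatever packed representation the real code
uses:

def fold_down(large_array, new_len):
    # OR together the equally-sized chunks of the large array.
    # Works for any new_len that evenly divides len(large_array).
    small_array = [0] * new_len
    for i, bit in enumerate(large_array):
        small_array[i % new_len] |= bit
    return small_array

# e.g. small = fold_down(large, 1024) for a 2048-bit large array.
# A key with raw offset I that set large[I & 0x7FF] still finds
# small[I & 0x3FF] set, because (I & 0x7FF) & 0x3FF == I & 0x3FF.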
4) This means that the 'global_bloom' object at the top level could be a
very large bloom to handle lots of keys, and then when we go to write it
out we just map it down to a more reasonable size.
At this point, it just depends what we consider "large enough". I would
*like* to start with a bloom with at least 10 bits per entry, since that
gives a < 1% error rate. Unfortunately, I think the code is structured
such that we don't know how many nodes we have until we've gotten there.
For most cases, a 10MB array would be more than enough. It has 80
million bits, so for anything less than 8 million keys, it has an
excellent error rate. The downside is that it is 10MB of memory used
when you only have 2 keys.
1MB is also fairly reasonable even for up to 1M keys.
So is 1MB of extra memory consumption while generating an index
reasonable? We should also have a way to downsize the intermediate nodes.
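For reference, plugging the sizes above into the standard bloom estimate
with 5 probes (matching the 5 sha1-derived offsets) gives:

import math

def bloom_error_rate(num_keys, num_bits, num_hashes=5):
    # Standard estimate: (1 - e^(-k*n/m))^k
    return (1 - math.exp(-float(num_hashes) * num_keys / num_bits)) ** num_hashes

print(bloom_error_rate(8 * 10**6, 80 * 10**6))  # 10MB array, 8M keys: ~0.9%
print(bloom_error_rate(10**6, 8 * 10**6))       # 1MB array, 1M keys: ~2.2%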
5) What would be great is if we could grow the bloom as we add more rows
to the tree, but I don't think that is possible (because you've already
lost information to the current aliasing).
You could approximate it by just duplicating your array. So you would do:
bigger_array = small_array + small_array
This will still give all correct values for existing keys, at the
expense of a worse FPR than if you had started with bigger_array. If
small_array is already mostly full (most bits set), you can't recover
from that. But if the error rate was still very small, you probably
wouldn't lose much.
Maybe that could work. What if we start with an array with a <1% error
rate for 200+ keys (what should fit in a single node)? Heck, a 4096-byte
array should have an error rate of only .46% for up to 2730 keys (12
bits / entry).
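Here is a sketch of that duplication trick, showing why existing keys keep
working (again with a plain 0/1 list standing in for the real array):

def upsize_by_duplication(small_array):
    # Double a power-of-two array by concatenating it with itself.
    return small_array + small_array

N = 1024
small = [0] * N
offset = 0x12345678               # some key's raw offset
small[offset & (N - 1)] = 1

bigger = upsize_by_duplication(small)     # now 2048 bits
# The existing key still tests positive, because for power-of-two N:
#   (offset & (2*N - 1)) & (N - 1) == offset & (N - 1)
# and both halves of bigger are copies of small.
assert bigger[offset & (2 * N - 1)] == 1
# New keys added after the upsize spread over the full 2048 bits, so the
# FPR is better than staying at 1024, though worse than a fresh 2048 array.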
What do you think, Robert? Is it worth coding the ability to upsize and
downsize Blooms?
John
=:->