Bloom Filters

John Arbash Meinel john at arbash-meinel.com
Tue Mar 27 18:59:11 BST 2007


John Arbash Meinel wrote:
> Aaron Bentley wrote:
> ...
> 
>> A bloom filter could be used there.  If you are only doing a "sampling",
>> you may be able to get acceptable error rates with less than 500K.
>>
>> Aaron
> 
> Sure. I actually wrote a bloom filter plugin for playing around with it.
> It hooks into the bzr test suite, but doesn't do anything for bzr. It is
> a re-implementation of a 'pybloom' package. But at this point, I don't
> think there is any common code.

I suppose I can summarize my results from it a bit, since I didn't
mention much before.

The basics of bloom filters are that you have an array of bits of a
predetermined length. Every possible key gets mapped to some set of bits
that should be turned on for that key. You can choose the number of bits
per key, but commonly it is something like 4-8.

You can sort of think of it as mapping a key to a set of offsets, and
then recording which offsets have occurred by setting the bit to True at
each of those locations.
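
In code, that idea is roughly this (just an illustration, not the
plugin's actual implementation):

class BloomSketch:
    """Toy bloom filter: remembers which bit offsets have been set."""

    def __init__(self, num_bits, get_offsets):
        # get_offsets(key) must return the bit positions for a key
        self.num_bits = num_bits
        self.get_offsets = get_offsets
        self.bits = set()               # offsets currently set to True

    def insert(self, key):
        self.bits.update(self.get_offsets(key))

    def __contains__(self, key):
        # every bit set -> "probably present" (could be a false positive)
        # any bit clear -> definitely absent
        return all(off in self.bits for off in self.get_offsets(key))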

The original pybloom implementation used the RC4 mixarray to get bit
diffusion, and was defined over the set of integers.

I have 2 implementations: one uses an MD5 hash of the string and breaks
that up into 4 32-bit integer offsets (which means you have a maximum
array length of 2^32, which seems reasonable). I also did a SHA1
implementation, which breaks it into 5 integers.
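
The hashing side of those two variants amounts to something like this
(a sketch of the approach, not the plugin code itself):

import hashlib
import struct

def md5_offsets(key, num_bits):
    # 16-byte MD5 digest -> four 32-bit integers -> four bit offsets
    digest = hashlib.md5(key.encode('utf-8')).digest()
    return [n % num_bits for n in struct.unpack('>4I', digest)]

def sha1_offsets(key, num_bits):
    # 20-byte SHA1 digest -> five 32-bit integers -> five bit offsets
    digest = hashlib.sha1(key.encode('utf-8')).digest()
    return [n % num_bits for n in struct.unpack('>5I', digest)]

Hooked up to the sketch above, that would be used as something like
BloomSketch(500, lambda key: md5_offsets(key, 500)).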

Generally for bloom filters, you compare their performance based on the
number of 'bits per entry' that are available. So if you have 100
entries in your bloom and a 500-bit array, then you have 5 bits per
entry. As you add more entries, the bits/entry goes down and the bloom
whites out (or blacks out, depending on whether 1 is white or black).

So for dense arrays (lots of keys versus the number of bits), theory
says that having fewer offsets is slightly better. In testing with 12000
LP revision ids and 8000 BZR revision ids, I found that the sha1
implementation always had a lower false positive rate.
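
To put numbers on the "fewer offsets is slightly better when dense"
claim: the standard approximation for the false positive rate with m
bits, n entries and k offsets per key is (1 - e^(-k*n/m))^k, which you
can tabulate (theoretical values, not my measured results):

import math

def fpr(bits_per_entry, k):
    # (1 - e^(-k / (m/n)))^k, the usual bloom filter approximation
    return (1.0 - math.exp(-float(k) / bits_per_entry)) ** k

for bpe in (4, 5, 6, 8, 10):
    print('%2d bits/entry:  k=4 -> %5.1f%%   k=5 -> %5.1f%%'
          % (bpe, 100 * fpr(bpe, 4), 100 * fpr(bpe, 5)))

At 5 bits/entry that gives roughly 9% for k=4 versus 10% for k=5, so
theory mildly favours 4 offsets once the filter is dense, even though
the measured sha1 numbers above went the other way.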


While investigating this, I also ran into a few other things...

One of our possibilities was to use a 'base64-like' filename, which was
actually a bloom filter. As long as you don't let your bloom get too
full, the randomness is pretty good, so your chance of collision between
filenames is low. This lets you do a "quick check", from just a
directory listing, to see whether your revision id is likely to be
contained in that file.

However, consider a "safe" filename length of around 100 characters (so
you still have some left for a suffix and prefix, and don't immediately
run into the 260-character MAX_PATH on Windows). I was unable to find 64
'safe' characters to do a base64-like encoding: base64 itself uses
upper- and lower-case characters, which we can't do. I made it to 58
characters, but the rest were either control characters or illegal
filesystem characters on Windows.

So if we cut back to a 'base32', you get 5 bits for every 8-bit
character (rather than 6/8 for base64).

So for 100 characters, you have 500 bits to work with. At 5 bits/entry
you have ~10% FPR (should be slightly less), and it is ~60% gray. (Which
is not great, but should be unique enough. You have around 250 randomly
distributed positive and negative bits).

But 5 b/e and 500 bits only gives you 100 entries.
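
A sketch of what the "bloom as filename" encoding could look like (the
details here are assumptions for illustration, not anything bzr does;
RFC 4648 base32 uses only A-Z and 2-7, which survive case-insensitive
filesystems):

import base64

def bloom_filename(set_bits, num_bits=500):
    # pack the bit array into bytes, then base32-encode it
    data = bytearray((num_bits + 7) // 8)
    for off in set_bits:
        data[off // 8] |= 1 << (off % 8)
    # 500 bits -> 63 bytes -> about 101 base32 characters
    return base64.b32encode(bytes(data)).rstrip(b'=').decode('ascii')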

It is slightly better if we did a pseudo base64 (we could include a few
uppercase characters, gaining a lot of extra bits in exchange for a
slightly increased collision rate), because then you get 6/8 and have
600 bits to work with. But you are still talking about around 120
entries per "packfile", and a 10% FPR. And 100-character random names
are pretty ugly (consider that a sha1 hash is only 40 characters).


And blooms degrade horribly if you put too many entries in there. The
difference is about 2x per bit/entry: at 5 bits/entry you are around
10%, at 6 bits/entry you are around 5%, etc. The rule isn't exact, but
below 4 b/e your filter is so full that pretty much everything hits (and
your filename is no longer unique, either).



Things are a lot better if you are allowed to have KB worth of data
(like in a header of the file).  First, you can get really good
bits/entry rates. At 8, you are around 2% FPR, at 10 you are at 1% FPR.

The bigger advantage, though, is that since you have a revision id, you
know which offsets to read. So if foo@bar-xyz maps to bits 10, 20, 100,
and 1000, you don't have to read everything; you can just read the bytes
that contain those bits. Naturally you don't save a whole lot by not
reading everything (reading 1 page is about the same cost as reading 1
byte), but you might over a remote connection (it makes a great HTTP
partial request).
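
As a sketch of that (the offsets are the hypothetical ones from the
example above, and the header is just illustrative):

def bytes_containing(bit_offsets):
    # bit offset -> index of the byte that holds it
    return sorted(set(off // 8 for off in bit_offsets))

offsets = [10, 20, 100, 1000]
wanted = bytes_containing(offsets)          # [1, 2, 12, 125]
range_header = 'bytes=' + ','.join('%d-%d' % (b, b) for b in wanted)
# -> "bytes=1-1,2-2,12-12,125-125": fetch just those 4 bytes of the
#    header and test the relevant bit in each one.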


One thing I wanted to compare it against was a trie. You have a couple
of conflicting goals there. You could store the revision ids backwards,
which would give very quick fan-out so you could read just the first
couple of generations and know if something was there or not. However,
you lose the space savings of having common prefixes.

The other advantages of a trie are that there are no false positives,
and once you have read the tree, you can keep a pointer to where that
record is located.
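
For comparison, a toy version of that idea (purely illustrative): a
dict-based trie over reversed revision ids, where lookups have no false
positives and hits return the stored pointer.

def trie_insert(root, revid, record_offset):
    node = root
    for ch in reversed(revid):
        node = node.setdefault(ch, {})
    node[None] = record_offset      # pointer to where the record lives

def trie_lookup(root, revid):
    node = root
    for ch in reversed(revid):
        if ch not in node:
            return None             # definitely not present
        node = node[ch]
    return node.get(None)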

I guess "Bloomier" filters allow you to have a reference at the end.
(They work by having a succession of bloom filters based on the current
keys, so that false positives of existing data get looked up in another
bloom filter, until you are sure there are no more misses). But that
looks like the number of root filters is based on how many "values" you
may have. And if you wanted to use 32-bit offsets, they claim you need
64 (2n) bloom filters. And at 10 bits/entry you end up with 640
bits/entry == 80 bytes. So for 1000 entries you need 80KB for the
header. Versus a linear revision id list, which is actually a bit
smaller. (our revision ids are approx 60 chars).
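
Spelling out that arithmetic:

filters = 64              # the claimed 2n root filters for 32-bit values
bits_per_entry = 10
entries = 1000

bloomier_header = filters * bits_per_entry * entries // 8  # 80000 bytes, ~80KB
plain_list = entries * 60                                   # ~60KB of ~60-char ids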

John
=:->



