chk_index using Blooms? (was Re: Prototype "improved_chk_index")
John Arbash Meinel
john at arbash-meinel.com
Tue Nov 3 16:09:40 GMT 2009
Short answer: we should investigate. It has the potential to reduce the
data downloaded by approx 10-20x for keys which are not present, at a
cost of around 5-10% extra on-disk size, plus some overhead when
checking for keys which *are* present. It looks like if the average
query has >=10% of its keys not present in the index, then the bloom
filter is a 'win'.
John Arbash Meinel wrote:
> A while back I put together a draft of a few ideas I had for improving
> our index code to handle CHK pages. The draft is a bit dense with ideas,
> and certainly not polished. It is in the bzr source as
> 'doc/developers/improved_chk_index.txt', or you can read it here:
>
> http://doc.bazaar-vcs.org/latest/developers/improved_chk_index.html
>
>
> I've certainly wanted to work on it for a while, so I decided to take a
> "slack day" away from memory work, etc, and poke around at it. I was
> surprised to find that I have a basic proof-of-concept finished in about
> 1 day.
>
> lp:~jameinel/bzr/chk-index
>
A few more notes on this. Someone just asked me for the code to my
Bloom filter implementation, so I figured I'd put together some
analysis of whether we want bloom filters in the chk-index.
1) Bloom filters are 'happiest' when they are about 50% full (half the
bits are on, half the bits are off). You also get to tune your
false-positive rate (1%, 10%, etc.). At 8 bits/entry (1 byte) we have a
~2% false-positive rate.
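A quick sanity check of these numbers, using the standard Bloom-filter
approximation p = (1 - e^(-k/b))^k for b bits per entry and k probes
(k = 4 here, matching the "4/5 bytes" probe count mentioned later; the
function name is just for illustration):

```python
import math

def bloom_fpr(bits_per_entry, num_probes):
    """Approximate Bloom filter false-positive rate.

    Standard approximation: p = (1 - e^(-k/b))^k, where b is the
    number of bits per entry and k the number of hash probes.
    """
    return (1.0 - math.exp(-num_probes / bits_per_entry)) ** num_probes

for bits in (4, 5, 8):
    print("%d bits/entry: %.1f%% FPR" % (bits, 100 * bloom_fpr(bits, 4)))
# 4 bits/entry: 16.0% FPR
# 5 bits/entry: 9.2% FPR
# 8 bits/entry: 2.4% FPR
```

With 4 probes this reproduces the figures in the post: >16% at 4 bits,
~10% at 5 bits, and ~2% at 8 bits per entry.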
2) The really nice benefit is that we already have a sha hash, so we
don't have to hash it again. (One of the main limiting factors in my
earlier testing was the CPU overhead of hashing all of the keys.)
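One way to exploit that, sketched below (the names and the slicing
scheme are my illustration, not bzr's actual code): since a chk key
already *is* a sha1, independent 4-byte words of the 20-byte digest can
serve directly as the probe positions, with no extra hashing:

```python
import hashlib

def bloom_offsets(sha1_hex, num_bits, num_probes=4):
    """Derive bloom-filter bit offsets from an existing sha1 digest.

    Slices independent 4-byte words out of the 20-byte digest, so a
    sha1-keyed index never needs to hash the key a second time.
    (Illustrative sketch; supports up to 5 probes per 20-byte digest.)
    """
    digest = bytes.fromhex(sha1_hex)
    offsets = []
    for i in range(num_probes):
        word = int.from_bytes(digest[4 * i:4 * i + 4], 'big')
        offsets.append(word % num_bits)
    return offsets

key = hashlib.sha1(b'some chk page').hexdigest()
print(bloom_offsets(key, num_bits=10 * 1024 * 1024 * 8))
```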
3) The main issue is that bloom filters are O(N), rather than
O(log(N)) like the root pages of a tree structure. At 8 bits per key,
you need 10MB for a 10M-entry index. At optimum settings, I was
theoretically able to get the per-entry cost down to ~10 bytes/entry
even on 10M-entry indexes; my current results are about 28 bytes/entry.
10M @ optimal would be a 10% overhead; @ current it is only 3.6%
overhead. We could try for 4 bits per entry, but I see a FPR >16%
there. At 5 bits/entry we are down to a 10% FPR. @ 5 bits, 10M entries
is 6.25MB.
4) I'm assuming that we scale the 'mini-index' to have enough entries
that the 'gap' between entries is ~4k. (So mini-index @aa points to
offset 10,000, mini-index @ab points to ~14,096, etc.)
@ optimal 10 bytes/entry we have 100MB of data, so we want 100MB/4096 =
25,600 mini-index entries. That is > 16k, so we round up to 64k.
(Assuming 4-bit increments, we have 16, 256, 4096, 65536 anyway.)
Each mini-index entry is a 4-byte offset, so we have 256kB, or 0.25MB.
@ 64k index entries, each gap is 1600 bytes.
@ 28 bytes/entry we have 280MB of data, 280MB/4096 = 70,000. So we
*might* bump up to the next order and end up with a 1M-entry
mini-index, though that would leave a gap of only 280 bytes, which may
be a bit small. Even a 1M-entry mini-index would be 4MB, which is still
smaller than the 10MB bloom filter.
@ 64k entries, the gap is 4480 bytes
@ 1M entries, the gap is 280 bytes
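The sizing rule above can be sketched as a small helper (hypothetical
names; entry counts round up in the 4-bit-prefix increments the post
mentions). The gaps it computes come out slightly smaller than the
post's 1600 and 280 bytes, because the post rounds 10M x 10 bytes up to
a binary 100MB:

```python
def mini_index_plan(num_keys, bytes_per_entry, target_gap=4096,
                    entry_size=4):
    """Size a fixed-width mini-index for a target gap between entries.

    Entry counts are rounded up to a power of 16, i.e. the 4-bit
    prefix increments 16, 256, 4096, 65536, ...
    """
    data_size = num_keys * bytes_per_entry
    wanted = data_size // target_gap          # ideal number of entries
    entries = 16
    while entries < wanted:
        entries *= 16
    return {
        'data_size': data_size,
        'mini_index_entries': entries,
        'mini_index_size': entries * entry_size,   # 4 bytes per offset
        'gap': data_size // entries,
    }

print(mini_index_plan(10 * 1000 * 1000, 10))   # optimal ~10 bytes/entry
print(mini_index_plan(10 * 1000 * 1000, 28))   # current ~28 bytes/entry
```

This reproduces the post's choices: 64k entries (256kB of mini-index)
at ~10 bytes/entry, jumping to 1M entries at 28 bytes/entry.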
5) The minimum data we need to read to tell that a single key is
missing: for a bloom filter, we read 4-5 bytes out of the 10MB array.
For the mini-index, we read the 4 bytes of the mini-index entry
(fixed-width entries let us jump exactly to that offset), and then we
read the whole 'gap', which is 280, 1600, or 4480 bytes depending on
the sizing above. So the worst case is 4480 + 4 = 4484 bytes, quite a
bit more than the ~5 bytes for bloom.
6) However, we are likely to read bloom filters and mini-index pages in
ranges > 1 byte, and we are probably more interested in what happens
when checking a batch of keys, not just one.
For 64k keys, we are likely to read all of the mini-index, and
64k * 4-5 bytes of the bloom filter. That is actually a wash, as both
read about the same amount of data. However, the main cost for the
mini-index is that we then have to read all of the actual gap pages,
i.e. the whole index, while for bloom we have a 1-10% false-positive
rate, and only read 1-10% of the total index.
Note that for a 10MB bloom filter, if we choose to read in 'pages' then
we will probably touch all pages well before 64k keys. Specifically,
10MB / (64k keys * 5 probes) = 32 bytes, so for any page size >= 32
bytes we would read on the order of the whole 10MB.
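The batch arithmetic above, spelled out:

```python
# Rough data-read comparison for looking up 64k keys (point 6).
KB, MB = 1024, 1024 * 1024

keys = 64 * 1024
mini_index_read = keys * 4      # whole 64k-entry mini-index, 4B/offset
bloom_read = keys * 5           # ~5 one-byte probes per key

print(mini_index_read / KB)     # 256.0 (kB)
print(bloom_read / KB)          # 320.0 (kB)

# Page size at which the 64k lookups' probes average one per bloom
# filter page, i.e. where paged reads approach the whole 10MB:
bloom_size = 10 * MB
print(bloom_size / (keys * 5))  # 32.0 (bytes)
```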
7) I think in the end we'd need a good number for how often we access
an index for a record that isn't there. If we had a 100% hit rate, then
the 1 byte/entry is pure overhead; if we had a 0% hit rate, then
cutting the amount read down to ~1 byte/entry is a good win. (Though
note you have to include the FPR in the amount read.)
I was originally thinking that a bloom filter would be a bad thing, as
it is O(N), though with a very small constant. However, at just 64k
sha1s queried against a 10M-entry index, we've already hit the point
where we would read the whole index to check whether any keys are
missing.
[note: not entirely true: 1 - (1 - 1/65536)**65536 = 63%, so for the
65536th key there is a 63% chance that it lies in a mini-index region
we already had to read. But we would still read a significant
fraction.]
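The bracketed figure is easy to check directly:

```python
# Chance that a given one of 64k mini-index gap regions is hit by at
# least one of 64k uniformly distributed sha1 keys:
n = 65536
p = 1.0 - (1.0 - 1.0 / n) ** n
print(round(p, 3))   # 0.632, i.e. approximately 1 - 1/e
```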
8) If we approximate the effect by saying that the cost for a missing
key goes from 1.0 => 0.1, and the cost for a present key goes from 1.0
=> 1.1, then the break-even point is
1.0 = 1.1*(1-f) + 0.1*f = 1.1 - 1.1f + 0.1f = 1.1 - f, so f = 0.1.
So if we have 10% of keys not present in this index (90% hit rate) then
we download the same amount of data as if we did not have the bloom
filter. Fewer missing keys results in a net *increase* in bytes
downloaded, more missing keys results in a decrease.
Which fits if you consider 'request 100 keys': 10 of them won't be
present, and those cost 10 * 0.1 = 1 unit of effort; 90 of them will be
present, and those now cost 90 * 1.1 = 99 units; 99 + 1 = 100, which is
the baseline effort.
If we use the 28-bytes-per-entry index, then the present-key cost is
1.036 and the missing-key cost is 0.036, and the break-even point is
likewise at 3.6% missing keys.
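The break-even algebra generalizes to any pair of relative costs;
solving 1.0 = present_cost*(1-f) + missing_cost*f for f (the function
name is illustrative):

```python
def break_even_fraction(present_cost, missing_cost):
    """Missing-key fraction f at which a bloom filter pays for itself.

    Costs are relative to an index without the filter; solves
    1.0 = present_cost * (1 - f) + missing_cost * f for f.
    """
    return (present_cost - 1.0) / (present_cost - missing_cost)

print(round(break_even_fraction(1.1, 0.1), 3))       # 0.1
print(round(break_even_fraction(1.036, 0.036), 3))   # 0.036
```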
These numbers are pretty hand-wavy, as there are a lot of specific
details to work out. However, ~10% missing keys as the break-even point
for having a bloom filter seems like a reasonable estimate.
9) locality
One potential problem with bloom filters is that their data has no
locality, because we use a hash to spread the information around. That
means on average you need to read ~N_keys * 5 scattered locations to
check the bloom filter, while for a btree you may get all your hits on
a single sub-page. However, chk keys don't have locality either, so
there isn't much of a tradeoff in that respect.
John
=:->
More information about the bazaar mailing list