chk_index using Blooms? (was Re: Prototype "improved_chk_index")

John Arbash Meinel john at arbash-meinel.com
Wed Nov 4 18:05:13 GMT 2009



...

> 
> 8) If we approximate the effect by saying that the cost for a missing
> key goes from 1.0 => 0.1, and the cost for a present key goes from 1.0
> => 1.1, then the break-even point is where
> 1.0 = 1.1*(1-f) + 0.1*f = 1.1 - 1.1f + 0.1f = 1.1 - f, so f = 0.1.
> 
> So if 10% of the requested keys are not present in this index (a 90%
> hit rate), we download the same amount of data as if we did not have
> the bloom filter. Fewer missing keys result in a net *increase* in
> bytes downloaded; more missing keys result in a decrease.
> 
> Which fits if you consider a request for 100 keys: 10 of them won't be
> present, and those cost 1 unit of effort in total (10 * 0.1). The other
> 90 will be present, and those now cost 99 units (90 * 1.1). 99 + 1 =
> 100, which is the baseline effort.
> 
> If we use the 28-bytes-per-entry index, then the cost of a present key
> is 1.036 and the cost of a missing key is 0.036, and by the same
> arithmetic the break-even point is 3.6% missing keys.
> 
> These numbers are pretty hand-wavy, as there are a lot of specific
> details to work out. However, ~10% missing keys as the point where a
> bloom filter starts to pay off is a reasonable estimate.
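
To make the break-even arithmetic above concrete, here is a small Python
sketch of the same cost model (the 1.1/0.1 and 1.036/0.036 figures come
from the quoted analysis; the function itself is just illustrative):

  def break_even_fraction(hit_cost, miss_cost):
      """Fraction of missing keys at which the bloom filter pays for itself.

      We want f such that hit_cost*(1-f) + miss_cost*f == 1.0 (the per-key
      cost without the filter), which rearranges to
      f = (hit_cost - 1.0) / (hit_cost - miss_cost).
      """
      return (hit_cost - 1.0) / (hit_cost - miss_cost)

  print(break_even_fraction(1.1, 0.1))      # 0.1   -> 10% missing keys
  print(break_even_fraction(1.036, 0.036))  # 0.036 -> 3.6% missing keys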

So just counting the miss rate across all chk index requests for
'give me the full ancestry', I see huge miss rates.

For a sample bzr.dev repo:
  CHK Key requests: 370343 requested, 109862 found, 260481 missing
For my launchpad repo:
  CHK Key requests: 876570 requested, 424330 found, 452240 missing
For my 'live' bzr repo, extracting just bzr.dev:
  CHK Key requests: 768712 requested, 110360 found, 658352 missing
For my 'live' repo, extracting 'loggerhead':
  CHK Key requests: 29585 requested, 3379 found, 26206 missing
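
Computing the miss fraction from those figures (a trivial sketch, using
just the numbers reported above) shows every case is far above the ~10%
break-even point:

  samples = [
      ('sample bzr.dev repo', 370343, 109862),
      ('launchpad repo', 876570, 424330),
      ("'live' repo, bzr.dev", 768712, 110360),
      ("'live' repo, loggerhead", 29585, 3379),
  ]
  for name, requested, found in samples:
      missing = requested - found
      print('%s: %.1f%% missing' % (name, 100.0 * missing / requested))
  # sample bzr.dev repo: 70.3% missing
  # launchpad repo: 51.6% missing
  # 'live' repo, bzr.dev: 85.6% missing
  # 'live' repo, loggerhead: 88.6% missing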

I think this fundamentally depends on the order of the pack files in
the repo. I remember explicitly re-ordering the pack files in
Packer._update_pack_order, so that when you find a revision in pack X,
that pack is sorted to the start of the search list. It is a bit of a
shame that we 'lost' this functionality when we switched to the
'get_record_stream' model.
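
Something in the spirit of that old _update_pack_order behaviour could
be recovered with a simple move-to-front heuristic. This is only an
illustrative sketch (the per-pack get() call and the search_order list
are hypothetical, not the real pack repository plumbing):

  def promote_pack(search_order, hit_pack):
      """Move the pack that just satisfied a lookup to the front of the
      search list, so future lookups try it first."""
      if search_order and search_order[0] is hit_pack:
          return
      search_order.remove(hit_pack)
      search_order.insert(0, hit_pack)

  def lookup(key, search_order):
      """Probe packs in the current search order, promoting on a hit."""
      for pack in list(search_order):
          value = pack.get(key)   # hypothetical per-pack lookup
          if value is not None:
              promote_pack(search_order, pack)
              return value
      return None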

So, there are a couple of things to look at, but a bloom filter could
help quite a bit here. The code would probably want a rule along the
lines of "if I've loaded everything, ignore the bloom filter". I believe
right now the lookup checks the in-memory dict for matches first, which
means keys found there would skip anything like a bloom filter lookup.
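
A rough sketch of how that lookup order might look (the attribute names
here - _cached_entries, _fully_loaded, _bloom - are hypothetical, only
meant to show "check memory first, consult the bloom filter second, and
skip the filter entirely once everything is loaded"):

  def key_is_possibly_present(index, key):
      """Cheap pre-check before paying for any disk/network reads."""
      if key in index._cached_entries:
          return True
      if index._fully_loaded:
          # Everything is already in the dict, so a miss here is
          # definitive; consulting the bloom filter would only add work.
          return False
      if index._bloom is not None and key not in index._bloom:
          # A bloom filter has no false negatives: the key cannot exist.
          return False
      # Possibly present (or a bloom false positive); do the real read.
      return True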

Of course, the code is also written such that it always reads all of the
groups and all of the mini-index, which we probably want to avoid for
the really big indices. (At 10M entries, that goes up to about 1MB of
content to read: .25MB for the mini-index and .75MB for the groups.)

John
=:->


