B+Tree indices: ongoing progress

John Arbash Meinel john at arbash-meinel.com
Wed Jul 2 15:37:28 BST 2008


Robert Collins wrote:
| On Wed, 2008-07-02 at 17:12 +1000, Robert Collins wrote:
|> On Tue, 2008-07-01 at 22:46 -0500, John Arbash Meinel wrote:
|>>
|>> Robert Collins wrote:
|>> | On Wed, 2008-07-02 at 13:19 +1000, Robert Collins wrote:
|>> |
|>> |
|>> |> And some good news:
|>> |> Baseline overhead: Read 92994 keys in 7.075
|>> |> GraphIndex: Wrote 10147567 bytes in 10.949
|>> |> GraphIndex: iter_all_entries in 9.671
|>> |> GraphIndex: iter_random_one in 11.915
|>> |> GraphIndex: get_components_positions(2000) in 20.535
|>> |> GraphIndex: search_from_revid in 6.516
|>> |> GraphIndex: miss torture in 343.138
|>> |> GraphIndex: -------Done---------
|>> |> Baseline overhead: Read 92994 keys in 7.072
|>> |> BTreeIndex: Wrote 4104336 bytes in 20.299
|>> |> BTreeIndex: iter_all_entries in 1.850
|>> |> BTreeIndex: iter_random_one in 115.365
|>> |> BTreeIndex: get_components_positions(2000) in 33.332
|>> |> BTreeIndex: search_from_revid in 4.581
|>> |> BTreeIndex: miss torture in 41.803
|>> |> BTreeIndex: -------Done---------
|>> |>
|> On the same repository, with bloom filters added:
|> Baseline overhead: Read 92994 keys in 6.734
|> BTreeIndex: Wrote 4108432 bytes in 28.415
|> BTreeIndex: iter_all_entries in 1.955
|> BTreeIndex: iter_random_one in 120.980
|> BTreeIndex: get_components_positions(2000) in 40.829, bloom(144839)
|> BTreeIndex: search_from_revid in 6.911, bloom(32233)
|> BTreeIndex: miss torture in 53.916, bloom(413765)
|> BTreeIndex: -------Done---------
|>
|> Which is to say - we avoid hundreds of thousands of leaf lookups, but
|> the cost of the (sha1-based) bloom lookups exceeds the cost of
|> checking locally. I think the easiest way to test this is to drop
|> disk caches before running the test loop, and I'm about to do that.
|
| with cache dropping:
|
| Baseline overhead: Read 92994 keys in 11.003
| GraphIndex: Wrote 10147567 bytes in 11.606
| GraphIndex: iter_all_entries in 12.029
| GraphIndex: iter_random_one in 14.782
| GraphIndex: get_components_positions(2000) in 24.927, bloom(0)
| GraphIndex: search_from_revid in 10.226, bloom(0)
| GraphIndex: miss torture in 348.412, bloom(0)
| GraphIndex: -------Done---------
| Baseline overhead: Read 92994 keys in 11.336
| BTreeIndex: Wrote 4108432 bytes in 27.952
| BTreeIndex: iter_all_entries in 2.877
| BTreeIndex: iter_random_one in 124.600
| BTreeIndex: get_components_positions(2000) in 48.122, bloom(143304)
| BTreeIndex: search_from_revid in 7.577, bloom(32233)
| BTreeIndex: miss torture in 56.051, bloom(413765)
| BTreeIndex: -------Done---------
|
| Which I find useful; but it's not conclusive. More news tomorrow, I
| think, modulo an update with a better test script and so forth before
| I retire.
|
| -Rob
|

So, a few things here.

1) IO gets a lot more expensive after a cache drop, and more importantly
when accessing a remote repository. So we *might* end up with a bigger
win there. Though if we are switching over to RemoteRPC calls, then
local access to indexes is the only real priority.

2) One problem is that you don't cache any of your sha1 values. So at
each level that has a bloom, you compute *another* sha1 to check whether
the key exists. The sha1 of a key never changes, and if all the bloom
sizes are the same, the derived offsets don't change either. Actually,
we could even save the struct.unpack(), since the integers derived from
the sha1 don't change; we just mask them to get the right location in
the bloom itself. So there is some trivial caching we could do, which
should (possibly dramatically) decrease the cost of bloom searches.
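
Something like this untested sketch is what I have in mind (the names
are mine, not existing bzrlib code; it assumes the 20-byte sha1 unpacks
into 5 offsets and that bloom sizes are powers of two, so we can mask):

    import struct
    from hashlib import sha1

    _offset_cache = {}

    def bloom_offsets(key):
        """The five 32-bit integers derived from sha1(key).

        A key's sha1 never changes, so hash and unpack it once and
        cache the integers; each filter just masks them down to its
        own size.
        """
        try:
            return _offset_cache[key]
        except KeyError:
            offsets = struct.unpack('>5L', sha1(key).digest())
            _offset_cache[key] = offsets
            return offsets

    def bloom_contains(bloom_bytes, key):
        """Check one bloom filter without re-hashing the key.

        Assumes len(bloom_bytes) is a power of two, so the bit mask
        is just num_bits - 1.
        """
        mask = len(bloom_bytes) * 8 - 1
        for value in bloom_offsets(key):
            bit = value & mask
            if not (ord(bloom_bytes[bit >> 3]) & (1 << (bit & 7))):
                return False
        return True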

3) What if we put a single :bloom: record into the overall structure,
instead of having one at each node (or maybe both)? The idea is that
you can fill a 4096-byte page with a bloom filter, and then look it up
once on the first request. If we find it works really well, we can
change the logic so that page[0] is the root node and page[1] is the
bloom. (Or the reverse, but having a fixed-size bloom is probably best.)
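
As a rough sketch of that layout (the reader function is hypothetical,
page numbers as described above):

    PAGE_SIZE = 4096

    def read_global_bloom(index_file):
        # page[0] is the root node; page[1] is the fixed-size bloom.
        # One extra seek+read on the first request fetches the whole
        # filter, and we can keep it in memory from then on.
        index_file.seek(1 * PAGE_SIZE)
        return index_file.read(PAGE_SIZE)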

The bloom could even be spread over multiple pages. We know how many
nodes are in the index from the header. If we just always allocated N
bytes for the overall bloom filter (one byte per key, say), we would
get an approx 2% false-positive rate.
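
A quick sanity check on that figure (my arithmetic, not measured): one
byte per key is 8 bits per key, and with the 5 probes sha1 gives us,
the standard bloom estimate works out to about 2%:

    import math

    def false_positive_rate(bits_per_key, probes=5):
        # Standard bloom estimate: (1 - e**(-k/(m/n))) ** k, with
        # k probes and m/n bits per key.
        return (1 - math.exp(-float(probes) / bits_per_key)) ** probes

    print(false_positive_rate(8))   # -> 0.0217..., about 2%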

For the 1.1M keys in OOo, that would grow the index by 1.1MB. But as
the index itself is 36MB, that is only 3% overhead. And if we *really*
want, we can read just the pages of the bloom for the keys we have
actually hit. Put another way...

The bloom is an array of fixed length. When we check the offsets for a
key, those bits fall at specific locations. Say we hit [10, 20, 500,
600]; we don't have to read anything from 40-400. If we are using SHA1
as our hash, we get 5 offsets per key. So looking up a single key would
cost (at most) 5 page reads, or 4096*5 = 20480 bytes, rather than the
full 1.1MB.
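
A sketch of that partial read, reusing bloom_offsets() from the sketch
above (the file layout names here, like bloom_start, are hypothetical):

    PAGE_SIZE = 4096

    def bloom_pages_for_key(key, num_bits):
        """The distinct 4096-byte pages a single key's probes touch.

        The five offsets are spread at random over the whole filter,
        so one key touches at most five pages.
        """
        pages = set()
        for value in bloom_offsets(key):
            bit = value & (num_bits - 1)
            pages.add((bit >> 3) // PAGE_SIZE)
        return sorted(pages)

    # Reading just those pages costs at most 5 * 4096 = 20480 bytes
    # per key, instead of the full 1.1MB filter:
    #   for page in bloom_pages_for_key(key, num_bits):
    #       index_file.seek(bloom_start + page * PAGE_SIZE)
    #       page_bytes = index_file.read(PAGE_SIZE)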

Unfortunately, as the offsets are intentionally random, looking up more
keys will tend to increase the number of pages read linearly. So 2 keys
== 10 pages, etc.

John
=:->