B+Tree indices: ongoing progress
John Arbash Meinel
john at arbash-meinel.com
Wed Jul 2 16:57:29 BST 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Robert Collins wrote:
| On Wed, 2008-07-02 at 17:12 +1000, Robert Collins wrote:
|> On Tue, 2008-07-01 at 22:46 -0500, John Arbash Meinel wrote:
|>> -----BEGIN PGP SIGNED MESSAGE-----
|>> Hash: SHA1
|>>
|>> Robert Collins wrote:
|>> | On Wed, 2008-07-02 at 13:19 +1000, Robert Collins wrote:
|>> |
|>> |
|>> |> And some good news:
|>> |> Baseline overhead: Read 92994 keys in 7.075
|>> |> GraphIndex: Wrote 10147567 bytes in 10.949
|>> |> GraphIndex: iter_all_entries in 9.671
|>> |> GraphIndex: iter_random_one in 11.915
|>> |> GraphIndex: get_components_positions(2000) in 20.535
|>> |> GraphIndex: search_from_revid in 6.516
|>> |> GraphIndex: miss torture in 343.138
|>> |> GraphIndex: -------Done---------
|>> |> Baseline overhead: Read 92994 keys in 7.072
|>> |> BTreeIndex: Wrote 4104336 bytes in 20.299
|>> |> BTreeIndex: iter_all_entries in 1.850
|>> |> BTreeIndex: iter_random_one in 115.365
|>> |> BTreeIndex: get_components_positions(2000) in 33.332
|>> |> BTreeIndex: search_from_revid in 4.581
|>> |> BTreeIndex: miss torture in 41.803
|>> |> BTreeIndex: -------Done---------
|>> |>
|> On the same repository, with bloom filters added:
|> Baseline overhead: Read 92994 keys in 6.734
|> BTreeIndex: Wrote 4108432 bytes in 28.415
|> BTreeIndex: iter_all_entries in 1.955
|> BTreeIndex: iter_random_one in 120.980
|> BTreeIndex: get_components_positions(2000) in 40.829, bloom(144839)
|> BTreeIndex: search_from_revid in 6.911, bloom(32233)
|> BTreeIndex: miss torture in 53.916, bloom(413765)
|> BTreeIndex: -------Done---------
|>
|> Which is to say - we avoid 100's of thousands of leaf lookups but the
|> cost of the (sha1 based) bloom lookups exceeds the cost of checking
|> locally. I think the easiest way to test this is to drop disk caches
|> before running the test loop, and I'm about to do that.
|
| with cache dropping:
|
| Baseline overhead: Read 92994 keys in 11.003
| GraphIndex: Wrote 10147567 bytes in 11.606
| GraphIndex: iter_all_entries in 12.029
| GraphIndex: iter_random_one in 14.782
| GraphIndex: get_components_positions(2000) in 24.927, bloom(0)
| GraphIndex: search_from_revid in 10.226, bloom(0)
| GraphIndex: miss torture in 348.412, bloom(0)
| GraphIndex: -------Done---------
| Baseline overhead: Read 92994 keys in 11.336
| BTreeIndex: Wrote 4108432 bytes in 27.952
| BTreeIndex: iter_all_entries in 2.877
| BTreeIndex: iter_random_one in 124.600
| BTreeIndex: get_components_positions(2000) in 48.122, bloom(143304)
| BTreeIndex: search_from_revid in 7.577, bloom(32233)
| BTreeIndex: miss torture in 56.051, bloom(413765)
| BTreeIndex: -------Done---------
|
| Which I find useful; but its not conclusive. More news tomorrow I think,
| modulo an update with a better test script and so forth before I retire.
|
| -Rob
|
Looking at the code, I'm not 100% sure that you are properly creating
intermediate blooms. I haven't been able to follow the logic terribly
well. But it feels like revisions are getting added to blooms that don't
apply.
If you have the graph:
Root: B
~ / \
~ A C
~ / \ / \
<A A-<B B-<C C-EOF
All nodes should be added to the B bloom. But only <A -> B should get
added to the A node, and only B => EOF should get added to C.
The loop itself looks like:
for row in self.rows[:-1]:
~ row.bloom.insert(string_key)
global_bloom.insert(string_key)
Anyway, it *feels* like all of the keys under A are being added to C.
Further, your _InternalBuilderRows seem to copy all of the current
global bloom, which is okay, because that is what you do when you insert
a new row at the top.
I feel like when you add a new row writer, you need to reset the bloom.
Specifically here:
if internal_row.writer is None:
~ length = _PAGE_SIZE
~ if internal_row.nodes == 0:
~ length -= 100 # padded
~ # reserve 256 for the bloom + 10 for ':bloom:\n'
~ internal_row.writer = chunk_writer.ChunkWriter(
~ length, 266)
~ internal_row.writer.write(_INTERNAL_FLAG)
~ internal_row.writer.write(_INTERNAL_OFFSET +
~ str(self.rows[pos + 1].nodes) + "\n")
I went ahead and added it. But I'll also test some other stuff in a bit.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkhrpWkACgkQJdeBCYSNAANwkACfU15Su3Jr5W4J/egMPmlITst1
zeUAoJhuHZK52ef4Z/cDNR7AH7fRKq8k
=nOPg
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list