Rev 13: Merge in Robert's latest work and get the tests passing again. in http://bzr.arbash-meinel.com/plugins/index2

John Arbash Meinel john at arbash-meinel.com
Wed Jul 2 15:45:44 BST 2008


At http://bzr.arbash-meinel.com/plugins/index2

------------------------------------------------------------
revno: 13
revision-id: john at arbash-meinel.com-20080702144538-m492wdmmu7li4ceq
parent: john at arbash-meinel.com-20080702055951-2aadto8k1552yvd5
parent: robertc at robertcollins.net-20080702125159-o72w19dm8g5yrtt0
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: index2
timestamp: Wed 2008-07-02 09:45:38 -0500
message:
  Merge in Robert's latest work and get the tests passing again.
  
  Note that the bloom tests require use_blooms = True for them to actually work.
added:
  indexbench.py                  indexbench.py-20080702083855-5tju02y79rw7kkzh-1
modified:
  __init__.py                    __init__.py-20080624222253-p0x5f92uyh5hw734-5
  btree_index.py                 index.py-20080624222253-p0x5f92uyh5hw734-7
  chunk_writer.py                chunk_writer.py-20080630234519-6ggn4id17nipovny-1
  tests/test_btree_index.py      test_index.py-20080624222253-p0x5f92uyh5hw734-13
  tests/test_chunk_writer.py     test_chunk_writer.py-20080630234519-6ggn4id17nipovny-2
    ------------------------------------------------------------
    revno: 8.1.8
    revision-id: robertc at robertcollins.net-20080702125159-o72w19dm8g5yrtt0
    parent: robertc at robertcollins.net-20080702093116-q5iu1k5bpd7icqlk
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 22:51:59 +1000
    message:
      Allow lsprofiling of individual fixtures, and start applying the results generated from that analysis.
    modified:
      btree_index.py                 index.py-20080624222253-p0x5f92uyh5hw734-7
      indexbench.py                  indexbench.py-20080702083855-5tju02y79rw7kkzh-1
    ------------------------------------------------------------
    revno: 8.1.7
    revision-id: robertc at robertcollins.net-20080702093116-q5iu1k5bpd7icqlk
    parent: robertc at robertcollins.net-20080702085603-pcd6iimxvyji6wtd
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 19:31:16 +1000
    message:
      Allow individual fixture selection.
    modified:
      indexbench.py                  indexbench.py-20080702083855-5tju02y79rw7kkzh-1
    ------------------------------------------------------------
    revno: 8.1.6
    revision-id: robertc at robertcollins.net-20080702085603-pcd6iimxvyji6wtd
    parent: robertc at robertcollins.net-20080702084338-45774breb3cit718
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 18:56:03 +1000
    message:
      Allow only profiling one index class.
    modified:
      indexbench.py                  indexbench.py-20080702083855-5tju02y79rw7kkzh-1
    ------------------------------------------------------------
    revno: 8.1.5
    revision-id: robertc at robertcollins.net-20080702084338-45774breb3cit718
    parent: robertc at robertcollins.net-20080702061310-nsy8pilywn78oewj
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 18:43:38 +1000
    message:
      Crufty but works benchmark-as-plugin
    added:
      indexbench.py                  indexbench.py-20080702083855-5tju02y79rw7kkzh-1
    modified:
      __init__.py                    __init__.py-20080624222253-p0x5f92uyh5hw734-5
      btree_index.py                 index.py-20080624222253-p0x5f92uyh5hw734-7
      tests/test_btree_index.py      test_index.py-20080624222253-p0x5f92uyh5hw734-13
    ------------------------------------------------------------
    revno: 8.1.4
    revision-id: robertc at robertcollins.net-20080702061310-nsy8pilywn78oewj
    parent: robertc at robertcollins.net-20080702060810-2zn70b8n5dbc1jf8
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 16:13:10 +1000
    message:
      Debugging can be good.
    modified:
      btree_index.py                 index.py-20080624222253-p0x5f92uyh5hw734-7
    ------------------------------------------------------------
    revno: 8.1.3
    revision-id: robertc at robertcollins.net-20080702060810-2zn70b8n5dbc1jf8
    parent: robertc at robertcollins.net-20080702043345-w3m0o6tubke13swa
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 16:08:10 +1000
    message:
      Create bloom hashes for internal nodes.
    modified:
      btree_index.py                 index.py-20080624222253-p0x5f92uyh5hw734-7
      chunk_writer.py                chunk_writer.py-20080630234519-6ggn4id17nipovny-1
      tests/test_btree_index.py      test_index.py-20080624222253-p0x5f92uyh5hw734-13
      tests/test_chunk_writer.py     test_chunk_writer.py-20080630234519-6ggn4id17nipovny-2
    ------------------------------------------------------------
    revno: 8.1.2
    revision-id: robertc at robertcollins.net-20080702043345-w3m0o6tubke13swa
    parent: robertc at robertcollins.net-20080702042729-36uy2gxmztjhrzbw
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 14:33:45 +1000
    message:
      Test that bloom is none for internal nodes without one.
    modified:
      tests/test_btree_index.py      test_index.py-20080624222253-p0x5f92uyh5hw734-13
    ------------------------------------------------------------
    revno: 8.1.1
    revision-id: robertc at robertcollins.net-20080702042729-36uy2gxmztjhrzbw
    parent: robertc at robertcollins.net-20080701224410-2lbqoqc2dc5v3iey
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: trunk
    timestamp: Wed 2008-07-02 14:27:29 +1000
    message:
      Add bloom filter parsing to internal nodes.
    modified:
      btree_index.py                 index.py-20080624222253-p0x5f92uyh5hw734-7
      tests/test_btree_index.py      test_index.py-20080624222253-p0x5f92uyh5hw734-13
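
The note above about use_blooms refers to the module-level flag this patch adds to btree_index.py; BTreeGraphIndex.iter_entries samples it on each call, so a test run or benchmark can flip it (and reset the bloom_hits counter) beforehand. A minimal sketch, not part of the commit, using the names from the diff below:

    from bzrlib.plugins.index2 import btree_index

    btree_index.use_blooms = True   # bloom filtering on (the new default)
    btree_index.bloom_hits = 0      # reset the counter before a run
    # ... perform lookups via BTreeGraphIndex.iter_entries(...) ...
    print "blooms filtered out %d key lookups" % btree_index.bloom_hits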
-------------- next part --------------
=== modified file '__init__.py'
--- a/__init__.py	2008-07-01 11:37:43 +0000
+++ b/__init__.py	2008-07-02 08:43:38 +0000
@@ -47,6 +47,10 @@
     'RepositoryFormatPackBTreePlain',
     )
 
+import indexbench
+from bzrlib import commands
+commands.register_command(indexbench.cmd_indexbench)
+
 
 def test_suite():
     # Thunk across to load_tests for niceness with older bzr versions

=== modified file 'btree_index.py'
--- a/btree_index.py	2008-07-02 05:59:51 +0000
+++ b/btree_index.py	2008-07-02 14:45:38 +0000
@@ -17,14 +17,17 @@
 
 """B+Tree indices"""
 
+import array
 from bisect import bisect_left, bisect_right
+from copy import deepcopy
 import tempfile
 import zlib
 
-from bzrlib import debug, index, lru_cache, osutils
+from bzrlib import debug, index, lru_cache, osutils, trace
 from bzrlib import errors as bzrerrors
 from bzrlib.index import _OPTION_NODE_REFS, _OPTION_KEY_ELEMENTS, _OPTION_LEN
 from bzrlib.plugins.index2 import errors, chunk_writer
+from bzrlib.plugins.pybloom.pybloom import BloomSHA1
 
 
 _BTSIGNATURE = "B+Tree Graph Index 1\n"
@@ -35,6 +38,24 @@
 
 _PAGE_SIZE = 4096
 
+bloom_hits = 0
+use_blooms = True
+
+
+class NodeBloom(BloomSHA1):
+    """Subclass of BloomSHA1 that can be read from a file."""
+
+    def __init__(self, bytes):
+        # See BloomSHA1._init_array - which we don't use as we have an array
+        # ready
+        self._num_bytes = len(bytes)
+        self._num_bits = self._num_bytes * 8
+        self._bitmask = self._num_bits - 1
+        self._array = array.array('B', bytes)
+        # we don't know, so make things die if 
+        # methods needing the entry count are called.
+        self._num_entries = None
+
 
 class _BuilderRow(object):
     """The stored state accumulated while writing out a row in the index.
@@ -50,6 +71,39 @@
         self.spool = tempfile.TemporaryFile()
         self.writer = None
 
+    def finish_node(self):
+        byte_lines, _ = self.writer.finish()
+        if self.nodes == 0:
+            # padded note:
+            self.spool.write("\x00" * 100)
+        self.spool.writelines(byte_lines)
+        if self.spool.tell() % _PAGE_SIZE != 0:
+            raise AssertionError("incorrect node length")
+        self.nodes += 1
+        self.writer = None
+
+
+class _InternalBuilderRow(_BuilderRow):
+    """The stored state accumulated while writing out internal rows."""
+
+    def __init__(self, current_global_bloom):
+        """Create a _BuilderRow."""
+        _BuilderRow.__init__(self)
+        self.bloom = deepcopy(current_global_bloom)
+
+    def finish_node(self):
+        bloom_bytes = ":bloom:\n" + self.bloom._array.tostring()
+        if self.writer.write_reserved(bloom_bytes):
+            raise AssertionError("Not enough space for bloom filter.")
+        if 'index' in debug.debug_flags:
+            trace.mutter_callsite(3,
+                "Finished node with bloom filter %r." % self.bloom)
+        _BuilderRow.finish_node(self)
+
+
+class _LeafBuilderRow(_BuilderRow):
+    """The stored state accumulated while writing out a leaf rows."""
+
 
 class BTreeBuilder(index.GraphIndexBuilder):
     """A Builder for B+Tree based Graph indices.
@@ -98,19 +152,8 @@
             # (self.rows[-1]). When we finish a chunk in a row,
             # propogate the key that didn't fit (comes after the chunk) to the
             # row above, transitively.
-            self.rows.append(_BuilderRow())
-            def finish_node(row):
-                byte_lines, next_node_line = row.writer.finish()
-                if row.nodes == 0:
-                    # padded note:
-                    row.spool.write("\x00" * 100)
-                row.spool.writelines(byte_lines)
-                if row.spool.tell() % _PAGE_SIZE != 0:
-                    import pdb; pdb.set_trace()
-                    raise AssertionError("incorrect node length")
-                row.nodes += 1
-                row.writer = None
-
+            global_bloom = BloomSHA1(256 * 8)
+            self.rows.append(_LeafBuilderRow())
             def add_key(string_key, key_line):
                 """Add a key to the current chunk.
                 
@@ -126,8 +169,9 @@
                             length = _PAGE_SIZE
                             if internal_row.nodes == 0:
                                 length -= 100 # padded
+                            # reserve 256 for the bloom + 10 for ':bloom:\n'
                             internal_row.writer = chunk_writer.ChunkWriter(
-                                length)
+                                length, 266)
                             internal_row.writer.write(_INTERNAL_FLAG)
                             internal_row.writer.write(_INTERNAL_OFFSET +
                                 str(self.rows[pos + 1].nodes) + "\n")
@@ -139,7 +183,7 @@
                     self.rows[-1].writer.write(_LEAF_FLAG)
                 if self.rows[-1].writer.write(line):
                     # this key did not fit in the node:
-                    finish_node(self.rows[-1])
+                    self.rows[-1].finish_node()
                     key_line = string_key + "\n"
                     new_row = True
                     for row in reversed(self.rows[:-1]):
@@ -147,7 +191,7 @@
                         # doesn't fit then propogate upwards until we find one that
                         # it does fit into.
                         if row.writer.write(key_line):
-                            finish_node(row)
+                            row.finish_node()
                         else:
                             # We've found a node that can handle the pointer.
                             new_row = False
@@ -156,15 +200,21 @@
                     # division point, then we need a new root:
                     if new_row:
                         # We need a new row
-                        new_row = _BuilderRow()
+                        new_row = _InternalBuilderRow(global_bloom)
                         self.rows.insert(0, new_row)
                         # This will be padded, hence the -100
-                        new_row.writer = chunk_writer.ChunkWriter(_PAGE_SIZE - 100)
+                        # reserve 256 for the bloom + 10 for ':bloom:\n'
+                        new_row.writer = chunk_writer.ChunkWriter(
+                            _PAGE_SIZE - 100, 266)
                         new_row.writer.write(_INTERNAL_FLAG)
                         new_row.writer.write(_INTERNAL_OFFSET +
                             str(self.rows[1].nodes - 1) + "\n")
                         new_row.writer.write(key_line)
                     add_key(string_key, key_line)
+                else:
+                    for row in self.rows[:-1]:
+                        row.bloom.insert(string_key)
+                    global_bloom.insert(string_key)
 
             for key, (absent, references, value) in nodes:
                 if absent:
@@ -181,7 +231,7 @@
                     '\t'.join(flattened_references), value))
                 add_key(string_key, line)
             for row in reversed(self.rows):
-                finish_node(row)
+                row.finish_node()
         result = tempfile.TemporaryFile()
         lines = [_BTSIGNATURE]
         lines.append(_OPTION_NODE_REFS + str(self.reference_lists) + '\n')
@@ -250,16 +300,30 @@
     def __init__(self, bytes):
         """Parse bytes to create an internal node object."""
         # splitlines mangles the \r delimiters.. don't use it.
-        self.keys = self._parse_lines(bytes.split('\n'))
+        self.keys, self.bloom = self._parse_lines(bytes.split('\n'))
 
     def _parse_lines(self, lines):
         nodes = []
         self.offset = int(lines[1][7:])
+        bloom_lines = []
+        in_bloom = False
         for line in lines[2:]:
-            if line == '':
-                return nodes
-            nodes.append(tuple(line.split('\0')))
-        return nodes
+            if not in_bloom:
+                if line == '':
+                    break
+                if line == ':bloom:':
+                    in_bloom = True
+                    continue
+                nodes.append(tuple(line.split('\0')))
+            else:
+                bloom_lines.append(line)
+        if bloom_lines:
+            # put the bloom back together again
+            bloom_bytes = '\n'.join(bloom_lines)
+            bloom = NodeBloom(bloom_bytes)
+        else:
+            bloom = None
+        return nodes, bloom
 
 
 class BTreeGraphIndex(object):
@@ -505,7 +569,15 @@
             keys supplied. No additional keys will be returned, and every
             key supplied that is in the index will be returned.
         """
-        keys = sorted(keys)
+        if self._key_count is None:
+            self._cache_nodes([0])
+        if not self._key_count:
+            return
+        # 6 seconds spent in miss_torture using the sorted() line.
+        # Even with out of order disk IO it seems faster not to sort it when
+        # large queries are being made.
+        # keys = sorted(keys)
+        keys = frozenset(keys)
         if not keys:
             return
         if self._key_count is None:
@@ -514,6 +586,7 @@
             return
         nodes_and_keys = [(0, keys)]
         row_offset = 0
+        _use_bloom = use_blooms
         for row_length in self._row_lengths[:-1]:
             node_indexes = [idx for idx, s_keys in nodes_and_keys]
             nodes = self._get_nodes(node_indexes)
@@ -523,6 +596,15 @@
             next_nodes_and_keys = []
             for node_index, sub_keys in nodes_and_keys:
                 node = nodes[node_index]
+                if _use_bloom and node.bloom is not None:
+                    # If using blooms, filter out any nodes that don't hit.
+                    remaining_sub_keys = [sub_key for sub_key in sub_keys
+                                          if '\x00'.join(sub_key) in node.bloom]
+                    global bloom_hits
+                    bloom_hits += len(sub_keys) - len(remaining_sub_keys)
+                    sub_keys = remaining_sub_keys
+                if isinstance(sub_keys, frozenset):
+                    sub_keys = sorted(sub_keys)
                 positions = self._multi_bisect_right(sub_keys, node.keys)
                 node_offset = row_offset + node.offset
                 next_nodes_and_keys.extend([(node_offset + pos, s_keys)

=== modified file 'chunk_writer.py'
--- a/chunk_writer.py	2008-07-02 05:37:39 +0000
+++ b/chunk_writer.py	2008-07-02 14:45:38 +0000
@@ -35,8 +35,14 @@
 
     _max_repack = 2
 
-    def __init__(self, chunk_size):
-        """Create a ChunkWriter to write chunk_size chunks."""
+    def __init__(self, chunk_size, reserved=0):
+        """Create a ChunkWriter to write chunk_size chunks.
+        
+        :param chunk_size: The total byte count to emit at the end of the
+            chunk.
+        :param reserved: How many bytes to allow for reserved data. reserved
+            data space can only be written to via the write_reserved method.
+        """
         self.chunk_size = chunk_size
         self.compressor = zlib.compressobj()
         self.bytes_in = []
@@ -45,6 +51,7 @@
         self.seen_bytes = 0
         self.num_repack = 0
         self.unused_bytes = None
+        self.reserved_size = reserved
 
     def finish(self):
         """Finish the chunk.
@@ -86,10 +93,25 @@
         If the bytes fit, False is returned. Otherwise True is returned
         and the bytes have not been added to the chunk.
         """
+        return self._write(bytes, False)
+
+    def write_reserved(self, bytes):
+        """Write some bytes to the chunk bypassing the reserved check.
+
+        If the bytes fit, False is returned. Otherwise True is returned
+        and the bytes have not been added to the chunk.
+        """
+        return self._write(bytes, True)
+
+    def _write(self, bytes, reserved):
+        if reserved:
+            capacity = self.chunk_size
+        else:
+            capacity = self.chunk_size - self.reserved_size
         # Check quickly to see if this is likely to put us outside of our
         # budget:
         next_seen_size = self.seen_bytes + len(bytes)
-        if (next_seen_size < 2 * self.chunk_size):
+        if (next_seen_size < 2 * capacity):
             # No need, we assume this will "just fit"
             out = self.compressor.compress(bytes)
             self.bytes_in.append(bytes)
@@ -97,7 +119,7 @@
             if out:
                 self.bytes_list.append(out)
         else:
-            if self.num_repack >= self._max_repack:
+            if not reserved and self.num_repack >= self._max_repack:
                 # We have packed too many times already.
                 return True
             # This may or may not fit, try to add it with Z_SYNC_FLUSH
@@ -109,13 +131,13 @@
                 self.bytes_list.append(out)
             total_len = sum(len(b) for b in self.bytes_list)
             # Give us some extra room for a final Z_FINISH call.
-            if total_len + 10 > self.chunk_size:
+            if total_len + 10 > capacity:
                 # We are over budget, try to squeeze this in without any
                 # Z_SYNC_FLUSH calls
                 self.num_repack += 1
                 bytes_out, compressor = self._recompress_all_bytes_in(bytes)
                 this_len = sum(len(b) for b in bytes_out)
-                if this_len + 10 > self.chunk_size:
+                if this_len + 10 > capacity:
                     # No way we can add anymore, we need to re-pack because our
                     # compressor is now out of sync
                     bytes_out, compressor = self._recompress_all_bytes_in()

=== added file 'indexbench.py'
--- a/indexbench.py	1970-01-01 00:00:00 +0000
+++ b/indexbench.py	2008-07-02 12:51:59 +0000
@@ -0,0 +1,220 @@
+from bzrlib.bzrdir import BzrDir
+import tempfile
+from bzrlib.transport import get_transport
+import sys
+import time
+import os
+import gc
+from random import shuffle, sample
+from bzrlib.plugins.index2.btree_index import BTreeGraphIndex, BTreeBuilder
+from bzrlib.plugins.index2 import btree_index
+from bzrlib.index import GraphIndex, GraphIndexBuilder, CombinedGraphIndex
+from bzrlib.knit import _KnitGraphIndex, KnitVersionedFiles
+from bzrlib.graph import Graph
+from bzrlib.commands import Command
+from bzrlib.option import Option, ListOption
+from bzrlib.lsprof import profile
+
+
+key_count = 0
+
+def drop_caches():
+    """Call a script called drop-caches which should flush os caches."""
+    os.system('drop-caches')
+
+def iter_indices(names, t, factory):
+    for name in sorted(names):
+        size = t.stat(name).st_size
+        yield name, factory(t, name, size)
+
+def copy_indices(indices, target, factory):
+    size = 0
+    for name, index in indices:
+        index.key_count()
+        key_length = index._key_length
+        ref_lists = index.node_ref_lists
+        builder = factory(reference_lists=ref_lists, key_elements=key_length)
+        for node in index.iter_all_entries():
+            builder.add_node(*node[1:])
+        size += target.put_file(name, builder.finish())
+    return size
+
+
+def profile_fixture(class_name, fixture_name, fixture, *args, **kwargs):
+    """Profile a fixture."""
+    _, stats = profile(fixture, *args, **kwargs)
+    output = file(class_name + '.' + fixture_name + '.callgrind', 'wb')
+    stats.calltree(output)
+    output.close()
+
+
+def run_fixture(class_name, fixture_name, fixture, *args, **kwargs):
+    fixture(*args, **kwargs)
+
+
+def time_index(names, source, factory, builder, target, label, name_keys,
+    text_keys, use_drop_cache, fixtures, lsprof):
+    if use_drop_cache:
+        drop_cache = drop_caches
+    else:
+        drop_cache = lambda:None
+    if lsprof:
+        run = profile_fixture
+    else:
+        run = run_fixture
+    drop_cache()
+# Baseline time
+    now = time.time()
+    for name, index in iter_indices(names, source, GraphIndex):
+        index.key_count()
+        for node in index.iter_all_entries():
+            pass
+    finish = time.time()
+    print "Baseline overhead: Read %d keys in %0.3f" % (key_count, finish - now)
+# write time
+    read_time = finish - now
+    drop_cache()
+    now = time.time()
+    length = copy_indices(iter_indices(names, source, GraphIndex),
+        target, builder)
+    finish = time.time()
+    print "%s: Wrote %d bytes in %0.3f" % (label, length, finish - now - read_time)
+# iter_all
+    if 'all' in fixtures:
+        drop_cache()
+        now = time.time()
+        for name, index in iter_indices(names, target, factory):
+            for item in index.iter_all_entries():
+                pass
+        finish = time.time()
+        print "%s: iter_all_entries in %0.3f" % (label, finish - now)
+# random shuffle, all keys (name_keys comes preshuffled)
+    if 'shuffle' in fixtures:
+        drop_cache()
+        now = time.time()
+        for name, index in iter_indices(names, target, factory):
+            order = name_keys[name]
+            shuffle(order)
+            for key in order:
+                index.iter_entries([key]).next()
+        finish = time.time()
+        print "%s: iter_random_one in %0.3f" % (label, finish - now)
+# text extraction simulation (follow a compression graph) for text_keys
+    if 'text' in fixtures:
+        text_names = [name for name in names if name.endswith('.tix')]
+        text_indices = [index for name, index in iter_indices(text_names, target, factory)]
+        text_index = CombinedGraphIndex(text_indices)
+        btree_index.bloom_hits = 0
+        drop_cache()
+        now = time.time()
+        # VersionedFiles can do multi-key extraction, so we can use that.
+        
+        index = _KnitGraphIndex(text_index, lambda:True, deltas=True, parents=True)
+        vf = KnitVersionedFiles(index, None)
+        vf._get_components_positions(text_keys)
+        finish = time.time()
+        print "%s: get_components_positions(%d) in %0.3f, bloom(%d)" % (label, len(text_keys),
+            finish - now, btree_index.bloom_hits)
+# follow a revision graph
+    if 'revision' in fixtures:
+        rev_names = [name for name in names if name.endswith('.rix')]
+        rev_indices = [index for name, index in iter_indices(rev_names, target, factory)]
+        rev_index = CombinedGraphIndex(rev_indices)
+        # pick the lexicographically middle revision, just because.
+        all_revs = sorted(node[1] for node in rev_index.iter_all_entries())
+        revid = all_revs[len(all_revs)/2]
+        # reopen indices
+        rev_indices = [index for name, index in iter_indices(rev_names, target, factory)]
+        rev_index = CombinedGraphIndex(rev_indices)
+        btree_index.bloom_hits = 0
+        drop_cache()
+        now = time.time()
+        # VersionedFiles can do multi-key extraction, so we can use that.
+        index = _KnitGraphIndex(rev_index, lambda:True, deltas=True, parents=True)
+        vf = KnitVersionedFiles(index, None)
+        graph = Graph(vf)
+        search = graph._make_breadth_first_searcher([revid])
+        while True:
+            try:
+                search.next()
+            except StopIteration:
+                break
+        finish = time.time()
+        print "%s: search_from_revid in %0.3f, bloom(%d)" % (label, finish - now, btree_index.bloom_hits)
+# pathologic miss-lookups: ask each index of each type for the keys held by the
+# others of that type
+    if 'miss' in fixtures:
+        run(label, 'miss', miss_torture, label, drop_cache, names, target, factory, name_keys)
+    print "%s: -------Done---------" % (label,)
+
+def miss_torture(label, drop_cache, names, target, factory, name_keys):
+        btree_index.bloom_hits = 0
+        drop_cache()
+        now = time.time()
+        for name, index in iter_indices(names, target, factory):
+            suffix = name[-4:]
+            other_keys = set()
+            for othername, keys in name_keys.items():
+                if othername != name and othername.endswith(suffix):
+                    other_keys.update(keys)
+            for node in index.iter_entries(other_keys):
+                pass
+        finish = time.time()
+        print "%s: miss torture in %0.3f, bloom(%d)" % (label, finish - now, btree_index.bloom_hits)
+
+
+class cmd_indexbench(Command):
+    """Benchmark index performance."""
+
+    takes_options = [
+        Option('graph', help='bench the GraphIndex class'),
+        Option('btree', help='bench the BTreeGraphIndex class'),
+        Option('drop-cache', help='drop caches between fixtures'),
+        ListOption('fixture', type=str,
+            help='fixtures to test: one of all, shuffle, text, revision, miss'),
+        # lspro because --lsprof is a global option, and they interfere with each other.
+        Option('lspro', help='generate class.fixture.callgrind lsprof files'),
+        ]
+
+
+    takes_args = ['sample_repository']
+
+    def run(self, sample_repository, graph=True, btree=True, drop_cache=False,
+        fixture=None, lspro=False):
+        if not fixture:
+            fixture = ['all', 'shuffle', 'text', 'revision', 'miss']
+        source = get_transport(sample_repository)
+        source = source.clone('.bzr/repository/indices')
+        names = source.list_dir('.')
+        name_keys = {}
+        text_keys = set()
+        for name, index in iter_indices(names, source, GraphIndex):
+            name_keys[name] = []
+            for item in index.iter_all_entries():
+                name_keys[name].append(item[1])
+            # Randomise for later tests
+            shuffle(name_keys[name])
+            if name.endswith('.tix'):
+                text_keys.update(name_keys[name])
+        text_keys = list(text_keys)
+        # pick 2000 text keys to pretend to reconstruct
+        text_keys = sample(text_keys, 2000)
+        global key_count
+        key_count = sum(map(len, name_keys.itervalues()))
+
+        if btree:
+            self.test_class(names, source, BTreeGraphIndex, BTreeBuilder,
+                'BTreeIndex', name_keys, text_keys, drop_cache, fixture, lspro)
+        if graph:
+            self.test_class(names, source, GraphIndex, GraphIndexBuilder,
+                'GraphIndex', name_keys, text_keys, drop_cache, fixture, lspro)
+
+    def test_class(self, names, source, factory, builder, label, name_keys,
+        text_keys, drop_cache, fixtures, lsprof):
+        workingdir = tempfile.mkdtemp()
+        t = get_transport(workingdir)
+        try:
+            time_index(names, source, factory, builder, t, label, name_keys,
+                text_keys, drop_cache, fixtures, lsprof)
+        finally:
+            t.delete_tree('.')

=== modified file 'tests/test_btree_index.py'
--- a/tests/test_btree_index.py	2008-07-02 05:40:02 +0000
+++ b/tests/test_btree_index.py	2008-07-02 14:45:38 +0000
@@ -22,6 +22,7 @@
 from bzrlib import tests
 from bzrlib.plugins import index2
 from bzrlib.plugins.index2 import errors, btree_index
+from bzrlib.plugins.pybloom.pybloom import BloomSHA1
 from bzrlib.tests import TestCaseWithTransport
 from bzrlib.transport import get_transport
 
@@ -148,10 +149,19 @@
         leaf1 = content[4096:8192]
         leaf2 = content[8192:]
         root_bytes = zlib.decompress(root)
+        # Create a little bloom by hand
+        bloom = BloomSHA1(256 * 8)
+        # insert every key so it matches the bloom the builder writes
+        for node in nodes:
+            bloom.insert('\x00'.join(node[0]))
+        # get bytes
+        bloom_bytes = bloom._array.tostring()
         expected_root = (
             "type=internal\n"
             "offset=0\n"
             "505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505\n"
+            ":bloom:\n"
+            + bloom_bytes
             )
         self.assertEqual(expected_root, root_bytes)
         # We already know serialisation works for leaves, check key selection:
@@ -165,7 +175,7 @@
         self.assertEqual(800 - 451, len(node.keys))
         self.assertEqual(sorted_node_keys[451:], sorted(node.keys))
 
-    def test_second_internal_node_pointer(self):
+    def test_three_level_tree_details(self):
         # The left most pointer in the second internal node in a row should
         # pointer to the second node that the internal node is for, _not_
         # the first, otherwise the first node overlaps with the last node of
@@ -184,6 +194,8 @@
         self.assertEqual(3, len(index._row_lengths),
             "Not enough rows: %r" % index._row_lengths)
         internal_node1 = index._get_node(1)
+        # Must have a bloom on the first node.
+        self.assertNotEqual(None, internal_node1.bloom)
         internal_node2 = index._get_node(2)
         # The left most node node2 points at should be one after the right most node pointed at by
         # node1.
@@ -194,6 +206,17 @@
         pos = index._row_lengths[0] + index._row_lengths[1] + internal_node2.offset + 1
         leaf = index._get_node(pos)
         self.assertTrue(internal_node2.keys[0] in leaf.keys)
+        # Check the bloom filter for internal_node2: all the keys in the leaf
+        # should appear to be present
+        for key in leaf.keys:
+            self.assertTrue('\x00'.join(key) in internal_node2.bloom)
+        # Check the bloom filter for internal_node1 with its first two nodes in
+        # the same fashion.
+        for offset in [0, 1]:
+            pos = index._row_lengths[0] + index._row_lengths[1] + offset
+            leaf = index._get_node(pos)
+            for key in leaf.keys:
+                self.assertTrue('\x00'.join(key) in internal_node1.bloom)
 
     def test_2_leaves_2_2(self):
         builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2)
@@ -210,12 +233,21 @@
         leaf1 = content[4096:8192]
         leaf2 = content[8192:]
         root_bytes = zlib.decompress(root)
+        # Create a little bloom by hand
+        bloom = BloomSHA1(256 * 8)
+        # insert every key so it matches the bloom the builder writes
+        for node in nodes:
+            bloom.insert('\x00'.join(node[0]))
+        # get bytes
+        bloom_bytes = bloom._array.tostring()
         expected_root = (
             "type=internal\n"
             "offset=0\n"
             "1111111111111111111111111111111111111111\x00"
             "127127127127127127127127127127127127127127127127127127127"
             "127127127127127127127127127127127127127127127127127127127127127\n"
+            ":bloom:\n"
+            + bloom_bytes
             )
         self.assertEqual(expected_root, root_bytes)
         # We assume the other leaf nodes have been written correctly - layering FTW.
@@ -438,6 +470,26 @@
              ('readv',  'index', [(4096, 4096), ], False, None)],
             transport._activity)
 
+    def test_iter_entries_skips_bloom_miss(self):
+        # On a miss, if the bloom says no, no IO is performed.
+        builder = btree_index.BTreeBuilder(key_elements=1)
+        # We need a two level index.
+        nodes = self.make_nodes(1200, 1, 0)
+        for node in nodes:
+            builder.add_node(*node)
+        transport = get_transport('trace+' + self.get_url(''))
+        size = transport.put_file('index', builder.finish())
+        del builder
+        index = btree_index.BTreeGraphIndex(transport, 'index', size)
+        del transport._activity[:]
+        self.assertEqual([], transport._activity)
+        # search for one key
+        found_nodes = list(index.iter_entries([("foo",)]))
+        self.assertEqual([], found_nodes)
+        # Should have read the root node only
+        self.assertEqual([('readv', 'index', [(0, 4096)], False, None)],
+            transport._activity)
+
 
 class TestBTreeNodes(BTreeTestCase):
 
@@ -486,18 +538,50 @@
             "1111111111111111111111111111111111111111\n"
             "2222222222222222222222222222222222222222\n"
             "3333333333333333333333333333333333333333\n"
-            "4444444444444444444444444444444444444444\n")
-        node = btree_index._InternalNode(node_bytes)
-        # We want to bisect to find the right children from this node, so a
-        # vector is most useful.
-        self.assertEqual([
-            ("0000000000000000000000000000000000000000",),
-            ("1111111111111111111111111111111111111111",),
-            ("2222222222222222222222222222222222222222",),
-            ("3333333333333333333333333333333333333333",),
-            ("4444444444444444444444444444444444444444",),
-            ], node.keys)
-        self.assertEqual(1, node.offset)
+            "4444444444444444444444444444444444444444\n"
+            )
+        node = btree_index._InternalNode(node_bytes)
+        # We want to bisect to find the right children from this node, so a
+        # vector is most useful.
+        self.assertEqual([
+            ("0000000000000000000000000000000000000000",),
+            ("1111111111111111111111111111111111111111",),
+            ("2222222222222222222222222222222222222222",),
+            ("3333333333333333333333333333333333333333",),
+            ("4444444444444444444444444444444444444444",),
+            ], node.keys)
+        self.assertEqual(1, node.offset)
+        self.assertEqual(None, node.bloom)
+
+    def test_InternalNode_bloom(self):
+        # Create a little bloom so we can check it is parsed right
+        bloom = BloomSHA1(256 * 8)
+        # set a bit to test
+        bloom.insert("foo-bar")
+        # get bytes
+        bloom_bytes = bloom._array.tostring()
+        node_bytes = ("type=internal\n"
+            "offset=1\n"
+            "0000000000000000000000000000000000000000\n"
+            "1111111111111111111111111111111111111111\n"
+            "2222222222222222222222222222222222222222\n"
+            "3333333333333333333333333333333333333333\n"
+            "4444444444444444444444444444444444444444\n"
+            ":bloom:\n"
+            + bloom_bytes)
+        node = btree_index._InternalNode(node_bytes)
+        # We want to bisect to find the right children from this node, so a
+        # vector is most useful.
+        self.assertEqual([
+            ("0000000000000000000000000000000000000000",),
+            ("1111111111111111111111111111111111111111",),
+            ("2222222222222222222222222222222222222222",),
+            ("3333333333333333333333333333333333333333",),
+            ("4444444444444444444444444444444444444444",),
+            ], node.keys)
+        self.assertEqual(1, node.offset)
+        self.assertTrue("foo-bar" in node.bloom)
+        self.assertFalse("other" in node.bloom)
 
     def test_LeafNode_2_2(self):
         node_bytes = ("type=leaf\n"

=== modified file 'tests/test_chunk_writer.py'
--- a/tests/test_chunk_writer.py	2008-07-01 18:38:55 +0000
+++ b/tests/test_chunk_writer.py	2008-07-02 06:08:10 +0000
@@ -64,3 +64,24 @@
         self.assertEqualDiff(expected_bytes, node_bytes)
         # And the line that failed should have been saved for us
         self.assertEqual(lines[46], unused)
+
+    def test_too_much_data_preserves_reserve_space(self):
+        # Generate enough data to exceed 4K
+        lines = []
+        for group in range(48):
+            offset = group * 50
+            numbers = range(offset, offset + 50)
+            # Create a line with this group
+            lines.append(''.join(map(str, numbers)) + '\n')
+        writer = chunk_writer.ChunkWriter(4096, 256)
+        for line in lines:
+            if writer.write(line):
+                break
+        self.assertFalse(writer.write_reserved("A"*256))
+        bytes_list, unused = writer.finish()
+        node_bytes = self.check_chunk(bytes_list, 4096)
+        # the first 44 lines should have been added
+        expected_bytes = ''.join(lines[:44]) + "A"*256
+        self.assertEqualDiff(expected_bytes, node_bytes)
+        # And the line that failed should have been saved for us
+        self.assertEqual(lines[44], unused)
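
For readers of the reserved-space changes: the intended flow (mirroring _InternalBuilderRow.finish_node and the new chunk_writer test above) is roughly the sketch below. It assumes the ChunkWriter and BloomSHA1 APIs exactly as they appear in this patch; key_lines is a made-up stand-in for the serialised key lines the builder emits.

    from bzrlib.plugins.index2 import chunk_writer
    from bzrlib.plugins.pybloom.pybloom import BloomSHA1

    # hypothetical input: serialised key lines, roughly what the builder produces
    key_lines = ["%040d\x00value\n" % i for i in range(2000)]

    bloom = BloomSHA1(256 * 8)                   # 256 byte filter, as in the builder
    # reserve 256 bytes for the bloom plus 10 for ':bloom:\n'
    writer = chunk_writer.ChunkWriter(4096, 266)
    for line in key_lines:
        if writer.write(line):
            # did not fit; the real builder propagates this key to the row above
            break
        bloom.insert(line.split('\x00')[0])
    # the reserved region is still writable even though the normal budget is spent
    if writer.write_reserved(":bloom:\n" + bloom._array.tostring()):
        raise AssertionError("Not enough space for bloom filter.")
    byte_lines, unused = writer.finish()         # unused is the line that overflowed, if any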


