Rev 13: Merge in Robert's latest work and get the tests passing again. in http://bzr.arbash-meinel.com/plugins/index2
John Arbash Meinel
john at arbash-meinel.com
Wed Jul 2 15:45:44 BST 2008
At http://bzr.arbash-meinel.com/plugins/index2
------------------------------------------------------------
revno: 13
revision-id: john at arbash-meinel.com-20080702144538-m492wdmmu7li4ceq
parent: john at arbash-meinel.com-20080702055951-2aadto8k1552yvd5
parent: robertc at robertcollins.net-20080702125159-o72w19dm8g5yrtt0
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: index2
timestamp: Wed 2008-07-02 09:45:38 -0500
message:
Merge in Robert's latest work and get the tests passing again.
Note that the bloom tests require use_blooms = True for them to actually work.
added:
indexbench.py indexbench.py-20080702083855-5tju02y79rw7kkzh-1
modified:
__init__.py __init__.py-20080624222253-p0x5f92uyh5hw734-5
btree_index.py index.py-20080624222253-p0x5f92uyh5hw734-7
chunk_writer.py chunk_writer.py-20080630234519-6ggn4id17nipovny-1
tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
tests/test_chunk_writer.py test_chunk_writer.py-20080630234519-6ggn4id17nipovny-2
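
use_blooms is a module-level flag in btree_index rather than a per-index option, so the bloom code paths can be toggled at runtime before running the tests. A minimal sketch, assuming the plugin is installed as bzrlib.plugins.index2:

    from bzrlib.plugins.index2 import btree_index
    # iter_entries() only consults the per-node bloom filters when this is set
    btree_index.use_blooms = True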
------------------------------------------------------------
revno: 8.1.8
revision-id: robertc at robertcollins.net-20080702125159-o72w19dm8g5yrtt0
parent: robertc at robertcollins.net-20080702093116-q5iu1k5bpd7icqlk
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 22:51:59 +1000
message:
Allow lsprofiling of individual fixtures, and start applying the results generated from that analysis.
modified:
btree_index.py index.py-20080624222253-p0x5f92uyh5hw734-7
indexbench.py indexbench.py-20080702083855-5tju02y79rw7kkzh-1
------------------------------------------------------------
revno: 8.1.7
revision-id: robertc at robertcollins.net-20080702093116-q5iu1k5bpd7icqlk
parent: robertc at robertcollins.net-20080702085603-pcd6iimxvyji6wtd
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 19:31:16 +1000
message:
Allow individual fixture selection.
modified:
indexbench.py indexbench.py-20080702083855-5tju02y79rw7kkzh-1
------------------------------------------------------------
revno: 8.1.6
revision-id: robertc at robertcollins.net-20080702085603-pcd6iimxvyji6wtd
parent: robertc at robertcollins.net-20080702084338-45774breb3cit718
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 18:56:03 +1000
message:
Allow only profiling one index class.
modified:
indexbench.py indexbench.py-20080702083855-5tju02y79rw7kkzh-1
------------------------------------------------------------
revno: 8.1.5
revision-id: robertc at robertcollins.net-20080702084338-45774breb3cit718
parent: robertc at robertcollins.net-20080702061310-nsy8pilywn78oewj
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 18:43:38 +1000
message:
Crufty but works benchmark-as-plugin
added:
indexbench.py indexbench.py-20080702083855-5tju02y79rw7kkzh-1
modified:
__init__.py __init__.py-20080624222253-p0x5f92uyh5hw734-5
btree_index.py index.py-20080624222253-p0x5f92uyh5hw734-7
tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
------------------------------------------------------------
revno: 8.1.4
revision-id: robertc at robertcollins.net-20080702061310-nsy8pilywn78oewj
parent: robertc at robertcollins.net-20080702060810-2zn70b8n5dbc1jf8
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 16:13:10 +1000
message:
Debugging can be good.
modified:
btree_index.py index.py-20080624222253-p0x5f92uyh5hw734-7
------------------------------------------------------------
revno: 8.1.3
revision-id: robertc at robertcollins.net-20080702060810-2zn70b8n5dbc1jf8
parent: robertc at robertcollins.net-20080702043345-w3m0o6tubke13swa
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 16:08:10 +1000
message:
Create bloom hashes for internal nodes.
modified:
btree_index.py index.py-20080624222253-p0x5f92uyh5hw734-7
chunk_writer.py chunk_writer.py-20080630234519-6ggn4id17nipovny-1
tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
tests/test_chunk_writer.py test_chunk_writer.py-20080630234519-6ggn4id17nipovny-2
------------------------------------------------------------
revno: 8.1.2
revision-id: robertc at robertcollins.net-20080702043345-w3m0o6tubke13swa
parent: robertc at robertcollins.net-20080702042729-36uy2gxmztjhrzbw
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 14:33:45 +1000
message:
Test that bloom is none for internal nodes without one.
modified:
tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
------------------------------------------------------------
revno: 8.1.1
revision-id: robertc at robertcollins.net-20080702042729-36uy2gxmztjhrzbw
parent: robertc at robertcollins.net-20080701224410-2lbqoqc2dc5v3iey
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-07-02 14:27:29 +1000
message:
Add bloom filter parsing to internal nodes.
modified:
btree_index.py index.py-20080624222253-p0x5f92uyh5hw734-7
tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
-------------- next part --------------
=== modified file '__init__.py'
--- a/__init__.py 2008-07-01 11:37:43 +0000
+++ b/__init__.py 2008-07-02 08:43:38 +0000
@@ -47,6 +47,10 @@
'RepositoryFormatPackBTreePlain',
)
+import indexbench
+from bzrlib import commands
+commands.register_command(indexbench.cmd_indexbench)
+
def test_suite():
# Thunk across to load_tests for niceness with older bzr versions
=== modified file 'btree_index.py'
--- a/btree_index.py 2008-07-02 05:59:51 +0000
+++ b/btree_index.py 2008-07-02 14:45:38 +0000
@@ -17,14 +17,17 @@
"""B+Tree indices"""
+import array
from bisect import bisect_left, bisect_right
+from copy import deepcopy
import tempfile
import zlib
-from bzrlib import debug, index, lru_cache, osutils
+from bzrlib import debug, index, lru_cache, osutils, trace
from bzrlib import errors as bzrerrors
from bzrlib.index import _OPTION_NODE_REFS, _OPTION_KEY_ELEMENTS, _OPTION_LEN
from bzrlib.plugins.index2 import errors, chunk_writer
+from bzrlib.plugins.pybloom.pybloom import BloomSHA1
_BTSIGNATURE = "B+Tree Graph Index 1\n"
@@ -35,6 +38,24 @@
_PAGE_SIZE = 4096
+bloom_hits = 0
+use_blooms = True
+
+
+class NodeBloom(BloomSHA1):
+ """Subclass of BloomSHA1 that can be read from a file."""
+
+ def __init__(self, bytes):
+ # See BloomSHA1._init_array - which we don't use as we have an array
+ # ready
+ self._num_bytes = len(bytes)
+ self._num_bits = self._num_bytes * 8
+ self._bitmask = self._num_bits - 1
+ self._array = array.array('B', bytes)
+ # We don't know the entry count, so make things die if
+ # methods needing the entry count are called.
+ self._num_entries = None
+
class _BuilderRow(object):
"""The stored state accumulated while writing out a row in the index.
@@ -50,6 +71,39 @@
self.spool = tempfile.TemporaryFile()
self.writer = None
+ def finish_node(self):
+ byte_lines, _ = self.writer.finish()
+ if self.nodes == 0:
+ # padded node:
+ self.spool.write("\x00" * 100)
+ self.spool.writelines(byte_lines)
+ if self.spool.tell() % _PAGE_SIZE != 0:
+ raise AssertionError("incorrect node length")
+ self.nodes += 1
+ self.writer = None
+
+
+class _InternalBuilderRow(_BuilderRow):
+ """The stored state accumulated while writing out internal rows."""
+
+ def __init__(self, current_global_bloom):
+ """Create a _BuilderRow."""
+ _BuilderRow.__init__(self)
+ self.bloom = deepcopy(current_global_bloom)
+
+ def finish_node(self):
+ bloom_bytes = ":bloom:\n" + self.bloom._array.tostring()
+ if self.writer.write_reserved(bloom_bytes):
+ raise AssertionError("Not enough space for bloom filter.")
+ if 'index' in debug.debug_flags:
+ trace.mutter_callsite(3,
+ "Finished node with bloom filter %r." % self.bloom)
+ _BuilderRow.finish_node(self)
+
+
+class _LeafBuilderRow(_BuilderRow):
+ """The stored state accumulated while writing out a leaf rows."""
+
class BTreeBuilder(index.GraphIndexBuilder):
"""A Builder for B+Tree based Graph indices.
@@ -98,19 +152,8 @@
# (self.rows[-1]). When we finish a chunk in a row,
# propogate the key that didn't fit (comes after the chunk) to the
# row above, transitively.
- self.rows.append(_BuilderRow())
- def finish_node(row):
- byte_lines, next_node_line = row.writer.finish()
- if row.nodes == 0:
- # padded note:
- row.spool.write("\x00" * 100)
- row.spool.writelines(byte_lines)
- if row.spool.tell() % _PAGE_SIZE != 0:
- import pdb; pdb.set_trace()
- raise AssertionError("incorrect node length")
- row.nodes += 1
- row.writer = None
-
+ global_bloom = BloomSHA1(256 * 8)
+ self.rows.append(_LeafBuilderRow())
def add_key(string_key, key_line):
"""Add a key to the current chunk.
@@ -126,8 +169,9 @@
length = _PAGE_SIZE
if internal_row.nodes == 0:
length -= 100 # padded
+ # reserve 266 bytes: 256 for the bloom plus 8 for ':bloom:\n', with 2 spare
internal_row.writer = chunk_writer.ChunkWriter(
- length)
+ length, 266)
internal_row.writer.write(_INTERNAL_FLAG)
internal_row.writer.write(_INTERNAL_OFFSET +
str(self.rows[pos + 1].nodes) + "\n")
@@ -139,7 +183,7 @@
self.rows[-1].writer.write(_LEAF_FLAG)
if self.rows[-1].writer.write(line):
# this key did not fit in the node:
- finish_node(self.rows[-1])
+ self.rows[-1].finish_node()
key_line = string_key + "\n"
new_row = True
for row in reversed(self.rows[:-1]):
@@ -147,7 +191,7 @@
# doesn't fit then propogate upwards until we find one that
# it does fit into.
if row.writer.write(key_line):
- finish_node(row)
+ row.finish_node()
else:
# We've found a node that can handle the pointer.
new_row = False
@@ -156,15 +200,21 @@
# division point, then we need a new root:
if new_row:
# We need a new row
- new_row = _BuilderRow()
+ new_row = _InternalBuilderRow(global_bloom)
self.rows.insert(0, new_row)
# This will be padded, hence the -100
- new_row.writer = chunk_writer.ChunkWriter(_PAGE_SIZE - 100)
+ # reserve 266 bytes: 256 for the bloom plus 8 for ':bloom:\n', with 2 spare
+ new_row.writer = chunk_writer.ChunkWriter(
+ _PAGE_SIZE - 100, 266)
new_row.writer.write(_INTERNAL_FLAG)
new_row.writer.write(_INTERNAL_OFFSET +
str(self.rows[1].nodes - 1) + "\n")
new_row.writer.write(key_line)
add_key(string_key, key_line)
+ else:
+ for row in self.rows[:-1]:
+ row.bloom.insert(string_key)
+ global_bloom.insert(string_key)
for key, (absent, references, value) in nodes:
if absent:
@@ -181,7 +231,7 @@
'\t'.join(flattened_references), value))
add_key(string_key, line)
for row in reversed(self.rows):
- finish_node(row)
+ row.finish_node()
result = tempfile.TemporaryFile()
lines = [_BTSIGNATURE]
lines.append(_OPTION_NODE_REFS + str(self.reference_lists) + '\n')
@@ -250,16 +300,30 @@
def __init__(self, bytes):
"""Parse bytes to create an internal node object."""
# splitlines mangles the \r delimiters.. don't use it.
- self.keys = self._parse_lines(bytes.split('\n'))
+ self.keys, self.bloom = self._parse_lines(bytes.split('\n'))
def _parse_lines(self, lines):
nodes = []
self.offset = int(lines[1][7:])
+ bloom_lines = []
+ in_bloom = False
for line in lines[2:]:
- if line == '':
- return nodes
- nodes.append(tuple(line.split('\0')))
- return nodes
+ if not in_bloom:
+ if line == '':
+ break
+ if line == ':bloom:':
+ in_bloom = True
+ continue
+ nodes.append(tuple(line.split('\0')))
+ else:
+ bloom_lines.append(line)
+ if bloom_lines:
+ # put the bloom back together again
+ bloom_bytes = '\n'.join(bloom_lines)
+ bloom = NodeBloom(bloom_bytes)
+ else:
+ bloom = None
+ return nodes, bloom
class BTreeGraphIndex(object):
@@ -505,7 +569,15 @@
keys supplied. No additional keys will be returned, and every
key supplied that is in the index will be returned.
"""
- keys = sorted(keys)
+ if self._key_count is None:
+ self._cache_nodes([0])
+ if not self._key_count:
+ return
+ # 6 seconds spent in miss_torture using the sorted() line.
+ # Even with out of order disk IO it seems faster not to sort it when
+ # large queries are being made.
+ # keys = sorted(keys)
+ keys = frozenset(keys)
if not keys:
return
if self._key_count is None:
@@ -514,6 +586,7 @@
return
nodes_and_keys = [(0, keys)]
row_offset = 0
+ _use_bloom = use_blooms
for row_length in self._row_lengths[:-1]:
node_indexes = [idx for idx, s_keys in nodes_and_keys]
nodes = self._get_nodes(node_indexes)
@@ -523,6 +596,15 @@
next_nodes_and_keys = []
for node_index, sub_keys in nodes_and_keys:
node = nodes[node_index]
+ if _use_bloom and node.bloom is not None:
+ # If using blooms, filter out any nodes that don't hit.
+ remaining_sub_keys = [sub_key for sub_key in sub_keys
+ if '\x00'.join(sub_key) in node.bloom]
+ global bloom_hits
+ bloom_hits += len(sub_keys) - len(remaining_sub_keys)
+ sub_keys = remaining_sub_keys
+ if isinstance(sub_keys, frozenset):
+ sub_keys = sorted(sub_keys)
positions = self._multi_bisect_right(sub_keys, node.keys)
node_offset = row_offset + node.offset
next_nodes_and_keys.extend([(node_offset + pos, s_keys)
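
To make the serialisation and query paths above concrete, here is a minimal round-trip sketch using the BloomSHA1/NodeBloom API exactly as it appears in this diff (keys are \x00-joined key tuples, and the raw array bytes are what finish_node() writes after the ':bloom:' marker):

    from bzrlib.plugins.pybloom.pybloom import BloomSHA1
    from bzrlib.plugins.index2.btree_index import NodeBloom

    # Builder side: a 2048-bit (256 byte) filter, one insert per string key.
    bloom = BloomSHA1(256 * 8)
    bloom.insert('\x00'.join(('rev-1',)))
    bloom_bytes = bloom._array.tostring()

    # Reader side: NodeBloom wraps the raw bytes parsed out of the page.
    reread = NodeBloom(bloom_bytes)
    assert '\x00'.join(('rev-1',)) in reread
    # A hit only means "probably present", but a miss is definite, which is
    # what lets iter_entries() drop sub_keys without reading child nodes.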
=== modified file 'chunk_writer.py'
--- a/chunk_writer.py 2008-07-02 05:37:39 +0000
+++ b/chunk_writer.py 2008-07-02 14:45:38 +0000
@@ -35,8 +35,14 @@
_max_repack = 2
- def __init__(self, chunk_size):
- """Create a ChunkWriter to write chunk_size chunks."""
+ def __init__(self, chunk_size, reserved=0):
+ """Create a ChunkWriter to write chunk_size chunks.
+
+ :param chunk_size: The total byte count to emit at the end of the
+ chunk.
+ :param reserved: How many bytes to allow for reserved data. reserved
+ data space can only be written to via the write_reserved method.
+ """
self.chunk_size = chunk_size
self.compressor = zlib.compressobj()
self.bytes_in = []
@@ -45,6 +51,7 @@
self.seen_bytes = 0
self.num_repack = 0
self.unused_bytes = None
+ self.reserved_size = reserved
def finish(self):
"""Finish the chunk.
@@ -86,10 +93,25 @@
If the bytes fit, False is returned. Otherwise True is returned
and the bytes have not been added to the chunk.
"""
+ return self._write(bytes, False)
+
+ def write_reserved(self, bytes):
+ """Write some bytes to the chunk bypassing the reserved check.
+
+ If the bytes fit, False is returned. Otherwise True is returned
+ and the bytes have not been added to the chunk.
+ """
+ return self._write(bytes, True)
+
+ def _write(self, bytes, reserved):
+ if reserved:
+ capacity = self.chunk_size
+ else:
+ capacity = self.chunk_size - self.reserved_size
# Check quickly to see if this is likely to put us outside of our
# budget:
next_seen_size = self.seen_bytes + len(bytes)
- if (next_seen_size < 2 * self.chunk_size):
+ if (next_seen_size < 2 * capacity):
# No need, we assume this will "just fit"
out = self.compressor.compress(bytes)
self.bytes_in.append(bytes)
@@ -97,7 +119,7 @@
if out:
self.bytes_list.append(out)
else:
- if self.num_repack >= self._max_repack:
+ if not reserved and self.num_repack >= self._max_repack:
# We have packed too many times already.
return True
# This may or may not fit, try to add it with Z_SYNC_FLUSH
@@ -109,13 +131,13 @@
self.bytes_list.append(out)
total_len = sum(len(b) for b in self.bytes_list)
# Give us some extra room for a final Z_FINISH call.
- if total_len + 10 > self.chunk_size:
+ if total_len + 10 > capacity:
# We are over budget, try to squeeze this in without any
# Z_SYNC_FLUSH calls
self.num_repack += 1
bytes_out, compressor = self._recompress_all_bytes_in(bytes)
this_len = sum(len(b) for b in bytes_out)
- if this_len + 10 > self.chunk_size:
+ if this_len + 10 > capacity:
# No way we can add anymore, we need to re-pack because our
# compressor is now out of sync
bytes_out, compressor = self._recompress_all_bytes_in()
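
A short usage sketch of the reserved-space API (key_lines and bloom_bytes are hypothetical; the ChunkWriter signature is the one shown above): plain write() calls see a capacity of chunk_size - reserved, while write_reserved() may use the full chunk, so the bloom trailer appended at finish_node() time should always fit:

    writer = chunk_writer.ChunkWriter(4096, 266)  # page size, reserved bytes
    for line in key_lines:
        if writer.write(line):
            break  # True means the line did not fit and must propagate upward
    # The reserved 266 bytes are still available for the bloom trailer.
    writer.write_reserved(':bloom:\n' + bloom_bytes)
    byte_lines, unused = writer.finish()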
=== added file 'indexbench.py'
--- a/indexbench.py 1970-01-01 00:00:00 +0000
+++ b/indexbench.py 2008-07-02 12:51:59 +0000
@@ -0,0 +1,220 @@
+from bzrlib.bzrdir import BzrDir
+import tempfile
+from bzrlib.transport import get_transport
+import sys
+import time
+import os
+import gc
+from random import shuffle, sample
+from bzrlib.plugins.index2.btree_index import BTreeGraphIndex, BTreeBuilder
+from bzrlib.plugins.index2 import btree_index
+from bzrlib.index import GraphIndex, GraphIndexBuilder, CombinedGraphIndex
+from bzrlib.knit import _KnitGraphIndex, KnitVersionedFiles
+from bzrlib.graph import Graph
+from bzrlib.commands import Command
+from bzrlib.option import Option, ListOption
+from bzrlib.lsprof import profile
+
+
+key_count = 0
+
+def drop_caches():
+ """Call a script called drop-caches which should flush os caches."""
+ os.system('drop-caches')
+
+def iter_indices(names, t, factory):
+ for name in sorted(names):
+ size = t.stat(name).st_size
+ yield name, factory(t, name, size)
+
+def copy_indices(indices, target, factory):
+ size = 0
+ for name, index in indices:
+ index.key_count()
+ key_length = index._key_length
+ ref_lists = index.node_ref_lists
+ builder = factory(reference_lists=ref_lists, key_elements=key_length)
+ for node in index.iter_all_entries():
+ builder.add_node(*node[1:])
+ size += target.put_file(name, builder.finish())
+ return size
+
+
+def profile_fixture(class_name, fixture_name, fixture, *args, **kwargs):
+ """Profile a fixture."""
+ _, stats = profile(fixture, *args, **kwargs)
+ output = file(class_name + '.' + fixture_name + '.callgrind', 'wb')
+ stats.calltree(output)
+ output.close()
+
+
+def run_fixture(class_name, fixture_name, fixture, *args, **kwargs):
+ fixture(*args, **kwargs)
+
+
+def time_index(names, source, factory, builder, target, label, name_keys,
+ text_keys, use_drop_cache, fixtures, lsprof):
+ if use_drop_cache:
+ drop_cache = drop_caches
+ else:
+ drop_cache = lambda:None
+ if lsprof:
+ run = profile_fixture
+ else:
+ run = run_fixture
+ drop_cache()
+# Baseline time
+ now = time.time()
+ for name, index in iter_indices(names, source, GraphIndex):
+ index.key_count()
+ for node in index.iter_all_entries():
+ pass
+ finish = time.time()
+ print "Baseline overhead: Read %d keys in %0.3f" % (key_count, finish - now)
+# write time
+ read_time = finish - now
+ drop_cache()
+ now = time.time()
+ length = copy_indices(iter_indices(names, source, GraphIndex),
+ target, builder)
+ finish = time.time()
+ print "%s: Wrote %d bytes in %0.3f" % (label, length, finish - now - read_time)
+# iter_all
+ if 'all' in fixtures:
+ drop_cache()
+ now = time.time()
+ for name, index in iter_indices(names, target, factory):
+ for item in index.iter_all_entries():
+ pass
+ finish = time.time()
+ print "%s: iter_all_entries in %0.3f" % (label, finish - now)
+# random shuffle, all keys (name_keys comes preshuffled)
+ if 'shuffle' in fixtures:
+ drop_cache()
+ now = time.time()
+ for name, index in iter_indices(names, target, factory):
+ order = name_keys[name]
+ shuffle(order)
+ for key in order:
+ index.iter_entries([key]).next()
+ finish = time.time()
+ print "%s: iter_random_one in %0.3f" % (label, finish - now)
+# text extraction simulation (follow a compression graph) for text_keys
+ if 'text' in fixtures:
+ text_names = [name for name in names if name.endswith('.tix')]
+ text_indices = [index for name, index in iter_indices(text_names, target, factory)]
+ text_index = CombinedGraphIndex(text_indices)
+ btree_index.bloom_hits = 0
+ drop_cache()
+ now = time.time()
+ # VersionedFiles can do multi-key extraction, so we can use that.
+
+ index = _KnitGraphIndex(text_index, lambda:True, deltas=True, parents=True)
+ vf = KnitVersionedFiles(index, None)
+ vf._get_components_positions(text_keys)
+ finish = time.time()
+ print "%s: get_components_positions(%d) in %0.3f, bloom(%d)" % (label, len(text_keys),
+ finish - now, btree_index.bloom_hits)
+# follow a revision graph
+ if 'revision' in fixtures:
+ rev_names = [name for name in names if name.endswith('.rix')]
+ rev_indices = [index for name, index in iter_indices(rev_names, target, factory)]
+ rev_index = CombinedGraphIndex(rev_indices)
+ # pick the lexicographically middle revision, just because.
+ all_revs = sorted(node[1] for node in rev_index.iter_all_entries())
+ revid = all_revs[len(all_revs)/2]
+ # reopen indices
+ rev_indices = [index for name, index in iter_indices(rev_names, target, factory)]
+ rev_index = CombinedGraphIndex(rev_indices)
+ btree_index.bloom_hits = 0
+ drop_cache()
+ now = time.time()
+ # VersionedFiles can do multi-key extraction, so we can use that.
+ index = _KnitGraphIndex(rev_index, lambda:True, deltas=True, parents=True)
+ vf = KnitVersionedFiles(index, None)
+ graph = Graph(vf)
+ search = graph._make_breadth_first_searcher([revid])
+ while True:
+ try:
+ search.next()
+ except StopIteration:
+ break
+ finish = time.time()
+ print "%s: search_from_revid in %0.3f, bloom(%d)" % (label, finish - now, btree_index.bloom_hits)
+# pathological miss-lookups: ask each index of each type for the keys held by the
+# others of that type
+ if 'miss' in fixtures:
+ run(label, 'miss', miss_torture, label, drop_cache, names, target, factory, name_keys)
+ print "%s: -------Done---------" % (label,)
+
+def miss_torture(label, drop_cache, names, target, factory, name_keys):
+ btree_index.bloom_hits = 0
+ drop_cache()
+ now = time.time()
+ for name, index in iter_indices(names, target, factory):
+ suffix = name[-4:]
+ other_keys = set()
+ for othername, keys in name_keys.items():
+ if othername != name and othername.endswith(suffix):
+ other_keys.update(keys)
+ for node in index.iter_entries(other_keys):
+ pass
+ finish = time.time()
+ print "%s: miss torture in %0.3f, bloom(%d)" % (label, finish - now, btree_index.bloom_hits)
+
+
+class cmd_indexbench(Command):
+ """Benchmark index performance."""
+
+ takes_options = [
+ Option('graph', help='bench the GraphIndex class'),
+ Option('btree', help='bench the BTreeGraphIndex class'),
+ Option('drop-cache', help='drop caches between fixtures'),
+ ListOption('fixture', type=str,
+ help='fixtures to test: one of all, shuffle, text, revision, miss'),
+ # 'lspro' because --lsprof is a global option, and the two would interfere.
+ Option('lspro', help='generate class.fixture.callgrind lsprof files'),
+ ]
+
+
+ takes_args = ['sample_repository']
+
+ def run(self, sample_repository, graph=True, btree=True, drop_cache=False,
+ fixture=None, lspro=False):
+ if not fixture:
+ fixture = ['all', 'shuffle', 'text', 'revision', 'miss']
+ source = get_transport(sample_repository)
+ source = source.clone('.bzr/repository/indices')
+ names = source.list_dir('.')
+ name_keys = {}
+ text_keys = set()
+ for name, index in iter_indices(names, source, GraphIndex):
+ name_keys[name] = []
+ for item in index.iter_all_entries():
+ name_keys[name].append(item[1])
+ # Randomise for later tests
+ shuffle(name_keys[name])
+ if name.endswith('.tix'):
+ text_keys.update(name_keys[name])
+ text_keys = list(text_keys)
+ # pick 2000 text keys to pretend to reconstruct
+ text_keys = sample(text_keys, 2000)
+ global key_count
+ key_count = sum(map(len, name_keys.itervalues()))
+
+ if btree:
+ self.test_class(names, source, BTreeGraphIndex, BTreeBuilder,
+ 'BTreeIndex', name_keys, text_keys, drop_cache, fixture, lspro)
+ if graph:
+ self.test_class(names, source, GraphIndex, GraphIndexBuilder,
+ 'GraphIndex', name_keys, text_keys, drop_cache, fixture, lspro)
+
+ def test_class(self, names, source, factory, builder, label, name_keys,
+ text_keys, drop_cache, fixtures, lsprof):
+ workingdir = tempfile.mkdtemp()
+ t = get_transport(workingdir)
+ try:
+ time_index(names, source, factory, builder, t, label, name_keys,
+ text_keys, drop_cache, fixtures, lsprof)
+ finally:
+ t.delete_tree('.')
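
With the registration added in __init__.py above, the benchmark becomes a regular bzr command. A hypothetical invocation (the repository path is a placeholder; ListOption means --fixture may be repeated, and --drop-cache assumes a drop-caches script on $PATH):

    bzr indexbench --drop-cache --fixture=shuffle --fixture=miss ~/sample-repo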
=== modified file 'tests/test_btree_index.py'
--- a/tests/test_btree_index.py 2008-07-02 05:40:02 +0000
+++ b/tests/test_btree_index.py 2008-07-02 14:45:38 +0000
@@ -22,6 +22,7 @@
from bzrlib import tests
from bzrlib.plugins import index2
from bzrlib.plugins.index2 import errors, btree_index
+from bzrlib.plugins.pybloom.pybloom import BloomSHA1
from bzrlib.tests import TestCaseWithTransport
from bzrlib.transport import get_transport
@@ -148,10 +149,19 @@
leaf1 = content[4096:8192]
leaf2 = content[8192:]
root_bytes = zlib.decompress(root)
+ # Create a little bloom by hand
+ bloom = BloomSHA1(256 * 8)
+ # set the same bits the builder will set
+ for node in nodes:
+ bloom.insert('\x00'.join(node[0]))
+ # get bytes
+ bloom_bytes = bloom._array.tostring()
expected_root = (
"type=internal\n"
"offset=0\n"
"505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505505\n"
+ ":bloom:\n"
+ + bloom_bytes
)
self.assertEqual(expected_root, root_bytes)
# We already know serialisation works for leaves, check key selection:
@@ -165,7 +175,7 @@
self.assertEqual(800 - 451, len(node.keys))
self.assertEqual(sorted_node_keys[451:], sorted(node.keys))
- def test_second_internal_node_pointer(self):
+ def test_three_level_tree_details(self):
# The left most pointer in the second internal node in a row should
# pointer to the second node that the internal node is for, _not_
# the first, otherwise the first node overlaps with the last node of
@@ -184,6 +194,8 @@
self.assertEqual(3, len(index._row_lengths),
"Not enough rows: %r" % index._row_lengths)
internal_node1 = index._get_node(1)
+ # Must have a bloom on the first node.
+ self.assertNotEqual(None, internal_node1.bloom)
internal_node2 = index._get_node(2)
# The left most node node2 points at should be one after the right most node pointed at by
# node1.
@@ -194,6 +206,17 @@
pos = index._row_lengths[0] + index._row_lengths[1] + internal_node2.offset + 1
leaf = index._get_node(pos)
self.assertTrue(internal_node2.keys[0] in leaf.keys)
+ # Check the bloom filter for internal_node2: all the keys in the leaf
+ # should appear to be present
+ for key in leaf.keys:
+ self.assertTrue('\x00'.join(key) in internal_node2.bloom)
+ # Check the bloom filter for internal_node1 with its first two nodes in
+ # the same fashion.
+ for offset in [0, 1]:
+ pos = index._row_lengths[0] + index._row_lengths[1] + offset
+ leaf = index._get_node(pos)
+ for key in leaf.keys:
+ self.assertTrue('\x00'.join(key) in internal_node1.bloom)
def test_2_leaves_2_2(self):
builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2)
@@ -210,12 +233,21 @@
leaf1 = content[4096:8192]
leaf2 = content[8192:]
root_bytes = zlib.decompress(root)
+ # Create a little bloom by hand
+ bloom = BloomSHA1(256 * 8)
+ # set the same bits the builder will set
+ for node in nodes:
+ bloom.insert('\x00'.join(node[0]))
+ # get bytes
+ bloom_bytes = bloom._array.tostring()
expected_root = (
"type=internal\n"
"offset=0\n"
"1111111111111111111111111111111111111111\x00"
"127127127127127127127127127127127127127127127127127127127"
"127127127127127127127127127127127127127127127127127127127127127\n"
+ ":bloom:\n"
+ + bloom_bytes
)
self.assertEqual(expected_root, root_bytes)
# We assume the other leaf nodes have been written correctly - layering FTW.
@@ -438,6 +470,26 @@
('readv', 'index', [(4096, 4096), ], False, None)],
transport._activity)
+ def test_iter_entries_skips_bloom_miss(self):
+ # On a miss, if the bloom says no, no IO is performed.
+ builder = btree_index.BTreeBuilder(key_elements=1)
+ # We need a two level index.
+ nodes = self.make_nodes(1200, 1, 0)
+ for node in nodes:
+ builder.add_node(*node)
+ transport = get_transport('trace+' + self.get_url(''))
+ size = transport.put_file('index', builder.finish())
+ del builder
+ index = btree_index.BTreeGraphIndex(transport, 'index', size)
+ del transport._activity[:]
+ self.assertEqual([], transport._activity)
+ # search for one key
+ found_nodes = list(index.iter_entries([("foo",)]))
+ self.assertEqual([], found_nodes)
+ # Should have read the root node only
+ self.assertEqual([('readv', 'index', [(0, 4096)], False, None)],
+ transport._activity)
+
class TestBTreeNodes(BTreeTestCase):
@@ -486,18 +538,50 @@
"1111111111111111111111111111111111111111\n"
"2222222222222222222222222222222222222222\n"
"3333333333333333333333333333333333333333\n"
- "4444444444444444444444444444444444444444\n")
- node = btree_index._InternalNode(node_bytes)
- # We want to bisect to find the right children from this node, so a
- # vector is most useful.
- self.assertEqual([
- ("0000000000000000000000000000000000000000",),
- ("1111111111111111111111111111111111111111",),
- ("2222222222222222222222222222222222222222",),
- ("3333333333333333333333333333333333333333",),
- ("4444444444444444444444444444444444444444",),
- ], node.keys)
- self.assertEqual(1, node.offset)
+ "4444444444444444444444444444444444444444\n"
+ )
+ node = btree_index._InternalNode(node_bytes)
+ # We want to bisect to find the right children from this node, so a
+ # vector is most useful.
+ self.assertEqual([
+ ("0000000000000000000000000000000000000000",),
+ ("1111111111111111111111111111111111111111",),
+ ("2222222222222222222222222222222222222222",),
+ ("3333333333333333333333333333333333333333",),
+ ("4444444444444444444444444444444444444444",),
+ ], node.keys)
+ self.assertEqual(1, node.offset)
+ self.assertEqual(None, node.bloom)
+
+ def test_InternalNode_bloom(self):
+ # Create a little bloom so we can check it is parsed right
+ bloom = BloomSHA1(256 * 8)
+ # set a bit to test
+ bloom.insert("foo-bar")
+ # get bytes
+ bloom_bytes = bloom._array.tostring()
+ node_bytes = ("type=internal\n"
+ "offset=1\n"
+ "0000000000000000000000000000000000000000\n"
+ "1111111111111111111111111111111111111111\n"
+ "2222222222222222222222222222222222222222\n"
+ "3333333333333333333333333333333333333333\n"
+ "4444444444444444444444444444444444444444\n"
+ ":bloom:\n"
+ + bloom_bytes)
+ node = btree_index._InternalNode(node_bytes)
+ # We want to bisect to find the right children from this node, so a
+ # vector is most useful.
+ self.assertEqual([
+ ("0000000000000000000000000000000000000000",),
+ ("1111111111111111111111111111111111111111",),
+ ("2222222222222222222222222222222222222222",),
+ ("3333333333333333333333333333333333333333",),
+ ("4444444444444444444444444444444444444444",),
+ ], node.keys)
+ self.assertEqual(1, node.offset)
+ self.assertTrue("foo-bar" in node.bloom)
+ self.assertFalse("other" in node.bloom)
def test_LeafNode_2_2(self):
node_bytes = ("type=leaf\n"
=== modified file 'tests/test_chunk_writer.py'
--- a/tests/test_chunk_writer.py 2008-07-01 18:38:55 +0000
+++ b/tests/test_chunk_writer.py 2008-07-02 06:08:10 +0000
@@ -64,3 +64,24 @@
self.assertEqualDiff(expected_bytes, node_bytes)
# And the line that failed should have been saved for us
self.assertEqual(lines[46], unused)
+
+ def test_too_much_data_preserves_reserve_space(self):
+ # Generate enough data to exceed 4K
+ lines = []
+ for group in range(48):
+ offset = group * 50
+ numbers = range(offset, offset + 50)
+ # Create a line with this group
+ lines.append(''.join(map(str, numbers)) + '\n')
+ writer = chunk_writer.ChunkWriter(4096, 256)
+ for line in lines:
+ if writer.write(line):
+ break
+ self.assertFalse(writer.write_reserved("A"*256))
+ bytes_list, unused = writer.finish()
+ node_bytes = self.check_chunk(bytes_list, 4096)
+ # the first 44 lines should have been added
+ expected_bytes = ''.join(lines[:44]) + "A"*256
+ self.assertEqualDiff(expected_bytes, node_bytes)
+ # And the line that failed should have been saved for us
+ self.assertEqual(lines[44], unused)