Rev 20: New disk format, one step closer to being in packs, and fixes the issue with overflowing bzrlib's index bisection capabilities. in http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk
Robert Collins
robertc at robertcollins.net
Wed Jun 11 06:16:33 BST 2008
At http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk
------------------------------------------------------------
revno: 20
revision-id: robertc at robertcollins.net-20080611051632-qqi4nd9g1500702j
parent: robertc at robertcollins.net-20080611040706-n0orfqtmjah3k6d4
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-06-11 15:16:32 +1000
message:
New disk format, one step closer to being in packs, and fixes the issue with overflowing bzrlib's index bisection capabilities.
modified:
BUGS bugs-20080609101902-m23i5z2ojdgkeyof-1
DESIGN design-20080608072426-vjoj110dtykfyb7g-1
README readme-20080608052041-z5bahsl8kwl0uf4x-3
__init__.py __init__.py-20080608052041-z5bahsl8kwl0uf4x-4
index.py index.py-20080608055509-hnimeek7q8tctkqf-2
tests/test_index.py test_index.py-20080608055509-hnimeek7q8tctkqf-4
=== modified file 'BUGS'
--- a/BUGS 2008-06-09 12:34:11 +0000
+++ b/BUGS 2008-06-11 05:16:32 +0000
@@ -3,6 +3,6 @@
Some key caveats though (not bugs per se):
- - scaling: The current disk format will fail when a single term has more than
- 2000 (or so) references, because the bzr index bisection logic will not
- find the end of the row.
+ - scaling: The current disk format creates thousands of small files on disk.
+ This may impede performance on some file systems. When these are put into
+ a single pack, this will be rectified.
=== modified file 'DESIGN'
--- a/DESIGN 2008-06-09 10:17:16 +0000
+++ b/DESIGN 2008-06-11 05:16:32 +0000
@@ -67,38 +67,25 @@
A names file listing the indices.
-First cut
-=========
-
-This is designed to get the system up and running with least effort, not to
-scale.
-
-Each index is then GraphIndex(1, 3) containing:
-('x', 'x', term) -> posting list
-('x', 'r', revid) -> () # indicates revid is indexed.
-
-Note that this contrains terms to be valid index element keys, but we can
-always encode during the index operation if needed.
-
-For scaling we need to handle extremely long posting lists.
-
-Second cut
-==========
+Second cut - TODO
+=================
Each name in the names file is a pack, its length, and two read vectors for the
terms and revisions lists. The pack contains a series of posting lists trailed
-by a top level index of terms and indexed revisions.
+by a top level index of document ids, terms and indexed revisions.
Posting list
------------
-A GraphIndex(0, 3)
-
-(type, file_id, versioned_id) -> nothing
-
-Note that a single index has the full posting list for any document present in
-the index - this allows optimisations where we do not examine document names
-during discarding the document.
+A GraphIndex(0, 1)
+(doc_id -> "")
+
+Document ids
+------------
+
+A GraphIndex(0, 1)
+(doc_id -> doc_key)
+
Terms index
-----------
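
As a concrete illustration of the layout described above, here is a minimal,
hypothetical sketch built with the same bzrlib index API this commit uses
(InMemoryGraphIndex, put_bytes_non_atomic). The "demo" component name, the
memory transport and the sample keys are illustrative only, not part of the
plugin:

  # Sketch only: writes one posting list, the terms index and the
  # document-id index for a tiny component named "demo".
  from bzrlib.index import InMemoryGraphIndex
  from bzrlib.transport import get_transport

  transport = get_transport("memory:///")
  index_name = "demo"

  # Posting list for one term: a GraphIndex(0, 1) of doc_id -> "".
  posting = InMemoryGraphIndex(0, 1)
  for doc_id in ["0", "1"]:
      posting.add_node((doc_id,), "", ())
  posting_bytes = posting.finish().read()
  transport.put_bytes_non_atomic(index_name + ".0", posting_bytes)

  # Terms index: term -> "term_id posting_count posting_length"; the
  # recorded length lets a reader open the posting list directly.
  terms = InMemoryGraphIndex(0, 1)
  terms.add_node(("term1",), "0 2 %d" % len(posting_bytes), ())
  transport.put_bytes_non_atomic(index_name + ".tix",
      terms.finish().read())

  # Document ids: doc_id -> doc_key ("r revid" is a revision document).
  docs = InMemoryGraphIndex(0, 1)
  docs.add_node(("0",), "r revid", ())
  docs.add_node(("1",), "r other-revid", ())
  transport.put_bytes_non_atomic(index_name + ".dix",
      docs.finish().read())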
=== modified file 'README'
--- a/README 2008-06-08 05:20:44 +0000
+++ b/README 2008-06-11 05:16:32 +0000
@@ -30,8 +30,10 @@
`bzr search TERM` will return the list of documents TERM occurs in.
+`bzr index [URL]` will create an index of a branch.
+
Documentation
=============
-See `bzr help search` for help on the search command.
+See `bzr help search` or `bzr help plugins/search`.
=== modified file '__init__.py'
--- a/__init__.py 2008-06-08 06:21:27 +0000
+++ b/__init__.py 2008-06-11 05:16:32 +0000
@@ -17,7 +17,17 @@
"""search is a bzr plugin for searching bzr content.
-The command ``search`` will perform a search.
+Commands
+========
+
+`bzr search TERM` will return the list of documents TERM occurs in.
+
+`bzr index [URL]` will create an index of a branch.
+
+Documentation
+=============
+
+See `bzr help search` or `bzr help plugins/search`.
"""
import bzrlib.commands
=== modified file 'index.py'
--- a/index.py 2008-06-11 04:07:06 +0000
+++ b/index.py 2008-06-11 05:16:32 +0000
@@ -166,10 +166,20 @@
"""
self._refresh_indices()
result = {}
- for index in self._term_indices:
+ for value, component in self._current_names.values():
+ index = component.term_index
for node in index.iter_all_entries():
+ # XXX: Duplicated logic with search().
term = node[1][0]
- doc_ids = [(node[0], doc_id) for doc_id in node[2].split(' ')]
+ term_id, posting_count, posting_length = node[2].split(" ")
+ posting_count = int(posting_count)
+ posting_length = int(posting_length)
+ post_name = component.name + "." + term_id
+ post_index = GraphIndex(self._indices_transport, post_name,
+ posting_length)
+ doc_ids = set([node[1][0] for node in
+ post_index.iter_all_entries()])
+ doc_ids = [(index, doc_id) for doc_id in doc_ids]
posting_list = result.setdefault(term, set())
posting_list.update(self._document_ids_to_keys(doc_ids))
return result.iteritems()
@@ -313,22 +323,37 @@
found_documents = []
# Use a set to remove duplicates
term_keys = set([(term,) for term in termlist])
+ if len(term_keys) == 0:
+ return
for value, component in self._current_names.values():
term_index = component.term_index
- doc_references = {}
+ # TODO: push into Component
+ # TODO: use a deque?
+ term_info = []
for node in term_index.iter_entries(term_keys):
- doc_references[node[1]] = node[2]
- if len(doc_references) != len(term_keys):
- # One or more terms missing
+ term_id, posting_count, posting_length = node[2].split(" ")
+ term_info.append((int(posting_count), term_id,
+ int(posting_length)))
+ if len(term_info) != len(term_keys):
+ # One or more terms missing - no hits are possible.
continue
- # Maybe not the cheapest, but certainly easy to read the code for:
- common_doc_ids = None
- for reference_string in doc_references.itervalues():
- reference_set = set(reference_string.split(' '))
- if common_doc_ids is None:
- common_doc_ids = reference_set
- else:
- common_doc_ids.difference_update(reference_set)
+ # load the first document list:
+ _, term_id, posting_length = term_info.pop(0)
+ post_name = component.name + "." + term_id
+ post_index = GraphIndex(self._indices_transport, post_name,
+ posting_length)
+ common_doc_keys = set([node[1] for node in
+ post_index.iter_all_entries()])
+ # Now we whittle down the nodes we need - still going in sorted
+ # order. (possibly doing concurrent reduction would be better).
+ while common_doc_keys and term_info:
+ _, term_id, posting_length = term_info.pop(0)
+ post_name = component.name + "." + term_id
+ post_index = GraphIndex(self._indices_transport, post_name,
+ posting_length)
+ common_doc_keys = set([node[1] for node in
+ post_index.iter_entries(common_doc_keys)])
+ common_doc_ids = [key[0] for key in common_doc_keys]
found_documents = [(term_index, doc_id) for doc_id in
common_doc_ids]
for doc_key in self._document_ids_to_keys(found_documents):
@@ -417,6 +442,7 @@
self.revision_index = rev_index
self.term_index = term_index
self.document_index = doc_index
+ self.name = name
class ComponentIndexBuilder(object):
@@ -425,8 +451,9 @@
def __init__(self):
self.document_index = InMemoryGraphIndex(0, 1)
self._document_ids = {}
- self.term_index = InMemoryGraphIndex(0, 1)
+ self.terms = {}
self.revision_index = InMemoryGraphIndex(0, 1)
+ self.posting_lists = {}
def add_document(self, document_key):
"""Add a document key to the index.
@@ -449,18 +476,44 @@
indexes.
:return: None.
"""
- document_ids = []
- for document_key in posting_list:
- document_ids.append(self.add_document(document_key))
- document_ids.sort()
if type(term) != str:
raise ValueError("terms must be bytestrings at this layer %r" % term)
- self.term_index.add_node((term,), " ".join(document_ids), ())
+ term_id = self.term_id(term)
+ if term_id is None:
+ term_id = str(len(self.terms))
+ self.terms[term] = term_id
+ self.posting_lists[term_id] = set()
+ existing_posting_list = self.posting_lists[term_id]
+ for document_key in posting_list:
+ existing_posting_list.add(self.add_document(document_key))
def add_revision(self, revision_id):
"""List a revision as having been indexed by this index."""
self.revision_index.add_node((revision_id,), '', ())
+ def posting_list(self, term):
+ """Return an iterable of document ids for term.
+
+ Unindexed terms return an empty iterator.
+ """
+ term_id = self.term_id(term)
+ if term_id is None:
+ return []
+ else:
+ return self.posting_lists[term_id]
+
+ def term_id(self, term):
+ """Return the term id of term.
+
+ :param term: The term to get an id for.
+ :return: None for a term not in the component, otherwise the string
+ term id.
+ """
+ try:
+ return self.terms[term]
+ except KeyError:
+ return None
+
def upload_index(self, upload_transport):
"""Upload the index in preparation for insertion.
@@ -476,13 +529,30 @@
index_bytes)
rev_length = len(index_bytes)
# XXX: We should be able to do:
- # term_length = upload_transport.put_file_non_atomic(
- # index_name + '.tix', index.term_index.finish())
# doc_length = upload_transport.put_file_non_atomic(
# index_name + '.dix', index.document_index.finish())
# but, put_file_non_atomic is not interface tested to return the byte
# length.
- index_bytes = self.term_index.finish().read()
+ # generate a new term index with the length of the serialised posting
+ # lists.
+ elements = []
+ term_index = InMemoryGraphIndex(0, 1)
+ for term, term_id in self.terms.iteritems():
+ posting_list = self.posting_lists[term_id]
+ post_index = InMemoryGraphIndex(0, 1)
+ for doc_id in posting_list:
+ post_index.add_node((doc_id,), "", ())
+ posting_bytes = post_index.finish().read()
+ posting_name = index_name + "." + term_id
+ upload_transport.put_bytes_non_atomic(posting_name, posting_bytes)
+ elements.append(posting_name)
+ # How many document ids, and how long the file is (for use when we
+ # put these in packs).
+ term_value = "%s %d %d" % (term_id, len(posting_list),
+ len(posting_bytes))
+ term_index.add_node((term,), term_value, ())
+ del posting_bytes
+ index_bytes = term_index.finish().read()
term_length = len(index_bytes)
upload_transport.put_bytes_non_atomic(
index_name + '.tix', index_bytes)
@@ -492,7 +562,6 @@
index_name + '.dix', index_bytes)
del index_bytes
index_value = "%d %d %d" % (rev_length, term_length, doc_length)
- elements = []
for suffix in [".rix", ".dix", ".tix"]:
elements.append(index_name + suffix)
return index_name, index_value, elements
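
To make the new search path above easier to follow in isolation, here is a
hedged, self-contained sketch of the whittling loop. It assumes term_info
holds the (posting_count, term_id, posting_length) triples built from the
.tix values; sorting so the shortest posting list is read first is an extra
refinement in this sketch, not something the commit itself does:

  from bzrlib.index import GraphIndex

  def intersect_posting_lists(transport, component_name, term_info):
      """Return the doc keys present in every term's posting list."""
      if not term_info:
          return set()
      term_info = sorted(term_info)  # cheapest (shortest) list first
      _, term_id, posting_length = term_info.pop(0)
      post_index = GraphIndex(transport, component_name + "." + term_id,
          posting_length)
      # Read the first posting list in full...
      common = set(node[1] for node in post_index.iter_all_entries())
      while common and term_info:
          _, term_id, posting_length = term_info.pop(0)
          post_index = GraphIndex(transport,
              component_name + "." + term_id, posting_length)
          # ...then bisect each later list for the surviving keys only;
          # iter_entries returns just the keys that exist, so the set
          # can only shrink.
          common = set(node[1] for node in post_index.iter_entries(common))
      return common

This is also why the overflow bug goes away: a posting list is no longer
stored as a single index value, so no row can outgrow bzrlib's index
bisection logic.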
=== modified file 'tests/test_index.py'
--- a/tests/test_index.py 2008-06-09 12:34:11 +0000
+++ b/tests/test_index.py 2008-06-11 05:16:32 +0000
@@ -165,10 +165,8 @@
nodes = sorted(list(doc_index.iter_all_entries()))
self.assertEqual([(doc_index, ("0",), "r revid"),
(doc_index, ("1",), "r other-revid")], nodes)
- term_index = builder.term_index
# and the term refers to document ids
- self.assertEqual([(term_index, ("term1",), "0 1")],
- sorted(list(term_index.iter_all_entries())))
+ self.assertEqual(set(["0", "1"]), set(builder.posting_list("term1")))
# adding a new term adds unique documents
builder.add_term("term2", [('r', '', 'revid'), ('r', '', 'third-revid')])
nodes = sorted(list(doc_index.iter_all_entries()))
@@ -176,9 +174,20 @@
self.assertEqual([(doc_index, ("0",), "r revid"),
(doc_index, ("1",), "r other-revid"),
(doc_index, ("2",), "r third-revid")], nodes)
- self.assertEqual([(term_index, ("term1",), "0 1"),
- (term_index, ("term2",), "0 2")],
- sorted(list(term_index.iter_all_entries())))
+ self.assertEqual(set(["0", "1"]), set(builder.posting_list("term1")))
+ self.assertEqual(set(["0", "2"]), set(builder.posting_list("term2")))
+ # adding a term twice extends the posting list rather than replacing it
+ # or erroring.
+ builder.add_term("term1", [('r', '', 'revid'), ('r', '', 'fourth-revid')])
+ nodes = sorted(list(doc_index.iter_all_entries()))
+ # and refers to the correct ids
+ self.assertEqual([(doc_index, ("0",), "r revid"),
+ (doc_index, ("1",), "r other-revid"),
+ (doc_index, ("2",), "r third-revid"),
+ (doc_index, ("3",), "r fourth-revid"),
+ ], nodes)
+ self.assertEqual(set(["0", "1", "3"]), set(builder.posting_list("term1")))
+ self.assertEqual(set(["0", "2"]), set(builder.posting_list("term2")))
def test_add_revision(self):
builder = index.ComponentIndexBuilder()
@@ -187,7 +196,6 @@
builder.add_revision('foo')
nodes = sorted(list(builder.document_index.iter_all_entries()))
self.assertEqual([], nodes)
- nodes = sorted(list(builder.term_index.iter_all_entries()))
- self.assertEqual([], nodes)
+ self.assertEqual({}, builder.terms)
nodes = sorted(list(builder.revision_index.iter_all_entries()))
self.assertEqual([(builder.revision_index, ("foo",), "")], nodes)
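
The updated tests also document the builder API informally; the following
hypothetical snippet (assuming the plugin is importable as
bzrlib.plugins.search) condenses what they assert:

  from bzrlib.plugins.search import index

  builder = index.ComponentIndexBuilder()
  builder.add_term("term1", [('r', '', 'revid'), ('r', '', 'other-revid')])
  # Adding the same term again extends its posting list; the shared
  # document key 'revid' keeps its existing doc id.
  builder.add_term("term1", [('r', '', 'revid'), ('r', '', 'fourth-revid')])
  assert set(builder.posting_list("term1")) == set(["0", "1", "2"])
  assert builder.term_id("term1") == "0"
  # Unindexed terms yield an empty posting list rather than an error.
  assert builder.posting_list("missing") == []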