Rev 20: New disk format, one step closer to being in packs, and fixes the issue with overflowing bzrlib's index bisection capabilities. in http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk

Robert Collins robertc at robertcollins.net
Wed Jun 11 06:16:33 BST 2008


At http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk

------------------------------------------------------------
revno: 20
revision-id: robertc at robertcollins.net-20080611051632-qqi4nd9g1500702j
parent: robertc at robertcollins.net-20080611040706-n0orfqtmjah3k6d4
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Wed 2008-06-11 15:16:32 +1000
message:
  New disk format, one step closer to being in packs, and fixes the issue with overflowing bzrlib's index bisection capabilities.
modified:
  BUGS                           bugs-20080609101902-m23i5z2ojdgkeyof-1
  DESIGN                         design-20080608072426-vjoj110dtykfyb7g-1
  README                         readme-20080608052041-z5bahsl8kwl0uf4x-3
  __init__.py                    __init__.py-20080608052041-z5bahsl8kwl0uf4x-4
  index.py                       index.py-20080608055509-hnimeek7q8tctkqf-2
  tests/test_index.py            test_index.py-20080608055509-hnimeek7q8tctkqf-4
=== modified file 'BUGS'
--- a/BUGS	2008-06-09 12:34:11 +0000
+++ b/BUGS	2008-06-11 05:16:32 +0000
@@ -3,6 +3,6 @@
 
 Some key caveats though (not bugs per se):
 
- - scaling: The current disk format will fail when a single term has more than
-   2000 (or so) references, because the bzr index bisection logic will not 
-   find the end of the row. 
+ - scaling: The current disk format creates thousands of small files on disk.
+   This may impede performance on some file systems. When these are put into
+   a single pack, this will be rectified.

=== modified file 'DESIGN'
--- a/DESIGN	2008-06-09 10:17:16 +0000
+++ b/DESIGN	2008-06-11 05:16:32 +0000
@@ -67,38 +67,25 @@
 
 A names file listing the indices.
 
-First cut
-=========
-
-This is designed to get the system up and running with least effort, not to
-scale.
-
-Each index is then GraphIndex(1, 3) containing:
-('x', 'x', term) -> posting list
-('x', 'r', revid) -> () # indicates revid is indexed.
-
-Note that this contrains terms to be valid index element keys, but we can
-always encode during the index operation if needed.
-
-For scaling we need to handle extremely long posting lists.
-
-Second cut
-==========
+Second cut - TODO
+=================
 
 Each name in the names file is a pack, its length, and two read vectors for the
 terms and revisions lists. The pack contains a series of posting lists trailed
-by a top level index of terms and indexed revisions.
+by a top level index of document ids, terms and indexed revisions.
 
 Posting list
 ------------
 
-A GraphIndex(0, 3)
-
-(type, file_id, versioned_id) -> nothing
-
-Note that a single index has the full posting list for any document present in
-the index - this allows optimisations where we do not examine document names 
-during discarding the document.
+A GraphIndex(0, 1)
+(doc_id -> "")
+
+Document ids
+------------
+
+A GraphIndex(0, 1)
+(doc_id -> doc_key)
+
 
 Terms index
 -----------
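
To make the new layout concrete, here is a minimal sketch of the lookup chain
it implies, using plain Python dicts in place of the real GraphIndex objects;
all names and sample values are illustrative, not the plugin's API:

    # terms index: term -> (term_id, posting_count, posting_length);
    # on disk this is serialised as "term_id count length".
    terms_index = {"bzr": ("0", 2, 123)}
    # one posting list per term id, stored beside the component as
    # "<component_name>.<term_id>": doc_id -> "" (presence only).
    posting_lists = {"0": {"5": "", "9": ""}}
    # document ids index: doc_id -> doc_key.
    doc_index = {"5": ("r", "", "revid-a"), "9": ("r", "", "revid-b")}

    def lookup(term):
        """Resolve a term to document keys via the three mappings."""
        term_id, count, length = terms_index[term]
        return [doc_index[doc_id] for doc_id in posting_lists[term_id]]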

=== modified file 'README'
--- a/README	2008-06-08 05:20:44 +0000
+++ b/README	2008-06-11 05:16:32 +0000
@@ -30,8 +30,10 @@
 
 `bzr search TERM` will return the list of documents TERM occurs in.
 
+`bzr index [URL]` will create an index of a branch.
+
 Documentation
 =============
 
-See `bzr help search` for help on the search command.
+See `bzr help search` or `bzr help plugins/search`.
 

=== modified file '__init__.py'
--- a/__init__.py	2008-06-08 06:21:27 +0000
+++ b/__init__.py	2008-06-11 05:16:32 +0000
@@ -17,7 +17,17 @@
 
 """search is a bzr plugin for searching bzr content.
 
-The command ``search`` will perform a search.
+Commands
+========
+
+`bzr search TERM` will return the list of documents TERM occurs in.
+
+`bzr index [URL]` will create an index of a branch.
+
+Documentation
+=============
+
+See `bzr help search` or `bzr help plugins/search`.
 """
 
 import bzrlib.commands

=== modified file 'index.py'
--- a/index.py	2008-06-11 04:07:06 +0000
+++ b/index.py	2008-06-11 05:16:32 +0000
@@ -166,10 +166,20 @@
         """
         self._refresh_indices()
         result = {}
-        for index in self._term_indices:
+        for value, component in self._current_names.values():
+            index = component.term_index
             for node in index.iter_all_entries():
+                # XXX: Duplicated logic with search().
                 term = node[1][0]
-                doc_ids = [(node[0], doc_id) for doc_id in node[2].split(' ')]
+                term_id, posting_count, posting_length = node[2].split(" ")
+                posting_count = int(posting_count)
+                posting_length = int(posting_length)
+                post_name = component.name + "." + term_id
+                post_index = GraphIndex(self._indices_transport, post_name,
+                    posting_length)
+                doc_ids = set([node[1][0] for node in
+                    post_index.iter_all_entries()])
+                doc_ids = [(index, doc_id) for doc_id in doc_ids]
                 posting_list = result.setdefault(term, set())
                 posting_list.update(self._document_ids_to_keys(doc_ids))
         return result.iteritems()
@@ -313,22 +323,37 @@
         found_documents = []
         # Use a set to remove duplicates
         term_keys = set([(term,) for term in termlist])
+        if len(term_keys) == 0:
+            return
         for value, component in self._current_names.values():
             term_index = component.term_index
-            doc_references = {}
+            # TODO: push into Component
+            # TODO: use a deque?
+            term_info = []
             for node in term_index.iter_entries(term_keys):
-                doc_references[node[1]] = node[2]
-            if len(doc_references) != len(term_keys):
-                # One or more terms missing
+                term_id, posting_count, posting_length = node[2].split(" ")
+                term_info.append((int(posting_count), term_id,
+                    int(posting_length)))
+            if len(term_info) != len(term_keys):
+                # One or more terms missing - no hits are possible.
                 continue
-            # Maybe not the cheapest, but certainly easy to read the code for:
-            common_doc_ids = None
-            for reference_string in doc_references.itervalues():
-                reference_set = set(reference_string.split(' '))
-                if common_doc_ids is None:
-                    common_doc_ids = reference_set
-                else:
-                    common_doc_ids.difference_update(reference_set)
+            # load the first document list: 
+            _, term_id, posting_length = term_info.pop(0)
+            post_name = component.name + "." + term_id
+            post_index = GraphIndex(self._indices_transport, post_name,
+                posting_length)
+            common_doc_keys = set([node[1] for node in
+                post_index.iter_all_entries()])
+            # Now we whittle down the nodes we need - still going in sorted
+            # order. (possibly doing concurrent reduction would be better).
+            while common_doc_keys and term_info:
+                _, term_id, posting_length = term_info.pop(0)
+                post_name = component.name + "." + term_id
+                post_index = GraphIndex(self._indices_transport, post_name,
+                    posting_length)
+                common_doc_keys = set([node[1] for node in
+                    post_index.iter_entries(common_doc_keys)])
+            common_doc_ids = [key[0] for key in common_doc_keys]
             found_documents = [(term_index, doc_id) for doc_id in
                 common_doc_ids]
             for doc_key in self._document_ids_to_keys(found_documents):
@@ -417,6 +442,7 @@
         self.revision_index = rev_index
         self.term_index = term_index
         self.document_index = doc_index
+        self.name = name
 
 
 class ComponentIndexBuilder(object):
@@ -425,8 +451,9 @@
     def __init__(self):
         self.document_index = InMemoryGraphIndex(0, 1)
         self._document_ids = {}
-        self.term_index = InMemoryGraphIndex(0, 1)
+        self.terms = {}
         self.revision_index = InMemoryGraphIndex(0, 1)
+        self.posting_lists = {}
 
     def add_document(self, document_key):
         """Add a document key to the index.
@@ -449,18 +476,44 @@
             indexes.
         :return: None.
         """
-        document_ids = []
-        for document_key in posting_list:
-            document_ids.append(self.add_document(document_key))
-        document_ids.sort()
         if type(term) != str:
             raise ValueError("terms must be bytestrings at this layer %r" % term)
-        self.term_index.add_node((term,), " ".join(document_ids), ())
+        term_id = self.term_id(term)
+        if term_id is None:
+            term_id = str(len(self.terms))
+            self.terms[term] = term_id
+            self.posting_lists[term_id] = set()
+        existing_posting_list = self.posting_lists[term_id]
+        for document_key in posting_list:
+            existing_posting_list.add(self.add_document(document_key))
 
     def add_revision(self, revision_id):
         """List a revision as having been indexed by this index."""
         self.revision_index.add_node((revision_id,), '',  ())
 
+    def posting_list(self, term):
+        """Return an iterable of document ids for term.
+
+        Unindexed terms return an empty iterator.
+        """
+        term_id = self.term_id(term)
+        if term_id is None:
+            return []
+        else:
+            return self.posting_lists[term_id]
+
+    def term_id(self, term):
+        """Return the term id of term. 
+
+        :param term: The term to get an id for.
+        :return: None for a term not in the component, otherwise the string
+            term id.
+        """
+        try:
+            return self.terms[term]
+        except KeyError:
+            return None
+
     def upload_index(self, upload_transport):
         """Upload the index in preparation for insertion.
 
@@ -476,13 +529,30 @@
             index_bytes)
         rev_length = len(index_bytes)
         # XXX: We should be able to do:
-        # term_length = upload_transport.put_file_non_atomic(
-        #     index_name + '.tix', index.term_index.finish())
         # doc_length = upload_transport.put_file_non_atomic(
         #     index_name + '.dix', index.document_index.finish())
         # but, put_file_non_atomic is not interface tested to return the byte
         # length.
-        index_bytes = self.term_index.finish().read()
+        # generate a new term index with the length of the serialised posting
+        # lists.
+        elements = []
+        term_index = InMemoryGraphIndex(0, 1)
+        for term, term_id in self.terms.iteritems():
+            posting_list = self.posting_lists[term_id]
+            post_index = InMemoryGraphIndex(0, 1)
+            for doc_id in posting_list:
+                post_index.add_node((doc_id,), "", ())
+            posting_bytes = post_index.finish().read()
+            posting_name = index_name + "." + term_id
+            upload_transport.put_bytes_non_atomic(posting_name, posting_bytes)
+            elements.append(posting_name)
+            # How many document ids, and how long the file is (for use when we
+            # put these in packs).
+            term_value = "%s %d %d" % (term_id, len(posting_list),
+                len(posting_bytes))
+            term_index.add_node((term,), term_value, ())
+            del posting_bytes
+        index_bytes = term_index.finish().read()
         term_length = len(index_bytes)
         upload_transport.put_bytes_non_atomic(
             index_name + '.tix', index_bytes)
@@ -492,7 +562,6 @@
             index_name + '.dix', index_bytes)
         del index_bytes
         index_value = "%d %d %d" % (rev_length, term_length, doc_length)
-        elements = []
         for suffix in [".rix", ".dix", ".tix"]:
             elements.append(index_name + suffix)
         return index_name, index_value, elements
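
The reworked search() above no longer splits a single huge reference string
per term; it loads the first term's posting list and then queries each further
posting index only for the keys that are still candidates. A standalone sketch
of that whittling idea, with plain sets standing in for the per-term
GraphIndex files (illustrative, not the plugin's code):

    def intersect_postings(posting_sets):
        """Keep only the doc keys common to every posting list."""
        if not posting_sets:
            return set()
        # Seed with the first list; each later list only needs to be
        # probed for the survivors (iter_entries in the real code).
        common = set(posting_sets[0])
        for postings in posting_sets[1:]:
            common.intersection_update(postings)
            if not common:
                break  # no hits are possible
        return common

    # documents containing both terms:
    intersect_postings([set(["0", "1", "3"]), set(["0", "2", "3"])])
    # -> set(["0", "3"])

Consuming the smallest posting list first (the posting count recorded in each
term value) would shrink the candidate set as early as possible, which is
presumably what the stored count is for.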

=== modified file 'tests/test_index.py'
--- a/tests/test_index.py	2008-06-09 12:34:11 +0000
+++ b/tests/test_index.py	2008-06-11 05:16:32 +0000
@@ -165,10 +165,8 @@
         nodes = sorted(list(doc_index.iter_all_entries()))
         self.assertEqual([(doc_index, ("0",), "r  revid"),
             (doc_index, ("1",), "r  other-revid")], nodes)
-        term_index = builder.term_index
         # and the term refers to document ids
-        self.assertEqual([(term_index, ("term1",), "0 1")],
-            sorted(list(term_index.iter_all_entries())))
+        self.assertEqual(set(["0", "1"]), set(builder.posting_list("term1")))
         # adding a new term adds unique documents
         builder.add_term("term2", [('r', '', 'revid'), ('r', '', 'third-revid')])
         nodes = sorted(list(doc_index.iter_all_entries()))
@@ -176,9 +174,20 @@
         self.assertEqual([(doc_index, ("0",), "r  revid"),
             (doc_index, ("1",), "r  other-revid"),
             (doc_index, ("2",), "r  third-revid")], nodes)
-        self.assertEqual([(term_index, ("term1",), "0 1"),
-            (term_index, ("term2",), "0 2")],
-            sorted(list(term_index.iter_all_entries())))
+        self.assertEqual(set(["0", "1"]), set(builder.posting_list("term1")))
+        self.assertEqual(set(["0", "2"]), set(builder.posting_list("term2")))
+        # adding a term twice extends the posting list rather than replacing it
+        # or erroring.
+        builder.add_term("term1", [('r', '', 'revid'), ('r', '', 'fourth-revid')])
+        nodes = sorted(list(doc_index.iter_all_entries()))
+        # and refers to the correct ids
+        self.assertEqual([(doc_index, ("0",), "r  revid"),
+            (doc_index, ("1",), "r  other-revid"),
+            (doc_index, ("2",), "r  third-revid"),
+            (doc_index, ("3",), "r  fourth-revid"),
+            ], nodes)
+        self.assertEqual(set(["0", "1", "3"]), set(builder.posting_list("term1")))
+        self.assertEqual(set(["0", "2"]), set(builder.posting_list("term2")))
 
     def test_add_revision(self):
         builder = index.ComponentIndexBuilder()
@@ -187,7 +196,6 @@
         builder.add_revision('foo')
         nodes = sorted(list(builder.document_index.iter_all_entries()))
         self.assertEqual([], nodes)
-        nodes = sorted(list(builder.term_index.iter_all_entries()))
-        self.assertEqual([], nodes)
+        self.assertEqual({}, builder.terms)
         nodes = sorted(list(builder.revision_index.iter_all_entries()))
         self.assertEqual([(builder.revision_index, ("foo",), "")], nodes)
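
As a usage sketch of the builder behaviour these tests pin down (the import
path is an assumption; adjust it to wherever the plugin's index module is
importable from):

    from bzrlib.plugins.search import index

    builder = index.ComponentIndexBuilder()
    builder.add_term("term1", [('r', '', 'revid')])
    builder.add_term("term1", [('r', '', 'other-revid')])
    # add_term extends an existing posting list rather than replacing it:
    set(builder.posting_list("term1"))     # -> set(["0", "1"])
    # unindexed terms have an empty posting list and no term id:
    list(builder.posting_list("missing"))  # -> []
    builder.term_id("missing")             # -> None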



