Rev 25: Combine all index component files into a single .pack file, further reducing file system friction. in http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk
Robert Collins
robertc at robertcollins.net
Thu Jun 12 05:08:44 BST 2008
At http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk
------------------------------------------------------------
revno: 25
revision-id: robertc at robertcollins.net-20080612040843-zh3lsf88epwfhabd
parent: robertc at robertcollins.net-20080612013002-kl2rmh8imu8bknum
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Thu 2008-06-12 14:08:43 +1000
message:
Combine all index component files into a single .pack file, further reducing file system friction.
modified:
DESIGN design-20080608072426-vjoj110dtykfyb7g-1
index.py index.py-20080608055509-hnimeek7q8tctkqf-2
=== modified file 'DESIGN'
--- a/DESIGN 2008-06-12 01:30:02 +0000
+++ b/DESIGN 2008-06-12 04:08:43 +0000
@@ -26,8 +26,12 @@
The current file text indexing solution uses iter_lines_added_or_present_in -
which means either that more, or less, hits than are interesting (depending on
-what you consider interesting) are returned. Indexing bzr.dev spikes at nearly
-600MB. Indexing very large trees will likely thrash.
+what you consider interesting) are returned. Indexing bzr.dev stabilises at
+180MB for me. (Very) large trees have been observed to need 1GB of RAM.
+If thrashing occurs, change the group size in index.py down from 2500 revisions
+- the memory drop is roughly proportional to that figure. More index components
+cause problems of their own though: no component combining is done yet, so
+each component is searched sequentially.
Indexing non-text
=================
@@ -79,9 +83,12 @@
Second cut - in progress
========================
-Each name in the names file is a pack, its length, and two read vectors for the
-terms and revisions lists. The pack contains a series of posting lists trailed
-by a top level index of document ids, terms and indexed revisions.
+Each name in the names file refers to a pack; its value holds the start and
+length coordinates, within that pack, of the three top level indices that make
+up a component: the revisions index, the documents index and the terms index.
+These can be recreated by scanning a component's pack, should they be lost.
+
+The pack contains these top level indices and a set of posting list indices.
Posting list
------------
@@ -89,19 +96,39 @@
A GraphIndex(0, 1)
(doc_id -> "")
+This allows set based querying to identify document ids that are matches. If we
+have already selected (say) ids 45 and 200, there is no need to examine all the
+document ids in a particular posting list.
+
Document ids
------------
A GraphIndex(0, 1)
(doc_id -> doc_key)
+Document ids are a proxy for a specific document in the corpus. Actual corpus
+identifiers are long (tens of bytes), so using a simple serial number for each
+document we refer to keeps the index size compact. Once a search is complete,
+the selected document ids are resolved into document keys within the branch,
+which can then be used to show the results to users.
+
Terms index
-----------
A GraphIndex(0, 1)
-(term,) -> start, end of the terms posting list in the pack.
+(term,) -> document_count, start, end of the terms posting list in the pack.
+
+This allows the construction of a posting list index object for the term, which
+is then used to whittle down/expand the set of documents to return.
+
+The document count is used in one fairly cheap optimisation: we start our
+search with the term that references the fewest documents, in the hope that
+this will reduce the total IO commensurately.
Revisions index
---------------
(revision,) -> nothing
+
+This index is used purely to detect whether a revision is already indexed - if
+it is we don't need to index it again.
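
As an illustration of the query strategy the DESIGN changes describe (this is
not the plugin's code), a minimal Python sketch: posting_list_for() is a
hypothetical helper returning the doc_ids recorded in a term's posting list,
and term_counts holds each term's document_count from the terms index.

    def query(terms, term_counts, posting_list_for):
        # Visit the rarest term first so the candidate set starts small.
        ordered = sorted(terms, key=lambda term: term_counts[term])
        doc_ids = None
        for term in ordered:
            posting = set(posting_list_for(term))
            if doc_ids is None:
                doc_ids = posting
            else:
                # Set based querying: doc ids already rejected need never
                # be examined in later posting lists.
                doc_ids &= posting
            if not doc_ids:
                break
        return doc_ids or set()
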
=== modified file 'index.py'
--- a/index.py 2008-06-12 01:30:02 +0000
+++ b/index.py 2008-06-12 04:08:43 +0000
@@ -180,7 +180,7 @@
posting_start = int(posting_start)
posting_length = int(posting_length)
posting_stop = posting_start + posting_length
- post_name = component.name + "." + term_id
+ post_name = "term_list." + term_id
filemap = {post_name:(posting_start, posting_stop)}
view = FileView(self._indices_transport,
component.name + '.pack', filemap)
@@ -359,7 +359,7 @@
# load the first document list:
_, term_id, posting_start, posting_length = term_info.pop(0)
posting_stop = posting_start + posting_length
- post_name = component.name + "." + term_id
+ post_name = "term_list." + term_id
filemap = {post_name:(posting_start, posting_stop)}
view = FileView(self._indices_transport, component.name + '.pack',
filemap)
@@ -514,9 +514,16 @@
:param value: The value string from the names list for this component.
"""
lengths = value.split(' ')
- rev_index = GraphIndex(transport, name + '.rix', int(lengths[0]))
- term_index = GraphIndex(transport, name + '.tix', int(lengths[1]))
- doc_index = GraphIndex(transport, name + '.dix', int(lengths[2]))
+ lengths = [int(length) for length in lengths]
+ filemap = {
+ "revisions": (lengths[0], lengths[0] + lengths[1]),
+ "terms": (lengths[2], lengths[2] + lengths[3]),
+ "documents": (lengths[4], lengths[4] + lengths[5]),
+ }
+ view = FileView(transport, name + '.pack', filemap)
+ rev_index = GraphIndex(view, "revisions", lengths[1])
+ term_index = GraphIndex(view, "terms", lengths[3])
+ doc_index = GraphIndex(view, "documents", lengths[5])
self.revision_index = rev_index
self.term_index = term_index
self.document_index = doc_index
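
For context, FileView presents named byte ranges of one .pack file as if each
were a separate file, which is what lets GraphIndex read "revisions", "terms"
and "documents" above without knowing about packs. A rough sketch of the idea
(not the plugin's actual FileView), assuming bzrlib's Transport.readv(relpath,
offsets) interface:

    class FileViewSketch(object):
        """Serve named sub-ranges of a single pack file."""

        def __init__(self, transport, pack_name, filemap):
            self._transport = transport
            self._pack_name = pack_name
            # filemap: logical name -> (start, stop) byte range in the pack.
            self._filemap = filemap

        def readv(self, relpath, offsets):
            start, _stop = self._filemap[relpath]
            # Shift requests into pack coordinates, and results back out.
            shifted = [(start + offset, length) for offset, length in offsets]
            for offset, data in self._transport.readv(self._pack_name, shifted):
                yield offset - start, data
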
@@ -592,6 +599,27 @@
except KeyError:
return None
+ def _add_index_to_pack(self, index, name, writer, index_bytes=None):
+ """Add an index to a pack.
+
+ This ensures the index is encoded as plain bytes in the pack allowing
+ arbitrary readvs.
+
+ :param index: The index to write to the pack.
+ :param name: The name of the index in the pack.
+ :param writer: a ContainerWriter.
+ :param index_bytes: Optional - the contents of the serialised index.
+ :return: A start, length tuple for reading the index back from the
+ pack.
+ """
+ if index_bytes is None:
+ index_bytes = index.finish().read()
+ pos, size = writer.add_bytes_record(index_bytes, [(name,)])
+ length = len(index_bytes)
+ offset = size - length
+ start = pos + offset
+ return start, length
+
def upload_index(self, upload_transport):
"""Upload the index in preparation for insertion.
@@ -603,53 +631,34 @@
# write to disc.
index_bytes = self.revision_index.finish().read()
index_name = md5.new(index_bytes).hexdigest()
- upload_transport.put_bytes_non_atomic(index_name + ".rix",
- index_bytes)
- rev_length = len(index_bytes)
- # XXX: We should be able to do:
- # doc_length = upload_transport.put_file_non_atomic(
- # index_name + '.dix', index.document_index.finish())
- # but, put_file_non_atomic is not interface tested to return the byte
- # length.
+ write_stream = upload_transport.open_write_stream(index_name + ".pack")
+ writer = ContainerWriter(write_stream.write)
+ writer.begin()
+ rev_start, rev_length = self._add_index_to_pack(self.revision_index,
+ "revisions", writer, index_bytes)
+ del index_bytes
# generate a new term index with the length of the serialised posting
# lists.
- elements = []
term_index = InMemoryGraphIndex(0, 1)
- write_stream = upload_transport.open_write_stream(index_name + ".pack")
- elements.append(index_name + ".pack")
- writer = ContainerWriter(write_stream.write)
- writer.begin()
for term, term_id in self.terms.iteritems():
posting_list = self.posting_lists[term_id]
post_index = InMemoryGraphIndex(0, 1)
for doc_id in posting_list:
post_index.add_node((doc_id,), "", ())
- posting_bytes = post_index.finish().read()
- posting_name = index_name + "." + term_id
- pos, size = writer.add_bytes_record(posting_bytes, [(posting_name,)])
- length = len(posting_bytes)
- offset = size - length
- start = pos + offset
- ### upload_transport.put_bytes_non_atomic(posting_name, posting_bytes)
- ### elements.append(posting_name)
+ posting_name = "term_list." + term_id
+ start, length = self._add_index_to_pack(post_index, posting_name,
+ writer)
# How many document ids, and the range for the file view when we
# read the pack later.
- term_value = "%s %d %d %d" % (term_id, len(str(posting_list)),
- start, length)
- del posting_bytes
+ term_value = "%s %d %d %d" % (term_id, len(posting_list), start,
+ length)
term_index.add_node((term,), term_value, ())
+ term_start, term_length = self._add_index_to_pack(term_index, "terms", writer)
+ doc_start, doc_length = self._add_index_to_pack(self.document_index,
+ "documents", writer)
writer.end()
write_stream.close()
- index_bytes = term_index.finish().read()
- term_length = len(index_bytes)
- upload_transport.put_bytes_non_atomic(
- index_name + '.tix', index_bytes)
- index_bytes = self.document_index.finish().read()
- doc_length = len(index_bytes)
- upload_transport.put_bytes_non_atomic(
- index_name + '.dix', index_bytes)
- del index_bytes
- index_value = "%d %d %d" % (rev_length, term_length, doc_length)
- for suffix in [".rix", ".dix", ".tix"]:
- elements.append(index_name + suffix)
+ index_value = "%d %d %d %d %d %d" % (rev_start, rev_length, term_start,
+ term_length, doc_start, doc_length)
+ elements = [index_name + ".pack"]
return index_name, index_value, elements
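
To illustrate the round trip this enables (mirroring the lookup code earlier
in index.py), a sketch, with posting_start and posting_length assumed to have
been read from a terms index entry and component_name naming the component:

    post_name = "term_list." + term_id
    filemap = {post_name: (posting_start, posting_start + posting_length)}
    view = FileView(indices_transport, component_name + '.pack', filemap)
    posting_index = GraphIndex(view, post_name, posting_length)
    # Each key in a posting list index is a (doc_id,) tuple.
    doc_ids = set(key[0] for _, key, _ in posting_index.iter_all_entries())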