Rev 25: Combine all index component files into a single .pack file, further reducing file system friction. in http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk
Robert Collins
robertc at robertcollins.net
Thu Jun 12 05:08:44 BST 2008
At http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk
------------------------------------------------------------
revno: 25
revision-id: robertc at robertcollins.net-20080612040843-zh3lsf88epwfhabd
parent: robertc at robertcollins.net-20080612013002-kl2rmh8imu8bknum
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Thu 2008-06-12 14:08:43 +1000
message:
Combine all index component files into a single .pack file, further reducing file system friction.
modified:
DESIGN design-20080608072426-vjoj110dtykfyb7g-1
index.py index.py-20080608055509-hnimeek7q8tctkqf-2
=== modified file 'DESIGN'
--- a/DESIGN 2008-06-12 01:30:02 +0000
+++ b/DESIGN 2008-06-12 04:08:43 +0000
@@ -26,8 +26,12 @@
The current file text indexing solution uses iter_lines_added_or_present_in -
which means either that more, or less, hits than are interesting (depending on
-what you consider interesting) are returned. Indexing bzr.dev spikes at nearly
-600MB. Indexing very large trees will likely thrash.
+what you consider interesting) are returned. Indexing bzr.dev stabilises at
+180MB for me. (Very) large trees have been observed to need 1GB of RAM.
+If thrashing occurs, change the group size in index.py down from 2500 revisions
+- the memory drop is roughly proportional to that figure. More index components
+cause problems of their own though: no component combining is done yet, so
+each component is searched sequentially.
Indexing non-text
=================
@@ -79,9 +83,12 @@
Second cut - in progress
========================
-Each name in the names file is a pack, its length, and two read vectors for the
-terms and revisions lists. The pack contains a series of posting lists trailed
-by a top level index of document ids, terms and indexed revisions.
+Each name in the names file refers to a pack; its value holds the start and
+length coordinates, within that pack, of the three top level indices that make
+up a component: the revisions index, the documents index and the terms index.
+These can be recreated by scanning a component's pack, should they be lost.
+
+The pack contains these top level indices and a set of posting list indices.
Posting list
------------
@@ -89,19 +96,39 @@
A GraphIndex(0, 1)
(doc_id -> "")
+This allows set based querying to identify document ids that are matches. If we
+have already selected (say) ids 45 and 200, there is no need to examine all the
+document ids in a particular posting list.
+
Document ids
------------
A GraphIndex(0, 1)
(doc_id -> doc_key)
+Document ids are a proxy for a specific document in the corpus. Actual corpus
+identifiers are long (tens of bytes), so using a simple serial number for each
+document we refer to keeps the index size compact. Once a search is complete,
+the selected document ids are resolved into document keys within the branch,
+which can then be used to show the results to users.
+
Terms index
-----------
A GraphIndex(0, 1)
-(term,) -> start, end of the terms posting list in the pack.
+(term,) -> document_count, start, end of the terms posting list in the pack.
+
+This allows the construction of a posting list index object for the term, which
+is then used to whittle down/expand the set of documents to return.
+
+The document count is used in one fairly cheap optimisation: we start our
+search with the term that references the fewest documents, in the hope that
+this will reduce the total IO commensurately.
Revisions index
---------------
(revision,) -> nothing
+
+This index is used purely to detect whether a revision is already indexed - if
+it is we don't need to index it again.
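
As an illustration of the query strategy the DESIGN changes describe (this is
not the plugin's code), a minimal Python sketch: posting_list_for() is a
hypothetical helper returning the doc_ids recorded in a term's posting list,
and term_counts holds each term's document_count from the terms index.

    def query(terms, term_counts, posting_list_for):
        # Visit the rarest term first so the candidate set starts small.
        ordered = sorted(terms, key=lambda term: term_counts[term])
        doc_ids = None
        for term in ordered:
            posting = set(posting_list_for(term))
            if doc_ids is None:
                doc_ids = posting
            else:
                # Set based querying: doc ids already rejected need never
                # be examined in later posting lists.
                doc_ids &= posting
            if not doc_ids:
                break
        return doc_ids or set()
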
=== modified file 'index.py'
--- a/index.py 2008-06-12 01:30:02 +0000
+++ b/index.py 2008-06-12 04:08:43 +0000
@@ -180,7 +180,7 @@
posting_start = int(posting_start)
posting_length = int(posting_length)
posting_stop = posting_start + posting_length
- post_name = component.name + "." + term_id
+ post_name = "term_list." + term_id
filemap = {post_name:(posting_start, posting_stop)}
view = FileView(self._indices_transport,
component.name + '.pack', filemap)
@@ -359,7 +359,7 @@
# load the first document list:
_, term_id, posting_start, posting_length = term_info.pop(0)
posting_stop = posting_start + posting_length
- post_name = component.name + "." + term_id
+ post_name = "term_list." + term_id
filemap = {post_name:(posting_start, posting_stop)}
view = FileView(self._indices_transport, component.name + '.pack',
filemap)
@@ -514,9 +514,16 @@
:param value: The value string from the names list for this component.
"""
lengths = value.split(' ')
- rev_index = GraphIndex(transport, name + '.rix', int(lengths[0]))
- term_index = GraphIndex(transport, name + '.tix', int(lengths[1]))
- doc_index = GraphIndex(transport, name + '.dix', int(lengths[2]))
+ lengths = [int(length) for length in lengths]
+ filemap = {
+ "revisions": (lengths[0], lengths[0] + lengths[1]),
+ "terms": (lengths[2], lengths[2] + lengths[3]),
+ "documents": (lengths[4], lengths[4] + lengths[5]),
+ }
+ view = FileView(transport, name + '.pack', filemap)
+ rev_index = GraphIndex(view, "revisions", lengths[1])
+ term_index = GraphIndex(view, "terms", lengths[3])
+ doc_index = GraphIndex(view, "documents", lengths[5])
self.revision_index = rev_index
self.term_index = term_index
self.document_index = doc_index
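
For context, FileView presents named byte ranges of one .pack file as if each
were a separate file, which is what lets GraphIndex read "revisions", "terms"
and "documents" above without knowing about packs. A rough sketch of the idea
(not the plugin's actual FileView), assuming bzrlib's Transport.readv(relpath,
offsets) interface:

    class FileViewSketch(object):
        """Serve named sub-ranges of a single pack file."""

        def __init__(self, transport, pack_name, filemap):
            self._transport = transport
            self._pack_name = pack_name
            # filemap: logical name -> (start, stop) byte range in the pack.
            self._filemap = filemap

        def readv(self, relpath, offsets):
            start, _stop = self._filemap[relpath]
            # Shift requests into pack coordinates, and results back out.
            shifted = [(start + offset, length) for offset, length in offsets]
            for offset, data in self._transport.readv(self._pack_name, shifted):
                yield offset - start, data
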
@@ -592,6 +599,27 @@
except KeyError:
return None
+ def _add_index_to_pack(self, index, name, writer, index_bytes=None):
+ """Add an index to a pack.
+
+ This ensures the index is encoded as plain bytes in the pack allowing
+ arbitrary readvs.
+
+ :param index: The index to write to the pack.
+ :param name: The name of the index in the pack.
+ :param writer: a ContainerWriter.
+ :param index_bytes: Optional - the contents of the serialised index.
+ :return: A start, length tuple for reading the index back from the
+ pack.
+ """
+ if index_bytes is None:
+ index_bytes = index.finish().read()
+ pos, size = writer.add_bytes_record(index_bytes, [(name,)])
+ length = len(index_bytes)
+ offset = size - length
+ start = pos + offset
+ return start, length
+
def upload_index(self, upload_transport):
"""Upload the index in preparation for insertion.
@@ -603,53 +631,34 @@
# write to disc.
index_bytes = self.revision_index.finish().read()
index_name = md5.new(index_bytes).hexdigest()
- upload_transport.put_bytes_non_atomic(index_name + ".rix",
- index_bytes)
- rev_length = len(index_bytes)
- # XXX: We should be able to do:
- # doc_length = upload_transport.put_file_non_atomic(
- # index_name + '.dix', index.document_index.finish())
- # but, put_file_non_atomic is not interface tested to return the byte
- # length.
+ write_stream = upload_transport.open_write_stream(index_name + ".pack")
+ writer = ContainerWriter(write_stream.write)
+ writer.begin()
+ rev_start, rev_length = self._add_index_to_pack(self.revision_index,
+ "revisions", writer, index_bytes)
+ del index_bytes
# generate a new term index with the length of the serialised posting
# lists.
- elements = []
term_index = InMemoryGraphIndex(0, 1)
- write_stream = upload_transport.open_write_stream(index_name + ".pack")
- elements.append(index_name + ".pack")
- writer = ContainerWriter(write_stream.write)
- writer.begin()
for term, term_id in self.terms.iteritems():
posting_list = self.posting_lists[term_id]
post_index = InMemoryGraphIndex(0, 1)
for doc_id in posting_list:
post_index.add_node((doc_id,), "", ())
- posting_bytes = post_index.finish().read()
- posting_name = index_name + "." + term_id
- pos, size = writer.add_bytes_record(posting_bytes, [(posting_name,)])
- length = len(posting_bytes)
- offset = size - length
- start = pos + offset
- ### upload_transport.put_bytes_non_atomic(posting_name, posting_bytes)
- ### elements.append(posting_name)
+ posting_name = "term_list." + term_id
+ start, length = self._add_index_to_pack(post_index, posting_name,
+ writer)
# How many document ids, and the range for the file view when we
# read the pack later.
- term_value = "%s %d %d %d" % (term_id, len(str(posting_list)),
- start, length)
- del posting_bytes
+ term_value = "%s %d %d %d" % (term_id, len(posting_list), start,
+ length)
term_index.add_node((term,), term_value, ())
+ term_start, term_length = self._add_index_to_pack(term_index, "terms", writer)
+ doc_start, doc_length = self._add_index_to_pack(self.document_index,
+ "documents", writer)
writer.end()
write_stream.close()
- index_bytes = term_index.finish().read()
- term_length = len(index_bytes)
- upload_transport.put_bytes_non_atomic(
- index_name + '.tix', index_bytes)
- index_bytes = self.document_index.finish().read()
- doc_length = len(index_bytes)
- upload_transport.put_bytes_non_atomic(
- index_name + '.dix', index_bytes)
- del index_bytes
- index_value = "%d %d %d" % (rev_length, term_length, doc_length)
- for suffix in [".rix", ".dix", ".tix"]:
- elements.append(index_name + suffix)
+ index_value = "%d %d %d %d %d %d" % (rev_start, rev_length, term_start,
+ term_length, doc_start, doc_length)
+ elements = [index_name + ".pack"]
return index_name, index_value, elements
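
To illustrate the round trip this enables (mirroring the lookup code earlier
in index.py), a sketch, with posting_start and posting_length assumed to have
been read from a terms index entry and component_name naming the component:

    post_name = "term_list." + term_id
    filemap = {post_name: (posting_start, posting_start + posting_length)}
    view = FileView(indices_transport, component_name + '.pack', filemap)
    posting_index = GraphIndex(view, post_name, posting_length)
    # Each key in a posting list index is a (doc_id,) tuple.
    doc_ids = set(key[0] for _, key, _ in posting_index.iter_all_entries())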