Rev 2594: More repository doco. in http://people.ubuntu.com/~robertc/baz2.0/repository

Thu Jul 12 11:07:55 BST 2007

At http://people.ubuntu.com/~robertc/baz2.0/repository

------------------------------------------------------------
revno: 2594
revision-id: robertc at robertcollins.net-20070712100752-4e333owrhp07ymdy
parent: robertc at robertcollins.net-20070712063328-h0i90tr4vd8d19yf
committer: Robert Collins <robertc at robertcollins.net>
branch nick: repository
timestamp: Thu 2007-07-12 20:07:52 +1000
message:
  More repository doco.
modified:
  doc/developers/repository.txt  repository.txt-20070709152006-xkhlek456eclha4u-1
=== modified file 'doc/developers/repository.txt'

--- a/doc/developers/repository.txt	2007-07-12 06:33:28 +0000
+++ b/doc/developers/repository.txt	2007-07-12 10:07:52 +0000
@@ -173,7 +173,138 @@
 increased locality of reference, reducing the impact of a large indices
 without needing careful page size management or other tricks.
 
+We need repository wide indices. For the current repositories this is
+achieved by dividing the keyspace (revisions, signatures, inventories,
+per-fileid) and then having an append only index within each keyspace.
+For pack based repositories we will want some means to query the index of
+each component pack, presumably as a single logical index.
+
+It would be nice if indexing was made cleanly separate from storage. So
+that suggests indices don't know the meaning of the lookup; indices which
+offer particular ordering, or graph walking facilities will clearly need
+that information, but perhaps they don't need to know the semantics ?
+
+Index size
+~~~~~~~~~~
+
+Smaller indexes are good. We could go with one big index, or a different
+index for different operation styles. As multiple indices will occupy more
+space in total we should consider carefully about adding indices.
+
+Index ordering
+~~~~~~~~~~~~~~
+
+Looking at the data access patterns some operations such as graph walking
+can clearly be made more efficient by offering direct iteration rather
+than repeated reentry into the index - so having indices that support
+iteration in such a style would be useful eventually.
+
+Services
+~~~~~~~~
+
+An initial implementation of indexing can probably get away with a small
+number of primitives. Assuming we have write once index files:
+
+Build index
+^^^^^^^^^^^
+
+This should be done by creating an ``IndexBuilder`` and then calling
+``insert(key, value)`` many times. (Indices that support sorting,
+topological sorting etc, will want specialised insert methods).
+
+When the keys have all been added, a ``finish`` method should be called,
+which will return a file stream to read the index data from.
+
+Retrieve entries from the index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This should allow random access to the index using readv, so we probably
+want to open the index on a ``Transport``, then use ``iter_entries(keys)``,
+which can return an iterator that yields ``(key, value)`` pairs in
+whatever order makes sense for the index.
+
+Merging of indices
+^^^^^^^^^^^^^^^^^^
+
+Merging of N indices requires a concordance of the keys of the index. So
+we should offer a ``iter_all_entries`` call that has the same return type
+as the ``iter_entries`` call.
+
+Changing our current indexes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We can consider bring in cleaner indices in advance of bringing in a full
+pack based repository.
+
+There are many possibilities for this, but I've chosen one that seems ok
+to me for illustration.
+
+A key element is to consider when indices are updated. I think that the
+update style proposed for pack based repositories - write once, then when
+we group data again rewrite a new single index.
+
+Decoration
+^^^^^^^^^^
+
+We could simply write a new index at the end of every bzr transaction
+indexing the new data introduced by the bzr operation. e.g. at the end of
+fetch.
+
+If the new index was a specialised index with parent pointers that are
+native pointers inside the index - something like:
+ * key
+ * list of byte locations parent keys start at
+ * byte offset for the data record in the knit
+ * byte length for the data record in the knit
+ * byte locations for parent key it is compressed against, -1 for full
+ text
+ * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
+ index?)
+
+We could map knit indices to this by:
+ - giving ghosts their own record with -1 as the byte offset
+ - making operations like get_parents resolve pointers
+
+Its important to note that knit repositories cannot be regenerated by
+scanning .knits, .kndx is needed too, so a .knit based store still
+requires all the information 
+
 Data 
+----
+
+Moving to pack based repositories
+---------------------------------
+
+We have a number of challenges to solve.
+
+Naming of files
+~~~~~~~~~~~~~~~
+
+As long as the file name is unique it does not really matter. It might be
+interesting to have it be deterministic based on content, but that does
+solve a problem for us and would require hashing the full file. OTOH
+hashing the full file is a cheap way to detect bit-errors in transfer
+(such as windows corruption).
+
+Housing files
+~~~~~~~~~~~~~
+
+Combining indices on demand
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Merging data on push
+~~~~~~~~~~~~~~~~~~~~
+
+A trivial implementation would be to make a pack which has just the data
+needed for the push, then send that. More sophisticated things would be
+streaming single-pass creation, and also using this as an opportunity to
+increase the packedness of the local repo.
+
+Choosing compression/delta support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+
 
 ..
    vim: ft=rst tw=74 ai