Rev 2594: More repository doco. in

Robert Collins robertc at
Thu Jul 12 11:07:55 BST 2007


revno: 2594
revision-id: robertc at
parent: robertc at
committer: Robert Collins <robertc at>
branch nick: repository
timestamp: Thu 2007-07-12 20:07:52 +1000
  More repository doco.
  doc/developers/repository.txt  repository.txt-20070709152006-xkhlek456eclha4u-1
=== modified file 'doc/developers/repository.txt'
--- a/doc/developers/repository.txt	2007-07-12 06:33:28 +0000
+++ b/doc/developers/repository.txt	2007-07-12 10:07:52 +0000
@@ -173,7 +173,138 @@
 increased locality of reference, reducing the impact of large indices
 without needing careful page size management or other tricks.
+We need repository-wide indices. For the current repositories this is
+achieved by dividing the keyspace (revisions, signatures, inventories,
+per-fileid) and then having an append-only index within each keyspace.
+For pack based repositories we will want some means to query the index
+of each component pack, presumably as a single logical index.
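As a sketch of that "single logical index" idea, several per-pack indices could be presented behind one lookup interface. The class and method names below are illustrative, not bzrlib API, and plain dicts stand in for on-disk pack indices:

```python
class CombinedIndex:
    """Present several per-pack indices as one logical index (sketch).

    Each component index is modelled as a dict mapping key -> value;
    a real implementation would read each pack's index from disk.
    """

    def __init__(self, components):
        self.components = list(components)

    def lookup(self, key):
        # Query each component pack's index in turn; later packs
        # shadow earlier ones, mimicking newest-data-wins.
        for index in reversed(self.components):
            if key in index:
                return index[key]
        raise KeyError(key)


pack1 = {"rev-1": "offset 0"}
pack2 = {"rev-2": "offset 512"}
combined = CombinedIndex([pack1, pack2])
print(combined.lookup("rev-2"))  # found in the second pack's index
```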
+It would be nice if indexing were cleanly separated from storage. That
+suggests indices should not know the meaning of the keys they store;
+indices which offer particular orderings, or graph walking facilities,
+will clearly need some of that information, but perhaps they do not
+need to know the semantics?
+
+Index size
+~~~~~~~~~~
+
+Smaller indices are good. We could go with one big index, or a
+different index for different operation styles. As multiple indices
+will occupy more space in total, we should think carefully before
+adding indices.
+
+Index ordering
+~~~~~~~~~~~~~~
+
+Looking at the data access patterns, some operations such as graph
+walking can clearly be made more efficient by offering direct iteration
+rather than repeated reentry into the index - so having indices that
+support iteration in such a style would eventually be useful.
+An initial implementation of indexing can probably get away with a small
+number of primitives. Assuming we have write once index files:
+
+Build index
+~~~~~~~~~~~
+
+This should be done by creating an ``IndexBuilder`` and then calling
+``insert(key, value)`` many times. (Indices that support sorting,
+topological sorting, etc., will want specialised insert methods.)
+When the keys have all been added, a ``finish`` method should be called,
+which will return a file stream to read the index data from.
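A minimal in-memory sketch of that builder primitive might look as follows. The one-"key value"-line-per-entry serialisation and ``finish`` returning an in-memory stream are assumptions for illustration; the real on-disk format is not specified here:

```python
from io import BytesIO


class IndexBuilder:
    """Write-once index builder sketch: insert many keys, then finish."""

    def __init__(self):
        self._entries = []

    def insert(self, key, value):
        # Collect entries; specialised builders could sort here.
        self._entries.append((key, value))

    def finish(self):
        # Serialise one "key value" line per entry and return a
        # file-like stream the caller can copy out to storage.
        data = "".join(
            "%s %s\n" % (key, value) for key, value in self._entries
        )
        return BytesIO(data.encode("utf-8"))


builder = IndexBuilder()
builder.insert("rev-1", "0,100")
builder.insert("rev-2", "100,40")
stream = builder.finish()
print(stream.read().decode("utf-8"))
```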
+
+Retrieve entries from the index
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This should allow random access to the index using readv, so we probably
+want to open the index on a ``Transport``, then use ``iter_entries(keys)``,
+which can return an iterator that yields ``(key, value)`` pairs in
+whatever order makes sense for the index.
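A reader satisfying that ``iter_entries(keys)`` contract could be sketched like this. The "key value" line format is an assumption, and for brevity the whole index is parsed up front rather than ``readv()``-ing byte ranges from a ``Transport``:

```python
class IndexReader:
    """Lookup sketch over index data stored as "key value" lines."""

    def __init__(self, data):
        # A real implementation would readv() only the needed ranges
        # from a Transport; here we parse everything eagerly.
        self._entries = {}
        for line in data.decode("utf-8").splitlines():
            key, value = line.split(" ", 1)
            self._entries[key] = value

    def iter_entries(self, keys):
        # Yield (key, value) pairs in whatever order suits the index;
        # requested keys that are absent are simply skipped here.
        for key in keys:
            if key in self._entries:
                yield (key, self._entries[key])


index = IndexReader(b"rev-1 0,100\nrev-2 100,40\n")
print(list(index.iter_entries(["rev-2", "rev-3"])))  # [('rev-2', '100,40')]
```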
+
+Merging of indices
+~~~~~~~~~~~~~~~~~~
+
+Merging of N indices requires a concordance of the keys of each index.
+So we should offer an ``iter_all_entries`` call that has the same
+return type as the ``iter_entries`` call.
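Merging N write-once indices into one could then be a sorted merge over each index's ``iter_all_entries`` output. This sketch assumes each index yields its entries in sorted key order (dicts again stand in for real indices), and keeps the first entry seen for a duplicate key:

```python
import heapq


def iter_all_entries(index):
    # Stand-in for an index's iter_all_entries: yield every
    # (key, value) pair, in sorted key order.
    for key in sorted(index):
        yield (key, index[key])


def merge_indices(indices):
    """Merge several indices' full entry streams into one sorted stream."""
    merged = heapq.merge(*(iter_all_entries(i) for i in indices))
    seen = set()
    for key, value in merged:
        # On duplicate keys, keep the first entry encountered.
        if key not in seen:
            seen.add(key)
            yield (key, value)


a = {"rev-1": "pack-a", "rev-3": "pack-a"}
b = {"rev-2": "pack-b", "rev-3": "pack-b"}
print(list(merge_indices([a, b])))
# [('rev-1', 'pack-a'), ('rev-2', 'pack-b'), ('rev-3', 'pack-a')]
```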
+
+Changing our current indexes
+----------------------------
+
+We can consider bringing in cleaner indices in advance of a full pack
+based repository.
+There are many possibilities for this, but I've chosen one that seems ok
+to me for illustration.
+A key element is to consider when indices are updated. I think the
+update style proposed for pack based repositories is the right one:
+write once, then when we group data again, write a new single index.
+We could simply write a new index at the end of every bzr transaction,
+indexing the new data introduced by that operation - e.g. at the end
+of each commit.
+If the new index was a specialised index with parent pointers that are
+native pointers inside the index - something like:
+
+ * key
+ * list of byte locations that the parent keys start at
+ * byte offset for the data record in the knit
+ * byte length for the data record in the knit
+ * byte location of the parent key it is compressed against, -1 for a
+   full text
+ * sha1sum? (Do we have sufficient sha1 pointers to not need this in
+   the index?)
+We could map knit indices to this by:
+
+ - giving ghosts their own record with -1 as the byte offset
+ - making operations like get_parents resolve pointers
+
+It is important to note that knit repositories cannot be regenerated
+by scanning .knits; the .kndx is needed too, so a .knit based store
+still requires all of the index information.
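The record layout proposed above, with native parent pointers and ghosts marked by a -1 data offset, could be modelled like this. The field names, the dict keyed by record byte offset, and the ``get_parents`` helper are all illustrative assumptions:

```python
from collections import namedtuple

# One index record, mirroring the field list above: parent pointers
# are byte offsets of other records within the same index.
Record = namedtuple(
    "Record", "key parent_offsets offset length compression_parent sha1"
)

FULL_TEXT = -1  # compression parent value meaning "stored as a full text"


def get_parents(index, record_offset):
    """Resolve a record's native parent pointers back to parent keys."""
    record = index[record_offset]
    return [index[p].key for p in record.parent_offsets]


index = {
    # A ghost gets its own record, with -1 as the data byte offset.
    0: Record("ghost-rev", [], -1, 0, FULL_TEXT, None),
    64: Record("rev-1", [0], 0, 100, FULL_TEXT, None),
    128: Record("rev-2", [64], 100, 40, 64, None),  # delta against rev-1
}
print(get_parents(index, 128))  # ['rev-1']
```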
+
+Moving to pack based repositories
+---------------------------------
+
+We have a number of challenges to solve.
+
+Naming of files
+~~~~~~~~~~~~~~~
+
+As long as the file name is unique it does not really matter. It might
+be interesting to have it be deterministic based on content, but that
+does not solve a problem for us and would require hashing the full
+file. OTOH hashing the full file is a cheap way to detect bit-errors
+in transfer (such as Windows corruption).
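Deriving (and re-verifying) a name from the full file content could be as simple as the following sketch; the sha1 choice and the ``pack-<hash>.pack`` naming scheme are assumptions, not a decided format:

```python
import hashlib


def pack_name(data):
    """Derive a deterministic pack file name from its full content.

    Rehashing the bytes after transfer cheaply detects bit-errors.
    """
    return "pack-%s.pack" % hashlib.sha1(data).hexdigest()


data = b"pack contents"
name = pack_name(data)
print(name)
# Re-hashing the received bytes verifies the transfer was uncorrupted.
assert pack_name(data) == name
```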
+
+Housing files
+~~~~~~~~~~~~~
+
+Combining indices on demand
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Merging data on push
+~~~~~~~~~~~~~~~~~~~~
+
+A trivial implementation would be to make a pack which has just the
+data needed for the push, then send that. More sophisticated
+approaches would be streaming single-pass creation, and also using
+this as an opportunity to increase the packedness of the local
+repository.
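The trivial implementation amounts to computing the set of revisions the target lacks and packing only those. A sketch, with transports and the actual pack serialisation omitted (the helper name is hypothetical):

```python
def revisions_to_push(source_revs, target_revs):
    """Pick just the revisions the target lacks, for a minimal push pack."""
    return sorted(set(source_revs) - set(target_revs))


source = ["rev-1", "rev-2", "rev-3"]
target = ["rev-1"]
print(revisions_to_push(source, target))  # ['rev-2', 'rev-3']
```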
+
+Choosing compression/delta support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
    vim: ft=rst tw=74 ai

More information about the bazaar-commits mailing list