Rev 2595: More speculation and repository docs. in http://people.ubuntu.com/~robertc/baz2.0/repository

Thu Jul 12 14:06:19 BST 2007

At http://people.ubuntu.com/~robertc/baz2.0/repository

------------------------------------------------------------
revno: 2595
revision-id: robertc at robertcollins.net-20070712130616-7c8rum60e382krml
parent: robertc at robertcollins.net-20070712100752-4e333owrhp07ymdy
committer: Robert Collins <robertc at robertcollins.net>
branch nick: repository
timestamp: Thu 2007-07-12 23:06:16 +1000
message:
  More speculation and repository docs.
modified:
  doc/developers/repository.txt  repository.txt-20070709152006-xkhlek456eclha4u-1
=== modified file 'doc/developers/repository.txt'

--- a/doc/developers/repository.txt	2007-07-12 10:07:52 +0000
+++ b/doc/developers/repository.txt	2007-07-12 13:06:16 +0000
@@ -233,41 +233,85 @@
 Changing our current indexes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-We can consider bring in cleaner indices in advance of bringing in a full
-pack based repository.
+We can consider introducing cleaner indices in advance of a full pack
+based repository.
 
 There are many possibilities for this, but I've chosen one that seems ok
 to me for illustration.
 
 A key element is to consider when indices are updated. I think that the
 update style proposed for pack based repositories - write once, then when
-we group data again rewrite a new single index.
-
-Decoration
-^^^^^^^^^^
-
-We could simply write a new index at the end of every bzr transaction
-indexing the new data introduced by the bzr operation. e.g. at the end of
-fetch.
-
-If the new index was a specialised index with parent pointers that are
-native pointers inside the index - something like:
- * key
- * list of byte locations parent keys start at
+we group data again rewrite a new single index - is sufficient.
+
+Replace .kndx
+^^^^^^^^^^^^^
+
+We could discard the per-knit .kndx by writing a new index at the end of
+every bzr transaction indexing the new data introduced by the bzr
+operation. e.g. at the end of fetch.
+
+We can keep the knit data file if the new index was a specialised index
+with parent pointers that are native pointers inside the index values -
+something like:
+ * list of byte locations for the parent keys' entries in this index or
+   [-1] for not present in the index (it's just a name to be pointed at)
  * byte offset for the data record in the knit
  * byte length for the data record in the knit
  * byte locations for parent key it is compressed against, -1 for full
- text
- * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
- index?)
+   text
+ * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
+   index?)
+ * noeol will need a flag too as that does not appear to be in the zip
+   data.
+
+Separation of concerns, and having something that can be used outside
+knits suggests splitting this differently. Let's build an index that can
+store a graph efficiently. So the index itself understands:
+ * key
+ * parents list
+ * value
+And then in the value we can serialise:
+ * byte offset for the data record in the knit
+ * byte length for the data record in the knit
+ * full text/not full text. (no less general than knit indices).
+ * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
+   index?)
+ * noeol will need a flag too as that does not appear to be in the zip
+   data.
+
+Trading off some complexity we could have the index understand:
+ * key
+ * A list of node-referencing lists (e.g. 2 lists of parents)
+ * value
+And then in the value we serialise:
+ * byte offset for the data record in the knit
+ * byte length for the data record in the knit
+ * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
+   index?)
+ * noeol will need a flag too as that does not appear to be in the zip
+   data.
+ In this scenario we will have the first parents list be the graph
+ parents, and the second parents list be the compression parents. (empty
+ for full text)
+
+Index merging can take place easily because all the data that we may
+choose to dictionary compress within the index is maintained by the index;
+the only data in the value for each entry is data solely relevant to the
+knit data file.
 
 We could map knit indices to this by:
  - giving ghosts their own record with -1 as the byte offset
  - making operations like get_parents resolve pointers
 
 Its important to note that knit repositories cannot be regenerated by
-scanning .knits, .kndx is needed too, so a .knit based store still
-requires all the information 
+scanning .knits, data from .kndx is needed too, so a .knit based store still
+requires all the information that the current .kndx contains.
+
+A potential improvement exists by specialising this further to not record
+data that is not needed - e.g. an index of revisions does not need to
+support a pointer to a parent compressed text as revisions.knit is not
+delta-compressed ever. Likewise signatures do not need the parent pointers
+as there is no 'signature graph'.
 
 Data 
 ----
@@ -281,10 +325,18 @@
 ~~~~~~~~~~~~~~~
 
 As long as the file name is unique it does not really matter. It might be
-interesting to have it be deterministic based on content, but that does
-solve a problem for us and would require hashing the full file. OTOH
-hashing the full file is a cheap way to detect bit-errors in transfer
-(such as windows corruption).
+interesting to have it be deterministic based on content, but there are no
+specific problems we have solved by doing that, and doing so would require
+hashing the full file. OTOH hashing the full file is a cheap way to detect
+bit-errors in transfer (such as Windows corruption).
+
+Discovery of files
+~~~~~~~~~~~~~~~~~~
+
+With non-listable transports how should the collection of pack/index files
+be found? Initially record a list of all the pack/index files from
+write actions. (Require writable transports to be listable). We can then
+use a heuristic to statically combine pack/index files later.
 
 Housing files
 ~~~~~~~~~~~~~
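
The third index layout in the patch above (key, a list of node-referencing
lists, opaque value) can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not code from bzr: the names
GraphIndex, KnitValue, add_node and ghosts are hypothetical, the value
encoding ("offset length flag") is invented for the example, and ghosts are
modelled here as keys that are referenced but have no node, rather than as
records with a -1 byte offset.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Set, Tuple

Key = str
RefLists = Tuple[Tuple[Key, ...], ...]


@dataclass(frozen=True)
class KnitValue:
    """Hypothetical value payload: data relevant only to the knit data file."""
    offset: int          # byte offset of the data record in the knit
    length: int          # byte length of the data record in the knit
    noeol: bool = False  # flag for texts without a trailing newline

    def serialise(self) -> bytes:
        flag = "N" if self.noeol else "-"
        return f"{self.offset} {self.length} {flag}".encode("ascii")

    @classmethod
    def parse(cls, data: bytes) -> "KnitValue":
        offset, length, flag = data.decode("ascii").split(" ")
        return cls(int(offset), int(length), flag == "N")


class GraphIndex:
    """In-memory sketch: the index understands keys and reference lists,
    and treats the value as opaque bytes."""

    def __init__(self, reference_lists: int = 2) -> None:
        self._reference_lists = reference_lists
        self._nodes: Dict[Key, Tuple[RefLists, bytes]] = {}

    def add_node(self, key: Key, references: Sequence[Sequence[Key]],
                 value: bytes) -> None:
        if len(references) != self._reference_lists:
            raise ValueError("wrong number of reference lists")
        if key in self._nodes:
            raise ValueError("duplicate key: %s" % key)
        self._nodes[key] = (tuple(tuple(r) for r in references), value)

    def get_parents(self, key: Key) -> Tuple[Key, ...]:
        # First reference list: graph parents.
        return self._nodes[key][0][0]

    def get_compression_parents(self, key: Key) -> Tuple[Key, ...]:
        # Second reference list: compression parents (empty for full text).
        return self._nodes[key][0][1]

    def is_full_text(self, key: Key) -> bool:
        return not self.get_compression_parents(key)

    def get_value(self, key: Key) -> bytes:
        return self._nodes[key][1]

    def ghosts(self) -> Set[Key]:
        # Keys that are pointed at but have no node of their own.
        referenced = {ref for refs, _ in self._nodes.values()
                      for ref_list in refs for ref in ref_list}
        return referenced - set(self._nodes)
```

Because the index never looks inside the value, a revisions index could be
built with reference_lists=1 and a signatures index with reference_lists=0,
which matches the specialisation suggested at the end of the first hunk.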