Rev 2626: Cleanup docs. in http://people.ubuntu.com/~robertc/baz2.0/repository

Fri Jul 13 16:05:40 BST 2007

At http://people.ubuntu.com/~robertc/baz2.0/repository

------------------------------------------------------------
revno: 2626
revision-id: robertc at robertcollins.net-20070713150536-hqtkufys7aiqxl1t
parent: robertc at robertcollins.net-20070713141921-wv9fiwi3i5u2qv4f
committer: Robert Collins <robertc at robertcollins.net>
branch nick: repository
timestamp: Sat 2007-07-14 01:05:36 +1000
message:
  Cleanup docs.
added:
  doc/developers/indices.txt     indices.txt-20070713142939-m5cdnp31u8ape0td-1
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
  doc/developers/index.txt       index.txt-20070508041241-qznziunkg0nffhiw-1
  doc/developers/repository.txt  repository.txt-20070709152006-xkhlek456eclha4u-1
=== added file 'doc/developers/indices.txt'

--- a/doc/developers/indices.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/indices.txt	2007-07-13 15:05:36 +0000
@@ -0,0 +1,101 @@
+=======
+Indices
+=======
+
+Status
+======
+
+:Date: 2007-07-14
+
+This document describes the indexing facilities within bzrlib.
+
+
+.. contents::
+
+
+Motivation
+==========
+
+To provide a clean concept of index that can be reused by different
+components within the codebase rather than being rewritten every time
+by different components.
+
+
+Terminology
+===========
+
+An **index** is a dictionary mapping opaque keys to opaque values.
+Different index types may allow some of the value data to be interpreted
+by the index. For example the ``GraphIndex`` index stores a graph between
+keys as part of the index.
+
+
+Overview
+========
+
+bzr is moving to a write-once model for repository storage in order to
+achieve lock-free repositories eventually. In order to support this we are
+making our new index classes **immutable**. That is, one creates a new
+index in a single operation, and after that it is read only. To combine
+two indices a ``Combined*`` index may be used, or an **index merge** may
+be performed by reading the entire value of two (or more) indices and
+writing them into a new index.
+
+General Index API
+=================
+
+While different indices will likely require different interfaces, we
+should keep these consistent to encourage easy adaption and reuse between
+indices. For instance a simple key-value index should be layerable on top
+of a more complex graph-aware index.
+
+Services
+--------
+
+An initial implementation of indexing can probably get away with a small
+number of primitives. Assuming we have write once index files:
+
+Build index
+~~~~~~~~~~~
+
+This should be done by creating an ``IndexBuilder`` and then calling
+``insert(key, value)`` many times. (Indices that support sorting,
+topological sorting etc, will want specialised insert methods).
+
+When the keys have all been added, a ``finish`` method should be called,
+which will return a file stream to read the index data from.
+
+Retrieve entries from the index
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This should allow random access to the index using readv, so we probably
+want to open the index on a ``Transport``, then use ``iter_entries(keys)``,
+which can return an iterator that yields ``(key, value)`` pairs in
+whatever order makes sense for the index.
+
+Merging of indices
+~~~~~~~~~~~~~~~~~~
+
+Merging of N indices requires a concordance of the keys of the index. So
+we should offer a ``iter_all_entries`` call that has the same return type
+as the ``iter_entries`` call.
+
+Index implementations
+=====================
+
+GraphIndex
+----------
+
+``GraphIndex`` supports graph based lookups. While currently unoptimised
+for reading, the index is quite space efficient at storing the revision
+graph index for bzr. The ``GraphIndexBuilder`` may be used to create one
+of these indices by calling ``add_node`` until all nodes are added, then
+``finish`` to obtain a file stream containing the index data. Multiple
+indices may be queried using the ``CombinedGraphIndex`` class.
+
+
+
+
+..
+   vim: ft=rst tw=74 ai
+

=== modified file 'NEWS'
--- a/NEWS	2007-07-13 06:12:27 +0000
+++ b/NEWS	2007-07-13 15:05:36 +0000
@@ -47,6 +47,14 @@
       the null revision, and consider using ``None`` for this purpose
       deprecated.  (Aaron Bentley)
 
+    * New ``index`` module with abstract index functionality. This will be
+      used during the planned changes in the repository layer. Currently the
+      index layer provides a graph aware immutable index, a builder for the
+      same index type to allow creating them, and finally a composer for
+      such indices to allow the use of many indices in a single query. The
+      index performance is not optimised, however the API is stable to allow
+      development on top of the index. (Robert Collins)
+
   TESTING:
 
     * Remove selftest ``--clean-output``, ``--numbered-dirs`` and

=== modified file 'bzrlib/index.py'
--- a/bzrlib/index.py	2007-07-13 14:17:17 +0000
+++ b/bzrlib/index.py	2007-07-13 15:05:36 +0000
@@ -174,6 +174,11 @@
     terminated string without any newlines.
 
     It is presumed that the index will not be mutated - it is static data.
+
+    Currently successive iter_entries/iter_all_entries calls will read the
+    entire index each time. Additionally iter_entries calls will read the
+    entire index always. XXX: This must be fixed before the index is 
+    suitable for production use. :XXX
     """
 
     def __init__(self, transport, name):

=== modified file 'doc/developers/index.txt'
--- a/doc/developers/index.txt	2007-07-12 06:33:28 +0000
+++ b/doc/developers/index.txt	2007-07-13 15:05:36 +0000
@@ -39,6 +39,10 @@
 
   Notes on a container format for streaming and storing Bazaar data.
 
+* `Indices <indices.txt>`_
+
+  The index facilities available within bzrlib.
+
 * `Repositories <repository.htm>`_
 
   What repositories do and are used for.

=== modified file 'doc/developers/repository.txt'
--- a/doc/developers/repository.txt	2007-07-12 13:06:16 +0000
+++ b/doc/developers/repository.txt	2007-07-13 15:05:36 +0000
@@ -199,37 +199,6 @@
 than repeated reentry into the index - so having indices that support
 iteration in such a style would be useful eventually.
 
-Services
-~~~~~~~~
-
-An initial implementation of indexing can probably get away with a small
-number of primitives. Assuming we have write once index files:
-
-Build index
-^^^^^^^^^^^
-
-This should be done by creating an ``IndexBuilder`` and then calling
-``insert(key, value)`` many times. (Indices that support sorting,
-topological sorting etc, will want specialised insert methods).
-
-When the keys have all been added, a ``finish`` method should be called,
-which will return a file stream to read the index data from.
-
-Retrieve entries from the index
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-This should allow random access to the index using readv, so we probably
-want to open the index on a ``Transport``, then use ``iter_entries(keys)``,
-which can return an iterator that yields ``(key, value)`` pairs in
-whatever order makes sense for the index.
-
-Merging of indices
-^^^^^^^^^^^^^^^^^^
-
-Merging of N indices requires a concordance of the keys of the index. So
-we should offer a ``iter_all_entries`` call that has the same return type
-as the ``iter_entries`` call.
-
 Changing our current indexes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -248,70 +217,30 @@
 
 We could discard the per-knit .kndx by writing a new index at the end of
 every bzr transaction indexing the new data introduced by the bzr
-operation. e.g. at the end of fetch.
-
-We can keep the knit data file if  the new index was a specialised index
-with parent pointers that are native pointers inside the index values -
-something like:
- * list of byte locations for the parent keys entries in this index or
-   [-1] for not present in the index (its just a name to be pointed at)
- * byte offset for the data record in the knit
- * byte length for the data record in the knit
- * byte locations for parent key it is compressed against, -1 for full
-   text
- * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
-   index?)
- * noeol will need a flag too as that does not appear to be in the zip
- data.
-
-Separation of concerns, and having something that can be used outside
-knits suggests splitting this differently. Lets build an index that can
-store a graph efficiently. So the index itself understands:
- * key
- * parents list
- * value
-And then in the value we can serialise:
- * byte offset for the data record in the knit
- * byte length for the data record in the knit
- * full text/not full text. (no less general than knit indices).
- * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
-   index?)
- * noeol will need a flag too as that does not appear to be in the zip
- data.
-
-Trading off some complexity we could have the index understand:
- * key
- * A list of node-referencing lists (e.g. 2 lists of parents)
- * value
-And then in the value we serialise:
- * byte offset for the data record in the knit
- * byte length for the data record in the knit
- * sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
-   index?)
- * noeol will need a flag too as that does not appear to be in the zip
- data.
- In this scenario we will have the first parents list be the graph
- parents, and the second parents list be the compression parents. (empty
- for full text)
-
-Index merging can take place easily because all the data that we may
-choose to dictionary compress within the index is maintained by the index,
-the only data in the value for each entry is data solely relevant to the
-knit data file.
-
-We could map knit indices to this by:
- - giving ghosts their own record with -1 as the byte offset
- - making operations like get_parents resolve pointers
+operation. e.g. at the end of fetch. This can be based on the new
+``GraphIndex`` index type.
+
+Encoding a knit entry into a ``GraphIndex`` can be done as follows:
+
+* Change the key to include a prefix of the knit name, to allow filtering
+  out of data from different knits.
+* Encode the parents from the knit as the zeroth node reference list.
+* If the knit hunk was delta compressed encode the node it was delta
+  compressed against as the 1st node reference list (otherwise the 1st
+  node reference list will be empty to indicate no compression parents).
+* For the value encode similarly to the current knit format the byte
+  offset for the data record in the knit, the byte length for the data
+  record in the knit and the no-end-of-line flag.
 
 Its important to note that knit repositories cannot be regenerated by
-scanning .knits, data from .kndx is needed too, so a .knit based store still
-requires all the information that the current .kndx contains.
+scanning .knits, so a mapped index is still irreplaceable and must be
+transmitted on push/pull.
 
 A potential improvement exists by specialising this further to not record
 data that is not needed - e.g. an index of revisions does not need to
 support a pointer to a parent compressed text as revisions.knit is not
 delta-compressed ever. Likewise signatures do not need the parent pointers
-as there is no 'signature graph'.
+at all as there is no 'signature graph'.
 
 Data 
 ----