Inventory formats

Robert Collins robertc at robertcollins.net
Tue Sep 12 00:48:54 BST 2006


On Mon, 2006-09-11 at 12:12 -0400, Aaron Bentley wrote:

> > 
> > 1) WorkingTree will most likely be storing it in a different format. So
> > at the least, it can only register what revision ids it knows about, and
> > expect to be called at the appropriate time.
> 
> The current handling is hybrid, with Repository storing the file texts,
> but the tree holding the inventory.  There's an advantage to using the
> same inventory format in both repository and the cached basis, because
> that allows us to copy the text directly.  However, there may be more
> powerful arguments for using disjoint formats.  I don't know.

I'd like to expand on this, separately to this api discussion

When we split WorkingTree/Branch/Repository it took some time for us to
get it clean enough to start to see the access needs more clearly. But
today, I think its pretty clear that WorkingTree *primarily* works in
terms of file paths:
 * We need the local FS path to be used, not unicode.
 * 'All' user requests arrive as local path references

And that the repositories storage of trees works primarily in terms of
file-ids:
 * We need to group all data for a single file-id in a knit,
 * We need to ascertain what knits to examine during fetch operations by
file-id.

And our inventory object, which is used in both revision tree and
working tree has to bridge these two worlds: it needs to support
efficient identification of file_ids from paths, and it needs to support
efficient manipulation of file ids, and finally efficient conversion
back to paths.

We use file_ids as a form of inode, to make it easier to perform complex
merge operations without bugs like 'merge across rename blows up'.

Now, some operations need to operate on entire inventories:
 * commit [needs the entire inventory to commit, which *may* require
reading the full basis inventory, and *may* require reading the full
working inventory, but *always* requires reading at least a full
inventory from somewhere]
 * full-tree update [needs the full working inventory, and the inventory
of the target revision and the basis revision]
 * full-tree diff, full-tree status etc.

But other operations do not need entire inventories:
 * repository fetch operations only need the fileid:revision sets, where
revision is a subset of the fetched revisions.
 * subtree status and diff only need the path-prefixed set of entries,
where the path-prefix hauls in entries from the compared-to-revisions
tree by their id.

The point of all this is quite simple:
 * Generally, *incremental* operations about WorkingTree are
path-orientated.
 * Generally, *incremental* operations about RevisionTree are file_id
orientated. 

And this is the argument for having disjoint formats. Once dirstate is
finished I intend to design a matching repository format - its been on
the table for a year now, and I think we have enough information to
design it well.

So - dirstate is:
 * A helper object for WorkingTree that manages a path-indexed,
disk-based, mutable, inventory, including the current association of
parents to paths. 
 * fast for subpath operations (allows binary searching 
 * highly localised diskio for related operations 
 * able to generate a full inventory on command

And what we need in the repository is a similar *but different* set of
constraints (I dont have all the use cases handy, but we'll need them to
do as good a job as dirstate):
 * data-access within delta-stored inventories (that is, we need access
to some data without reconstructing a full inventory) [fetch]
 * incrementally storable (that is, we need to be able to create a delta
for storage without referencing the full inventory in some conditions).
For instance, if we commit a change to one file, we know that we cannot
have invalidated the invariant of file-id uniqueness, so we want to be
able to store just
[file_id, parent_id, name, last_changed, size, sha] - or perhaps even a
subset of that is compatible with the data-access within delta-stored
inventories requirement above. [commit subtree]
 * Efficient mapping from file_id:revision pairs to paths [log of a
subpath]
 * Efficient common-case tree-delta creation from inventory to/from its
left-most-parent. [status -r x..y and diff -r x..y]

I think this is doable, but there will be model changes (I am quite sure
now we need more information about last-changed of directories).

What I dont think is doable is a single format that meets the above use
cases, and the use cases for WorkingTree. So having disjoint formats
lets us optimise these mostly separate needs.

We will need to pay special attention to the manner in which we update
the parent information of WorkingTree during
commit/uncommit/update/merge - falling back to 'hand over a fall tree'
is a safe pattern, but one that we should be trying to avoid in the
future I think.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 191 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060912/53e38a6f/attachment.pgp 


More information about the bazaar mailing list