journalled inventories

Robert Collins robertc at robertcollins.net
Thu Oct 25 07:58:21 BST 2007


On Thu, 2007-10-25 at 16:15 +1000, Martin Pool wrote:
> Having potentially many round trips for a split-up inventory is a
> concern for me too and I'm happy to explore alternatives.

One way to reduce such round trips is the topological-locality work
that Kier and I have been discussing with respect to GraphIndex. That
would allow a small number of round trips to load the shape of the
tree, then a single readv to access it all (this is for the 'load the
whole thing please' case).
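Roughly, the batching step looks like coalescing many small page reads
into a few contiguous ranges before issuing one readv. This is only a
sketch of the idea; coalesce() and max_gap are illustrative, not
bzrlib's actual readv implementation:

```python
# Sketch: merge nearby (offset, length) read requests so one round trip
# covers several index pages. Hypothetical helper, not bzrlib code.
def coalesce(requests, max_gap=64):
    """Merge read requests whose gaps are at most max_gap bytes."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    return merged

# Three pages: the first two are near each other on disk, so one range
# covers both; the distant third needs its own range.
print(coalesce([(0, 100), (120, 50), (1000, 200)]))
```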

> I would strongly like the future inventory format to have an overall
> validator hash that can be stored in the inventory and is the same
> when the inventory is moved between repositories or even different
> formats.

I would like that too, but I'm not sure it's possible to reconcile this
desire with the performance goal of doing work during commit
proportional to the size of the change. Broader discussion below.

> One reason to have such a validator is to catch potential bugs in our
> repository implementations.  I would hope that we can explore more
> format improvements without going through the relatively risky
> operation of changing the hashed metadata itself.

I like this in principle; the problem is generating a new validator
without processing all the data that constitutes the current shape.

> At the moment we have deltas to inventories at two levels: logical
> deltas (apply_inventory_delta, generated by commit for example), and
> the text delta used to compress them for storage.  They're largely
> unconnected, and there's potentially an efficiency gain in putting
> them together.  The only case I know of where they were connected was
> the clever but hacky technique of identifying modified file ids by
> looking at the inventory weave/knit.

Which gave us a huge speed boost :).

> So I think the important constraint is to make use of that duality
> without overly constraining either the flexibility of this particular
> format, or what changes we can make in future.
> 
> For one example, we have looked at recompressing texts into different
> orders, or using xdelta.   Making the validator of the inventory
> depend on which predecessor it is compressed against limits this.
> Another case where that might bite is if we want to make a shallow
> branch, and give it (just) the full text of the first inventory on
> that branch.

I realise that you're raising a general point here, but I don't think
any of these examples actually make sense. For the first case -
different deltas - that makes sense when the inventory is modelled as
'a single text' delta-compressed by the storage layer. However, the
proposal I made was to model the inventory as 'a series of texts stored
by the storage layer'. In that context xdelta or any other compression
applies to each text, and can still be done. The second case I already
covered in the proposal I made, where the maximum size to copy would be
twice the size of a single inventory.
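To make that bound concrete, here is a toy model of the journalled
layout I have in mind: a full snapshot plus a journal of per-revision
additions, with a fresh snapshot written once the journal outgrows the
old one. The class and method names are hypothetical, not bzrlib's:

```python
# Toy sketch: reconstructing any inventory reads the latest snapshot
# plus the journal since it, which this policy keeps at most roughly
# twice the size of a single full inventory.
class JournalledInventory:
    def __init__(self, snapshot_text):
        self.snapshot = snapshot_text
        self.journal = []

    def add_revision(self, entry_text, full_text):
        journal_size = sum(len(e) for e in self.journal) + len(entry_text)
        if journal_size > len(self.snapshot):
            # Journal has outgrown the snapshot: re-snapshot and reset.
            self.snapshot = full_text
            self.journal = []
        else:
            self.journal.append(entry_text)

    def bytes_to_read(self):
        # Cost of reconstructing the latest inventory.
        return len(self.snapshot) + sum(len(e) for e in self.journal)
```

Because an entry is only appended while the journal still fits inside
the snapshot's size, bytes_to_read() never exceeds twice a single full
inventory.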

> So, I'd like the validator to be a hash of the full (non-delta'd) form
> of the inventory.  Preferably, but not necessarily, just a plain hash
> of the text.  But it could be a nested hash of various subsections.
> It should be something where you'll get the same result however it is
> stored.

This means at a minimum:
 - defining a canonical form for the inventory
 - loading that entire canonical form into memory during commit
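A sketch of what such a canonical-form validator could look like - the
record fields and function name are illustrative, not bzrlib's actual
inventory schema:

```python
import hashlib

# Hypothetical canonical form: one sorted line per inventory entry.
def inventory_validator(entries):
    """Hash the whole inventory in a storage-independent canonical form.

    Sorting by file_id makes the validator identical however (and in
    whatever order) the repository happened to store the entries.
    """
    lines = sorted(
        "%s %s %s %s\n" % (file_id, kind, parent_id, name)
        for (file_id, kind, parent_id, name) in entries
    )
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()

inv = [
    ("file-a", "file", "root-id", "a.txt"),
    ("dir-b", "directory", "root-id", "b"),
]
# Same validator regardless of the order entries arrive in:
assert inventory_validator(inv) == inventory_validator(list(reversed(inv)))
```

Note the cost point: every entry is serialised and hashed, so commit
does work proportional to tree size for the validator alone.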

I think there are four representations we have discussed to date:
current:
 - a single file for one version
split inventories:
 - split by directory, from that same version's data
 - split by radix tree or b+tree, from that same version's data
journalled inventories:
 - split by introducing version's data

> If you define a inventory as a particular ordered line-per-entry
> representation they can have stable hashes for their full form.  You
> can then have a serialized inventory delta which can be parsed and
> generated directly, but is also suitable for regenerating the whole
> text.  It does mean you'd need to generate the whole inventory to
> validate it.

Perhaps I have gotten the wrong end of the stick. When you say you want
this whole-inventory validator, I assume it is something we'd want to
put in the revision object. That means generating the whole inventory
during commit - currently 13% of commit's time, and our serialiser is
already pretty good.
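To make the cost concrete, here is a toy model of the
delta-then-validate flow: applying the delta is proportional to the
change, but producing the validator forces serialising and hashing
every entry. apply_delta and the record shape are illustrative, not
apply_inventory_delta's real signature:

```python
import hashlib

# Toy inventory: {file_id: canonical_line}. A delta removes some ids
# and adds/replaces others -- work proportional to the change.
def apply_delta(inventory, removed_ids, added_lines):
    new = dict(inventory)
    for file_id in removed_ids:
        del new[file_id]
    new.update(added_lines)
    return new

def full_text_validator(inventory):
    # Validating hashes every line, so commit pays O(tree size) here
    # even when the delta touched a single file.
    text = "".join(inventory[fid] for fid in sorted(inventory))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

base = {"id-a": "id-a file a.txt\n", "id-b": "id-b file b.txt\n"}
new = apply_delta(base, ["id-b"], {"id-c": "id-c file c.txt\n"})
validator = full_text_validator(new)
```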

> The other issue in this area is quickly determining which directories
> have been affected, or whether they are still the same, but that can
> potentially be done by looking through the deltas.

Ack.

> If you have two 'cousin' inventories with an ancestor in common
> determining the differences between them is somewhat more complicated
> than with split-by-directory...

True.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.