journalled inventories

Thu Oct 25 07:15:35 BST 2007

Having potentially many round trips for a split-up inventory is a
concern for me too, and I'm happy to explore alternatives.

I would strongly like the future inventory format to have an overall
validator hash that can be stored in the inventory and is the same
when the inventory is moved between repositories or even different
formats.

One reason to have such a validator is to catch potential bugs in our
repository implementations.  I would hope that we can explore more
format improvements without going through the relatively risky
operation of changing the hashed metadata itself.

At the moment we have deltas to inventories at two levels: logical
deltas (apply_inventory_delta, generated by commit for example), and
the text delta used to compress them for storage.  They're largely
unconnected, and there's potentially an efficiency gain in putting
them together.  The only case I know of where they were connected was
the clever but hacky technique of identifying modified file ids by
looking at the inventory weave/knit.

So I think the important constraint is to make use of that duality
without overly constraining either the flexibility of this particular
format, or what changes we can make in future.

For example, we have looked at recompressing texts into different
orders, or using xdelta.  Making the validator of the inventory
depend on which predecessor it is compressed against limits this.
Another case where that might bite is if we want to make a shallow
branch, and give it (just) the full text of the first inventory on
that branch.

So, I'd like the validator to be a hash of the full (non-delta'd) form
of the inventory.  Preferably, but not necessarily, just a plain hash
of the text.  But it could be a nested hash of various subsections.
It should be something where you'll get the same result however it is
stored.
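
As a rough illustration of a storage-independent validator, here is a
minimal Python sketch.  The entry layout (file id mapped to parent id,
name, kind and text hash), the NUL separator and the function names
are all assumptions made for the example, not an existing format.

```python
import hashlib


def serialize_inventory(entries):
    """Return a canonical byte string for an inventory.

    `entries` maps file_id -> (parent_id, name, kind, text_sha1).
    This layout is hypothetical; the point is only that the output
    depends on the logical content, not on how it was stored.
    """
    lines = []
    for file_id in sorted(entries):  # fixed order, independent of storage
        parent_id, name, kind, text_sha1 = entries[file_id]
        fields = [file_id, parent_id, name, kind, text_sha1]
        lines.append("\x00".join(fields) + "\n")
    return "".join(lines).encode("utf-8")


def inventory_validator(entries):
    """Plain SHA-1 of the full (non-delta'd) serialized form."""
    return hashlib.sha1(serialize_inventory(entries)).hexdigest()
```

Because the hash is taken over the sorted full form, two repositories
that compress the same inventory against different predecessors would
still compute the same validator.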

If you define an inventory as a particular ordered line-per-entry
representation, it can have a stable hash for its full form.  You
can then have a serialized inventory delta which can be parsed and
generated directly, but is also suitable for regenerating the whole
text.  It does mean you'd need to generate the whole inventory to
validate it.
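
A small Python sketch of that idea follows.  It assumes a hypothetical
text form with one tab-separated line per entry, starting with the
file id, and a delta expressed as (file_id, new_line or None) pairs;
neither is a real serialization.

```python
import hashlib


def parse_lines(text):
    """Map file_id -> full line for a line-per-entry inventory text."""
    return {
        line.split("\t", 1)[0]: line
        for line in text.splitlines(keepends=True)
    }


def apply_delta(basis_text, delta):
    """Regenerate the full text from a basis text and a delta.

    `delta` is a list of (file_id, new_line) pairs, where new_line is
    None for a removed entry.  This shape is assumed for illustration.
    """
    lines = parse_lines(basis_text)
    for file_id, new_line in delta:
        if new_line is None:
            lines.pop(file_id, None)
        else:
            lines[file_id] = new_line
    # Emit in the fixed order so the full text, and its hash, is stable.
    return "".join(lines[file_id] for file_id in sorted(lines))


def validate(text, expected_sha1):
    """Check the regenerated full text against the stored validator."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest() == expected_sha1
```

The delta can be read and written on its own, but validation still
has to build the whole text first, which is the cost noted above.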

The other issue in this area is quickly determining which directories
have been affected, or whether they are still the same, but that can
potentially be done by looking through the deltas.
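
For instance, something along these lines might be enough.  This is
only a sketch: the delta here is a simplified list of
(old_path, new_path, file_id) tuples, which is an assumption for the
example, and only the immediate parent directory of each changed
path is reported.

```python
import posixpath


def affected_directories(delta):
    """Return the set of directories touched by a logical delta.

    `delta` is a list of (old_path, new_path, file_id) tuples, with
    None as old_path for an added entry and as new_path for a removed
    one.  The tree root is represented by the empty string.
    """
    dirs = set()
    for old_path, new_path, _file_id in delta:
        for path in (old_path, new_path):
            if path is not None:
                dirs.add(posixpath.dirname(path))
    return dirs
```

A directory that does not appear in the result for any delta between
two revisions can be treated as unchanged between them, at least as
far as its direct children are concerned.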

If you have two 'cousin' inventories with an ancestor in common,
determining the differences between them is somewhat more complicated
than with split-by-directory...

-- 
Martin