journalled inventories

Thu Oct 25 15:49:12 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
...
> 
>> So, I'd like the validator to be a hash of the full (non-delta'd) form
>> of the inventory.  Preferably, but not necessarily, just a plain hash
>> of the text.  But it could be a nested hash of various subsections.
>> It should be something where you'll get the same result however it is
>> stored.
> 
> This means at a minimum:
>  - defining a canonical form for the inventory
>  - loading that entire canonical form into memory during commit

I agree we need a canonical form, but I don't agree that we have to load the
entire inventory.

Consider the "split-by-directory" approach. We don't actually have to store it
split-up on disk. But if our validator is recursive by directory, you can
validate portions of it.

We could easily combine your journaled approach with one that uses the
per-directory validator. Then at the top-level we have a global validator. And
to produce the validator for any directory just requires the basis for that
directory (but not everything underneath that directory) plus the delta.

This means that you cannot create a single long text, run sha1 over it, and get
the validator. But it can be written as one single long text, you just have to
parse it into sections to compute the separate validators.

> 
> I think there are four representations we have discussed to date:
> current:
>  - single file for one version
> split inventories:
>  - split by directory from that same version's data
>  - split by radix tree or b+tree, that same versions data
> journalled inventories:
>  - split by introducing versions data
> 

>> If you define a inventory as a particular ordered line-per-entry
>> representation they can have stable hashes for their full form.  You
>> can then have a serialized inventory delta which can be parsed and
>> generated directly, but is also suitable for regenerating the whole
>> text.  It does mean you'd need to generate the whole inventory to
>> validate it.
> 
> Perhaps I have gotten the wrong end of the stick. When you say you want
> this whole inventory validator, I assume it is something we'd want to
> stick in the revision object. That means generating the whole inventory
> during commit - currently 13% of commits time, and our serialiser is
> pretty good.

Again, see above. A recursive definition means that you can create a validator
which involves the entire inventory, without having to build the whole
inventory. (You just re-use the components that haven't changed.)

John
=:->

> 
>> The other issue in this area is quickly determining which directories
>> have been affected, or whether they are still the same, but that can
>> potentially be done by looking through the deltas.
> 
> Ack.

If we do it this way, we just need to remember that we have had lots of
problems with "deleted" and "added" entries. (Is there a record here because it
was modified from the basis, or because it was newly added. A record was
deleted here, was it just renamed elsewhere, or is it actually removed.) We can
solve some of this by writing reversible deltas (unidiff). But then you end up
storing "this row was X and now it is Y" which doubles what you have to write.
Though O(2*changes) is still O(changes).

> 
>> If you have two 'cousin' inventories with an ancestor in common
>> determining the differences between them is somewhat more complicated
>> than with split-by-directory...
> 
> True.
> 
> -Rob
> 

I don't think Robert's proposal and recursive validators are incompatible. And
it does allow all the nice comparisons we were hoping to get.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHIKznJdeBCYSNAAMRAtteAKCiexPVKPGfoupUDtkJRAmHfd5EFACdGeAH
7J0QTPeLVu48t3klqOZdr+I=
=REh7
-----END PGP SIGNATURE-----