journalled inventories

Robert Collins robertc at robertcollins.net
Thu Oct 25 20:11:10 BST 2007


On Thu, 2007-10-25 at 09:49 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> ...
> > 
> >> So, I'd like the validator to be a hash of the full (non-delta'd) form
> >> of the inventory.  Preferably, but not necessarily, just a plain hash
> >> of the text.  But it could be a nested hash of various subsections.
> >> It should be something where you'll get the same result however it is
> >> stored.
> > 
> > This means at a minimum:
> >  - defining a canonical form for the inventory
> >  - loading that entire canonical form into memory during commit
> 
> I agree we need a canonical form, but I don't agree that we have to load the
> entire inventory.

'during commit' ;).

> Consider the "split-by-directory" approach. We don't actually have to store it
> split-up on disk. But if our validator is recursive by directory, you can
> validate portions of it.

Sure, I didn't claim we can't. But to create a validator for each
directory we need the bytes for that directory. The bytes for the top
directory depend on the bytes for the child directories, so we need to
create those first.

If we say 'ah, but if we don't have any changes to child X, we can use
the validator our previous commit had for child X', then we are using a
delta in the creation of the whole-inventory validator, and much the same
risks apply as in a form which is explicitly delta'd: a logic bug will
lead to a serialised inventory which does not represent the tree, even
though its hash validates.
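
To make that concrete, here is a minimal sketch of a split-by-directory
validator. The Entry/Directory structures and the serialisation are
invented for illustration (not the real bzrlib model); the point is that
the computation is necessarily bottom-up, and the only way to avoid
recursing into a child is to trust a 'child X is unchanged' claim, i.e. a
delta:

    import hashlib
    from collections import namedtuple

    # Hypothetical, simplified model -- not the real bzrlib classes.
    Entry = namedtuple('Entry', 'name file_id text_sha1')
    Directory = namedtuple('Directory', 'name entries subdirectories')

    def directory_validator(directory):
        """Compute the validator for one directory of a split inventory."""
        md = hashlib.sha1()
        # Hash this directory's own entries in a canonical (sorted) order.
        for entry in sorted(directory.entries):
            md.update(('%s %s %s\n' % entry).encode('utf-8'))
        # Fold in the validator of every child directory.  Those validators
        # must already exist, so the computation runs bottom-up; skipping an
        # 'unchanged' child means trusting a delta, which is the risk
        # described above.
        for child in sorted(directory.subdirectories, key=lambda d: d.name):
            md.update(directory_validator(child).encode('utf-8'))
        return md.hexdigest()

    # e.g. directory_validator(
    #          Directory('', [Entry('README', 'readme-id', 'abc123')], []))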

> We could easily combine your journaled approach with one that uses the
> per-directory validator. Then at the top-level we have a global validator. And
> to produce the validator for any directory just requires the basis for that
> directory (but not everything underneath that directory) plus the delta.
> 
> This means that you cannot create a single long text, run sha1 over it, and get
> the validator. But it can be written as one single long text, you just have to
> parse it into sections to compute the separate validators.

I think I'm missing something here. Are you saying 'you can journal
changes to a logical-split-by-directory inventory'? If so - sure. I
think it will require reserialisation of every directory that has
changes in it to generate the global validator, which is considerably
more expensive than the 'journalled inventory' proposal (assuming that
in both cases we're ok with using our delta logic to decide what we need
to reprocess).

This may be an interesting thing to try though. One thing to note, beyond
the computation requirements, is the model change: a
split-by-directory validator requires an additional field on every
directory to provide the nested validator, and we need to carry that up
into our model and serialise it back down, etc. So it's a larger change.

> > 
> > I think there are four representations we have discussed to date:
> > current:
> >  - single file for one version
> > split inventories:
> >  - split by directory from that same version's data
> >  - split by radix tree or b+tree, that same versions data
> > journalled inventories:
> >  - split by introducing versions data
> > 
> 
> 
> >> If you define a inventory as a particular ordered line-per-entry
> >> representation they can have stable hashes for their full form.  You
> >> can then have a serialized inventory delta which can be parsed and
> >> generated directly, but is also suitable for regenerating the whole
> >> text.  It does mean you'd need to generate the whole inventory to
> >> validate it.
> > 
> > Perhaps I have gotten the wrong end of the stick. When you say you want
> > this whole inventory validator, I assume it is something we'd want to
> > stick in the revision object. That means generating the whole inventory
> > during commit - currently 13% of commits time, and our serialiser is
> > pretty good.
> 
> Again, see above. A recursive definition means that you can create a validator
> which involves the entire inventory, without having to build the whole
> inventory. (You just re-use the components that haven't changed.)

Well, also, see above - a 'non delta'd validator' requires that we do
not reuse components that 'have not changed', because detecting whether
something has changed or not is delta'ing.


> 
> > 
> >> The other issue in this area is quickly determining which directories
> >> have been affected, or whether they are still the same, but that can
> >> potentially be done by looking through the deltas.
> > 
> > Ack.
> 
> If we do it this way, we just need to remember that we have had lots of
> problems with "deleted" and "added" entries. (Is there a record here because it
> was modified from the basis, or because it was newly added. A record was
> deleted here, was it just renamed elsewhere, or is it actually removed.) We can
> solve some of this by writing reversible deltas (unidiff). But then you end up
> storing "this row was X and now it is Y" which doubles what you have to write.
> Though O(2*changes) is still O(changes).

I'm confused by this: inventory deltas are a model-level change, not a
textual diff, and as such they don't depend on sequence matching at all.
A record is a move when OLDPATH and NEWPATH differ but neither is None;
an add when OLDPATH is None; a delete when NEWPATH is None.
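
In sketch form (a real delta row also carries the new inventory entry
itself; the trimmed signature here is just for illustration):

    def classify(old_path, new_path):
        """Classify one inventory delta row by its path pair."""
        if old_path is None and new_path is None:
            raise ValueError('row changes nothing')
        if old_path is None:
            return 'added'
        if new_path is None:
            return 'deleted'
        if old_path != new_path:
            return 'renamed'
        return 'modified in place'

    # classify(None, 'doc/README')      -> 'added'
    # classify('doc/README', None)      -> 'deleted'
    # classify('doc/README', 'README')  -> 'renamed'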

> > 
> >> If you have two 'cousin' inventories with an ancestor in common
> >> determining the differences between them is somewhat more complicated
> >> than with split-by-directory...
> > 
> > True.
> > 
> > -Rob
> > 
> 
> I don't think Robert's proposal and recursive validators are incompatible. And
> it does allow all the nice comparisons we were hoping to get.

There are a few things I'd like to point out here. Firstly, the reason we
get nice comparisons from recursive validators on the serialised form is
that we can read *part* of a document but still be confident that
section wasn't corrupted. (It's not about comparing validators between
different revisions - that's an optimisation which is open to
discussion.) So this basically gives us bisection and the ability to
pull a single record out easily.

But what we'd really like is nested 'last changed'/'child changed', which
is not quite the same: it can be done in a single-document model if we
want (where it saves CPU when performing the diff, but not IO), and it is
not something that is guaranteed by having nested validators.

With a journalled inventory, you always have to read all the journal
entries to parse it. A 'journalled inventory with recursive directory
validators' can only give you the validated content for a single
directory after processing all the data in the inventory.
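
A rough sketch of that read path (hypothetical structures; a journal
record here is (old_path, new_path, file_id, new_entry)):

    def reconstruct(basis_entries, journal):
        """Rebuild {file_id: entry} from a basis inventory plus its journal."""
        entries = dict(basis_entries)
        # Every journal record has to be applied before any directory's
        # content can be trusted -- there is no early exit.
        for _old_path, new_path, file_id, new_entry in journal:
            if new_path is None:
                del entries[file_id]
            else:
                entries[file_id] = new_entry
        return entries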

So we need to be clear about what we get by doing what.

Journalling buys:
 - serialise only the changed inventory entries at commit.
 - hash only the changed inventory entries plus the pointer-to-basis at
commit (see the sketch below).
Journalling does not change:
 - the 'must read all data' constraint: that is still imposed.
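
For the commit-side cost, a minimal sketch of what the journalled
validator could look like (the serialisation is invented; the point is
only that its inputs are the basis pointer plus the changed entries,
nothing more):

    import hashlib

    def journalled_validator(basis_validator, delta_lines):
        """Hash only the pointer-to-basis plus the serialised changed entries."""
        md = hashlib.sha1()
        md.update(('basis: %s\n' % basis_validator).encode('utf-8'))
        for line in sorted(delta_lines):  # canonical order for the journal
            md.update(line.encode('utf-8'))
        return md.hexdigest()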

Nested directory validators buy:
 - can read part of an inventory and determine its integrity, given the
full serialised content (e.g. if the content is one document, storage
deltas have to be applied; if the content is many documents, we have to
be able to access selected ones).
 - hash the directory of every change, recursively up to the root, at
commit (using deltas to tune the commit operation).
Nested directory validators do not change:
 - the need to serialise the entire inventory at commit if one avoids
using deltas.

There are multiple ways of combining nested directory validators and
journalling. If I were working on combining them I would probably go for
'each directory has a journal listing its changes'. This would mean that
renames into/out of a directory are listed twice overall. Each directory
would store the journalled directory validator for each child directory.

Finally, one thing I think is really important to shake out - I don't
think having a one-true-hash of an inventory generated during commit is
a sensible thing. I think the following rules make more sense:
 - a hash for the inventory in its current representation at commit
 - pushing/pulling that revision does not change its representation
within compatible repositories (and thus does not change the hash)
 - transforming the inventory into a different representation changes
the hash

So for a given repository format we might have a number of inventory
representations it can handle, and when one pulls into it from another
repository, if they can handle the same representation, the inventory is
preserved as-is, as is the hash. If they cannot (e.g. the repository is
a bzr-svn repository, or we require more data in this repository than
the source handles), then the inventory is regenerated and the new hash
stored.
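
As a rough sketch of that pull-time policy (every name here is invented
for illustration; none of this is existing bzrlib API):

    def pull_inventory(source_repo, target_repo, revision_id):
        rep = source_repo.inventory_representation(revision_id)
        if rep.kind in target_repo.supported_representations:
            # Compatible: keep the bytes, and therefore the hash, unchanged.
            target_repo.store_inventory(revision_id, rep.bytes, rep.hash)
        else:
            # Incompatible (e.g. bzr-svn, or the target needs richer data):
            # regenerate in the target's representation and store a new hash.
            inv = source_repo.get_inventory(revision_id)
            new_bytes = target_repo.serialise(inv)
            new_hash = target_repo.hash_inventory(new_bytes)
            target_repo.store_inventory(revision_id, new_bytes, new_hash)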

The reason I say this is that deciding on a one true hash means
stopping evolution; saying 'we're done'. But we know that software needs
to change, so I'd rather design for change *and* correctness. Being able
to hash all the way down and be 99.999999% confident you read the data
the author created is very important, but not important enough that we
want to still be reading xml7 inventories in 2 years' time.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.