[design discussion] redundant data in the system; gpg signature stability;

Tue May 22 16:11:21 BST 2007

This may not make a lot of sense for folk not at the recent sprint; I
apologise for that, but I am mailing this largely to avoid forgetting it
- and am in the middle of a different project sprint right now.

On gpg signatures, I think it would be good to ensure that a tree with
*no metadata* in it can be verified the same as a bzr tree. E.g. if I
have a tarball made from 'bzr export', I should be able to generate
something that verifies this tree against a published bzr signature. I
don't think this is currently possible, but I think it would be a great
thing to have. One way to do this would be to have a hash-tree of the
whole contents without any metadata included in the clear-signed text.
This does imply having more than one hash-tree algorithm or some such
thing; and scaling concerns suggest maintaining the output of these
incrementally even though it's redundant data in many ways.
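
A minimal sketch of such a metadata-free hash, to make the idea concrete.
This is not something bzr provides; the function name and the exact
scheme (sorted relative paths, each paired with the sha1 of the file
bytes) are assumptions made up for illustration. It reads only names and
contents, so an exported tarball unpacked anywhere gives the same value.

```python
import hashlib
import os


def content_tree_hash(root):
    """Return a sha1 hex digest over the paths and file bytes under root.

    Hypothetical scheme: sorted relative paths, each paired with the sha1
    of the file contents. No timestamps, owners or bzr metadata are read.
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use '/' separators so the result is the same on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            entries.append((rel, digest))
    entries.sort()
    outer = hashlib.sha1()
    for rel, digest in entries:
        outer.update(("%s %s\n" % (digest, rel)).encode("utf-8"))
    return outer.hexdigest()
```

The resulting digest is the kind of value that could sit in the
clear-signed text alongside whatever metadata-aware hash is also kept.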

Secondly, Martin expressed a new design principle he'd like to introduce
- avoid the storage of redundant data in the 'core' - where core means
(as far as I can tell) the data that bzr must download to use; non-core
data is data that can be synthesised from the real data to generate
local caches etc. The argument for this is that redundant data can become
skewed from the core data due to bugs, hacking attacks, or improvements
in algorithms, so we should not trust it when it is supplied by third
parties, and we should be able to replace or regenerate it when we've been
keeping it for performance reasons. I think that tied into this is the
idea of using keys which depend on the content - e.g. sha1/sha256 etc -
to address some of the content of the system. Doing that provides a
strong pressure on the rest of the system to be inflexible: If you
address an item by a hash, it's not possible to change it without
changing its address, and thus anything that refers to it must change as
well. This is a good thing in other ways though: you have an emergent
defense against attackers - short of hash collisions they can only
attack the system where you stop talking in hash values and start
talking in structured data.
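
To illustrate both halves of that trade-off, here is a toy
content-addressed store. It is only a sketch: the `put` and `verify`
helpers and the one-line "directory" format are invented for the example
and are not bzr structures.

```python
import hashlib


def put(store, payload):
    """Store bytes under their own sha1 and return that key."""
    key = hashlib.sha1(payload).hexdigest()
    store[key] = payload
    return key


def verify(store, key):
    """True if the bytes stored under key still hash to key."""
    return hashlib.sha1(store[key]).hexdigest() == key


store = {}
leaf_v1 = put(store, b"hello\n")
parent_v1 = put(store, ("file.txt " + leaf_v1 + "\n").encode("ascii"))

# Changing the leaf changes its address, so the parent that names it
# must be rewritten too, and so on up to the root.
leaf_v2 = put(store, b"hello world\n")
parent_v2 = put(store, ("file.txt " + leaf_v2 + "\n").encode("ascii"))

assert leaf_v1 != leaf_v2
assert parent_v1 != parent_v2
```

The inflexibility is the cascade from leaf to parent; the defence is that
`verify` catches any stored object that no longer matches its own key.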

Now, Martin is away for a couple of weeks, so this conversation can't
really be held without him arguing his case :). I did want to get my
thoughts purged from my buffer though, so I'm going to do that now.

I think that allowing bzr to regenerate derivable data when more
information is available is a really good idea; we don't do that well
today (see annotations as an example). I also think that making some
forms of derivable data optional (again annotations are a good example)
can be very useful for performance or use-case specific tuning. On the
other hand, I think it's very hard to see where the exact line of
'derivable' crosses into 'derivable but sufficiently important to
consider core', and there is a related proposal to the new design
principle, which is to use hashes to address the internal nodes of the
bzr database.

Now the way I see it, it's hard to tie heterogeneous systems together
nicely with such representation-based addressing. Consider our foreign
branch support: I think it's fantastic what we're able to do; and our
ability to do it relies on our -not- using hashes to refer to content in
foreign systems. We currently use hashes to validate content, but not to
address content. And I think that this is extremely useful because it
allows representation to change without breaking gpg signatures or
system indexes.
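
A small sketch of "validate by hash, address by id", under loud
assumptions: the `Store` class, its methods, and the id string used in
the usage note are all hypothetical, and the real bzr storage layer looks
nothing like this. The point is only that the key is an opaque name,
while the sha1 is checked on the way out.

```python
import hashlib
import zlib


class Store:
    """Toy store: opaque ids address texts, sha1 only validates them."""

    def __init__(self):
        self._blobs = {}       # id -> (encoding, stored bytes)
        self._validators = {}  # id -> sha1 of the logical text

    def add(self, text_id, text):
        self._blobs[text_id] = ("plain", text)
        self._validators[text_id] = hashlib.sha1(text).hexdigest()

    def repack(self, text_id):
        """Change the stored representation; id and validator stay put."""
        text = self.get(text_id)
        self._blobs[text_id] = ("zlib", zlib.compress(text))

    def get(self, text_id):
        encoding, data = self._blobs[text_id]
        text = zlib.decompress(data) if encoding == "zlib" else data
        if hashlib.sha1(text).hexdigest() != self._validators[text_id]:
            raise ValueError("validator mismatch for %s" % text_id)
        return text
```

With a made-up id such as "foreign:example-repo:42", anything that refers
to the text by that id keeps working after `repack`, and `get` still
refuses content that fails its validator.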

The key thing about redundancy seems to be derivability and mutability.
Content that is mutable or derivable would seem to be non-core.

As an example of where it's hard to draw the line: clearly a user's file
text is immutable non-derivable data supplied by the user. (Wrong! If
line ending translations are in effect, then what the user has on disk
and what the repository has may be different but they may still be the
same file). So to ascertain this we need metadata from elsewhere in the
system in order to be able to decide how to hash the user's file. Other
things like encoding (UTF8 vs UTF16LE etc) or $Id$ replacements will
make this tricky right at the heart of the system, and I strongly
suspect that good history horizon support (by which I mean the ability
to annotate a file with a history horizon, while offline, by default)
will also make arguments about what is core and what is not, tricky at
best.
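
The line-ending case can be sketched as follows. The `eol_style` values
and both function names are assumptions for illustration, not an
existing bzr setting; what matters is that the hash cannot be computed
from the on-disk bytes alone.

```python
import hashlib


def to_canonical(disk_bytes, eol_style):
    """Map working-tree bytes to a hypothetical repository form."""
    if eol_style == "crlf":
        # The working tree uses CRLF; the repository form uses LF.
        return disk_bytes.replace(b"\r\n", b"\n")
    if eol_style == "exact":
        return disk_bytes
    raise ValueError("unknown eol style: %r" % (eol_style,))


def text_sha1(disk_bytes, eol_style):
    """Hash the canonical form, which depends on per-file metadata."""
    return hashlib.sha1(to_canonical(disk_bytes, eol_style)).hexdigest()
```

Here `text_sha1(b"a\r\nb\r\n", "crlf")` and `text_sha1(b"a\nb\n", "exact")`
agree, while hashing the CRLF bytes with "exact" gives a different value,
so the metadata decides whether two files count as the same.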

So perhaps we can:

 *) When we identify derivable data, mark it clearly as such in design
documents. Examine it with respect to use cases such as offline work
with history horizons, incremental push/pull, pull from hostile users
etc.
 *) Add validators for derived data that don't conflate with the
validators for non-derived data, e.g. sha sum a given representation of
annotations for a version of a file. Record that somewhere that is not
included in the validator for the version of the file, so that a
replaced version of the annotations does not invalidate the file text or
directory, but does allow the replaced annotation to be discarded (if
desired - a trusted source's replaced annotation could be used).
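
The second point might look roughly like this. It is a sketch only: the
`FileVersion` class and its field names are hypothetical, and annotations
are treated as an opaque byte string.

```python
import hashlib


def sha1(data):
    return hashlib.sha1(data).hexdigest()


class FileVersion:
    def __init__(self, text):
        self.text = text
        self.text_validator = sha1(text)   # covers core data only
        self.annotations = None
        self.annotation_validator = None   # kept apart from the above

    def set_annotations(self, annotations):
        """Record derived data; the text validator is not touched."""
        self.annotations = annotations
        self.annotation_validator = sha1(annotations)

    def check(self):
        """Return (text_ok, annotations_ok); bad annotations are dropped."""
        text_ok = sha1(self.text) == self.text_validator
        ann_ok = (self.annotations is not None
                  and sha1(self.annotations) == self.annotation_validator)
        if not ann_ok:
            # Derived data can be regenerated later, so discard it.
            self.annotations = None
            self.annotation_validator = None
        return text_ok, ann_ok
```

Replacing or corrupting the annotations never changes `text_validator`,
so a signature over the file text would be unaffected either way.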

One of the mistakes Arch made was gpg signing the representation. It's
impossible to validate a bzr tree that is converted from arch against
the arch gpg signatures; because you cannot recreate the tarball (the
datestamp in the tarball is not preserved). We should not make that
mistake: our gpg signatures, and indeed the data for a revision, should
not depend on data unless it will be preserved when the representation
of the database changes. Depending on derivable data is also bad because
it may change when a better algorithm comes along or a bug is fixed. So
we should probably only depend on:
 - data that is defined by the model not the representation.
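
The tarball problem is easy to reproduce. In this sketch (both function
names and the model hash scheme are assumptions for illustration), the
same files packed with a different datestamp give a different
representation hash, while a hash over names and contents does not move.

```python
import hashlib
import io
import tarfile


def tarball_sha1(files, mtime):
    """sha1 of an uncompressed tarball built from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime  # representation detail, not model data
            tar.addfile(info, io.BytesIO(files[name]))
    return hashlib.sha1(buf.getvalue()).hexdigest()


def model_sha1(files):
    """sha1 over names and contents only, ignoring any container format."""
    h = hashlib.sha1()
    for name in sorted(files):
        h.update(name.encode("utf-8") + b"\0")
        h.update(hashlib.sha1(files[name]).digest())
    return h.hexdigest()
```

A signature over something like `model_sha1` survives a conversion that
loses the datestamp; a signature over `tarball_sha1` does not.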

No conclusion here as such; I think you can use sha1s and say that you
can recreate a sha1-hashed version of a tree from a repository using
sha256 in the future, to validate an old signature. So this isn't a
strong reason to avoid using sha1s as indexes - I made my point about
that up above :).

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.