Signing snapshots

Fri Jun 24 05:21:24 BST 2005

On 22 Jun 2005, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> John A Meinel wrote:
> >> No, the way I'd do it is by not signing the inventory file-- sign the
> >> inventory data instead.  As a straw man, you'd sort it by unicode
> >> codepoint, then write out a space-delimited inventory summary with id,
> >> name, parent, type and contents-hash(if applicable) fields for each
> >> entry.  The format doesn't need to be parseable, just unique for each
> >> tree.
> >>
> > Actually, you are missing an important point. What algorithm is used to
> > generate "contents-hash" if not a hash function? Which means that if you
> > upgrade your hash algorithm, suddenly all of those "contents-hash"
> > entries change, and you need a new signature.
> 
> No, I understand that contents-hash is generated by a hash function, and
> what that implies.
> 
> What I'm talking about is the comparability of two trees.  It would be
> nice if you could say 'if the tree's SHA-1 hash is not $foo, it is not a
> true copy of this tree'.
> 
> > This really isn't any different from signing the <inventory> XML. The
> > only trick is that you would want to be careful that the <inventory>
> > tree would always be sorted in a specific way.
> 
> Actually, you need to normalize the data, not merely sort it.  We can't
> have whitespace changes or use of entity references affecting the
> results.  XML is tricky to normalize, so the strawman is about a format
> that's got only one form and needs no normalization.
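
For concreteness, here is a rough sketch of what Aaron's straw man
might look like.  The function name and the tuple layout are made up
for illustration, and it assumes no field contains a space or a
newline; a real version would need some escaping rule to stay unique.

```python
import hashlib

def tree_summary_sha1(entries):
    """Hash a space-delimited summary of an inventory.

    entries: iterable of (file_id, name, parent_id, kind, contents_hash)
    tuples (a hypothetical layout); contents_hash is None where it does
    not apply, e.g. for directories.
    """
    lines = []
    # Python compares str values by Unicode code point, so sorting on
    # file_id gives one fixed order whatever order the entries arrive in.
    for file_id, name, parent_id, kind, contents_hash in sorted(
            entries, key=lambda entry: entry[0]):
        fields = [file_id, name, parent_id, kind]
        if contents_hash is not None:
            fields.append(contents_hash)
        lines.append(" ".join(fields))
    data = "\n".join(lines) + "\n"
    return hashlib.sha1(data.encode("utf-8")).hexdigest()
```

Two trees with the same entries then give the same 40-character hex
digest regardless of how the inventory happened to be stored.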

Robert pointed out Canonical XML (no relation!).  I think it would be
fairly straightforward to write this out.  It should mean that any
object, written out in a particular format version, always generates
the same byte stream and therefore the same hash.

  http://www.jclark.com/xml/canonxml.html

It seems like it would be straightforward to change ElementTree to write
out the canonical form, and it'd probably be a good idea to write this
form even if we don't rely on it yet.  It would probably make sense to
stop adding decorative whitespace.
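
As a rough illustration of the idea, not the code we have:  recent
Python releases (3.8 and later) ship xml.etree.ElementTree.canonicalize(),
which implements W3C C14N 2.0 rather than the James Clark form linked
above.  The element and attribute names below are invented for the
example.

```python
import hashlib
import xml.etree.ElementTree as ET

def canonical_sha1(xml_text):
    # strip_text=True drops whitespace around text content, which
    # removes purely decorative indentation between elements.
    canonical = ET.canonicalize(xml_text, strip_text=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Same data, written two ways: different attribute order, quoting,
# whitespace, character references and empty-element syntax.
compact = '<inventory><entry file_id="a-1" name="R&amp;D.txt" kind="file"/></inventory>'
pretty = """<inventory>
  <entry kind="file"   name="R&#38;D.txt" file_id='a-1'></entry>
</inventory>"""

assert canonical_sha1(compact) == canonical_sha1(pretty)
```

Both spellings reduce to the same byte stream, so a signature over the
hash does not depend on how the file was laid out.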

Going one step further, we could read in an object stored in format
version N, then calculate its hash as represented in an older format
M < N, which would let a signature made over the older form still be
checked.
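
A minimal sketch of that, with two invented toy formats standing in for
real ones (none of these names exist in the code).  It only works if
format N still carries everything format M recorded.

```python
import hashlib

def serialize_v1(inv):
    # Toy format 1: one "file_id name" line per entry, sorted by file_id.
    return "".join("%s %s\n" % (fid, name) for fid, name in sorted(inv.items()))

def serialize_v2(inv):
    # Toy format 2: same entries, plus a header line.
    return "format 2\n" + serialize_v1(inv)

def parse_v2(text):
    lines = text.splitlines()
    if lines[0] != "format 2":
        raise ValueError("not a format 2 inventory")
    return dict(line.split(" ", 1) for line in lines[1:])

SERIALIZERS = {1: serialize_v1, 2: serialize_v2}

def hash_as_format(inv, version):
    # Re-serialize the in-memory object in the requested format version
    # and hash those bytes, whatever format it was read from.
    data = SERIALIZERS[version](inv)
    return hashlib.sha1(data.encode("utf-8")).hexdigest()

inv = {"a-id": "a.txt", "b-id": "b.txt"}
signed_hash = hash_as_format(inv, 1)      # what an old signature covers
stored = serialize_v2(inv)                # later rewritten as format 2
assert hash_as_format(parse_v2(stored), 1) == signed_hash
```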

> > Revfiles don't use a text_id, though you could arguably generate a
> > text_id since it is necessary for the plain file storage, and just not
> > make use of it in a revfile storage.
> 
> Eh?  Are you sure revfiles don't use text_ids?  How else do you refer to
> the file contents in a text store?

The current draft code uses the text's SHA-1 to identify it.  The
options are:

  SHA-1
  text_id
  revision_id of the revision in which the text was created
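
To show what the first option amounts to, here is an in-memory sketch
of a store keyed by SHA-1.  It is not the draft code; the class and
method names are invented.

```python
import hashlib

class Sha1TextStore:
    """Toy text store in which a text's key is the SHA-1 of its bytes."""

    def __init__(self):
        self._texts = {}

    def add(self, text_bytes):
        key = hashlib.sha1(text_bytes).hexdigest()
        self._texts[key] = text_bytes   # identical texts share one slot
        return key

    def get(self, key):
        text = self._texts[key]
        # The key doubles as an integrity check on what comes back.
        if hashlib.sha1(text).hexdigest() != key:
            raise ValueError("stored text does not match its key")
        return text
```

With text_id or revision_id as the key, the store would need to record
the hash separately to get the same check.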

-- 
Martin