[RFC]New style of revision id [Was Re: VCS comparison table]

Tue Oct 24 18:24:29 BST 2006

On Tuesday 24 October 2006 18:15, you (John Arbash Meinel) wrote:
> Martin Pool wrote:
> > On Tue, 2006-10-24 at 00:17 +0200, Goffredo Baroncelli wrote:
[...]
> >> From a bazaar developer point of view I think it is possible to switch
> >> from a pseudo-random revision id to a checksum-based revision id: the
> >> checksum can be computed on the basis of the sha1 of the files, and
> >> timestamp/committer/parent-revision(s)/properties.
[...]
> > 
> > I think storing or naming those objects by their hash is a pretty
> > interesting idea, and I've warmed to it after this discussion.  As you
> > say it should pretty much drop in to the existing framework.
> > 
> 
> The biggest problem I see is that you can't know your final revision id
> until you have done all the work. Which is a place where hg has lots of
> issues with their indexes.
> 
> Either pre-compute all of the work, figure out your final hash, and then
> go back and start writing, or you write as you go, but then have to go
> back at the end, and rewrite the indexes to include the correct revision id.

I prototyped the idea. The logic should be:

Compute the checksum (whichever you want: md5/sha1/foo-bar) on the basis of
the following information:

1) committer
2) comment
3) timestamp/timezone
4) revision properties
5) parent revision-id [*]
6) a list of 
   - file ( path, file-id, checksum, executable and/or other information )
   - directory ( path, file-id, executable and/or other information )
   - link ( path, file-id, target and/or other information )

The checksum computation is the same as for the testament. The trick is that
you have to build the list not from the inventory (which isn't available
yet) but from the working tree (bazaar knows which files will be in the
revision).

After you have collected all the information you can compute the checksum,
and then create the inventory/revision.
The main issue is that you have to read every changed file twice:
1) the first time to compute the checksum
2) the second time to insert it into the knit/weave (which I think is the
operation that takes more time)
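
The steps above can be sketched roughly as follows. This is not bzrlib code:
Entry, sha1_of_bytes and compute_revision_id are hypothetical names, and the
serialization is only an assumption for illustration (the real testament
format is different). The entries are meant to come from the working tree,
so each changed file has already been read once to get its sha1.

```python
import hashlib
from collections import namedtuple

# Hypothetical record for one working-tree entry (not a bzrlib class).
# kind is "file", "directory" or "link". detail holds the content sha1 for
# a file, the target for a link, and an empty string for a directory.
Entry = namedtuple("Entry", "kind path file_id detail executable")


def sha1_of_bytes(data):
    """Return the hex sha1 of a bytes object (first read of a file)."""
    return hashlib.sha1(data).hexdigest()


def compute_revision_id(committer, comment, timestamp, timezone,
                        properties, parent_ids, entries):
    """Derive a revision id from the commit metadata and the tree entries.

    The line-based layout below is an assumption made for this sketch.
    """
    lines = [
        "committer: %s" % committer,
        "timestamp: %d" % timestamp,
        "timezone: %d" % timezone,
    ]
    # Parent order is kept as given, since it is part of the history.
    for parent in parent_ids:
        lines.append("parent: %s" % parent)
    # Properties and entries are sorted so the result does not depend on
    # the order in which they were collected.
    for key in sorted(properties):
        lines.append("property: %s=%s" % (key, properties[key]))
    lines.append("message: %r" % comment)
    for e in sorted(entries, key=lambda entry: entry.path):
        flag = "yes" if e.executable else "no"
        lines.append("%s %s %s %s %s"
                     % (e.kind, e.path, e.file_id, e.detail, flag))
    text = "\n".join(lines) + "\n"
    return "sha1:" + hashlib.sha1(text.encode("utf-8")).hexdigest()
```

Only after compute_revision_id() returns would the inventory and the
knit/weave entries be written, which is where the second read of each file
happens.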

> 
> And since the filesystem *might* be changing as you go along, you have
> to store a pristine copy of any files you would be adding to the repository.

Sorry, I don't understand why a changing file is a problem: yes, I know that
a file *can* change during a commit, which takes some time; but that can
happen with any other kind of revision id too (for example during the
checksum computation), and if it happens the commit has to fail in any case.
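
As a sketch of "the commit has to fail": on the second read, the content can
be checked against the sha1 taken on the first read. The names here
(FileChangedDuringCommit, read_checked, and the read_file callable) are
hypothetical and not part of bzrlib.

```python
import hashlib


class FileChangedDuringCommit(Exception):
    """Raised when a file's content differs between the two reads."""


def read_checked(read_file, path, expected_sha1):
    """Read path again and make sure it still matches the first read.

    read_file is any callable taking a path and returning bytes.
    """
    data = read_file(path)
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        # The file changed after the checksum was computed: abort.
        raise FileChangedDuringCommit(path)
    return data
```

With this check no pristine copy of the file is kept; the price is that the
commit aborts instead of recording whatever was on disk at the first read.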

> I'm not sure if git stores any back pointers (pointer from the file text
> up into what manifest it would be associated with).
> 
> hg stores them by just storing an index, and it always just stores the
> *next* index because of how it is laid out. Which works most of the
> time, but causes problems if you ever have a repeated hash, because now
> you have 1 thing that should be pointing at 2 different revisions.
> 
> My best guess is that git always has a top down approach. So to do the
> log of changes for a file you have to unpack the manifest and see if
> that file is listed. Rather than reading the index for the file changes,
> and going back to the commit information from there.
> 
> Having arbitrary revision ids means you can have a handle before you
> start doing any committing, and have it apply cleanly all the way out.
> 
> So unfortunately, it isn't a simple drop-in replacement. We could
> possibly have a look-aside naming scheme. So that after a revision has
> been committed, we compute the hash, and then it can be accessed by
> either name.

I know that every entry in the weave/knit/inventory is keyed on the
revision-id; but nothing prevents creating those structures *after* the
revision id has been computed.

> Also, git hashes include the hashes for their parents, which means that
> you need an unbroken chain back to the NULL revision. In other words,
> you can't have ghosts. Or at least, no ghosts whose hash you don't know.
> (Though you wouldn't know what handle to give them if you didn't know
> their hash).

True, if you want to trust the full history you have to use the hashes from
the beginning. But even when that is not the case, a hash signs the history
back to the last (or first) revision-id without a hash.

For example

A->B->C->D

where 
A,B -> the revision id(s) are pseudo-random
C,D -> the revision id(s) are a hash of the tree content

If you know the revision-id of D, then starting from the revision id of B
you can compute the chain of revision id(s) up to D, and compare the
computed revision-id of D with your own (received by email, for example).
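
A small sketch of that check, under the simplifying assumption that a
hash-based revision id depends only on its single parent id and a digest of
the tree content (the real input would also include committer, message and
so on). hash_revision_id and verify_chain are hypothetical names.

```python
import hashlib


def hash_revision_id(parent_id, tree_digest):
    """Hash-based revision id built from the parent id and a tree digest."""
    text = "parent: %s\ntree: %s\n" % (parent_id, tree_digest)
    return "sha1:" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def verify_chain(start_id, tree_digests, expected_tip_id):
    """Recompute the ids from start_id onwards and compare the last one.

    start_id is the last revision id without a hash (B in the example);
    tree_digests lists the tree digests of C, D, ... in order.
    """
    current = start_id
    for digest in tree_digests:
        current = hash_revision_id(current, digest)
    return current == expected_tip_id
```

Here B is taken as given: the check says nothing about A and B themselves,
only that C and D follow from B and the recorded tree contents.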

> 
> I realize ghost support isn't something super critical, and it may be
> something worth getting rid of, in exchange for the hash security. But
> this came up a long time ago when we were being strict about the sha1
> value in the Revision texts, and we were discussing whether they should
> include the parent references or not.
> 
> We got rid of them because you cannot change the serialization without
> affecting the final values. Which is why we went for Testaments.
> 
> So in 'git', if you ever tried a different algorithm for laying out your
> meta information (manifest, inventory, what have you), suddenly all of
> your "revision ids" change. And my new format git branch can't talk
> (well) to your git branch. There are some compatibility possibilities,
> but with git, it has to be 100% correct from the start, because an
> upgrade is going to potentially break lots of stuff.

I think that it should be sufficient to use a different prefix for the hash,
so that a new version of the software can apply the new verify function.
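
Something like the following sketch, where the part of the revision id
before the ":" selects the verify function. The prefixes and the names
VERIFIERS and verify_revision_id are assumptions for illustration; a prefix
could just as well name a serialization format instead of a hash algorithm.

```python
import hashlib

# Hypothetical table: id prefix -> function computing the digest of the
# serialized revision text. A new scheme only adds a new entry here.
VERIFIERS = {
    "sha1": lambda text: hashlib.sha1(text).hexdigest(),
    "sha256": lambda text: hashlib.sha256(text).hexdigest(),
}


def verify_revision_id(revision_id, serialized_text):
    """Check a prefixed revision id against the serialized revision.

    Returns True or False for a known prefix, and None when the id has no
    known prefix (for example an old pseudo-random revision id).
    """
    prefix, sep, digest = revision_id.partition(":")
    if not sep or prefix not in VERIFIERS:
        return None
    return VERIFIERS[prefix](serialized_text) == digest
```

Old ids keep verifying with the old function, so changing the scheme does
not invalidate the revision ids that already exist.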

> 
> Now maybe git did get it all right. It is possible. Though I'm wondering
> if there are people wishing for feature X, but it just isn't possible
> without breaking stuff. (And further, it isn't something Linus needs, so
> it won't go into *his* workflow...)
> 
> Anyway, content addressable namespaces do have some neat stuff, but I'm
> not convinced that they are the perfect solution.
> 
> John
> =:->

Goffredo

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack at inwind.it>
Key fingerprint = CE3C 7E01 6782 30A3 5B87  87C0 BB86 505C 6B2A CFF9