[RFC]New style of revision id [Was Re: VCS comparison table]
Goffredo Baroncelli
kreijack at alice.it
Tue Oct 24 18:24:29 BST 2006
On Tuesday 24 October 2006 18:15, you (John Arbash Meinel) wrote:
> Martin Pool wrote:
> > On Tue, 2006-10-24 at 00:17 +0200, Goffredo Baroncelli wrote:
[...]
> >> From a bazaar developer point of view I think it is possible to switch
> >> from a pseudo-random revision id to a checksum-based revision id: the
> >> checksum can be computed on the basis of the sha1 of the files, and
> >> timestamp/committer/parent-revision(s)/properties.
[...]
> >
> > I think storing or naming those objects by their hash is a pretty
> > interesting idea, and I've warmed to it after this discussion. As you
> > say it should pretty much drop in to the existing framework.
> >
>
> The biggest problem I see is that you can't know your final revision id
> until you have done all the work. Which is a place where hg has lots of
> issues with their indexes.
>
> Either pre-compute all of the work, figure out your final hash, and then
> go back and start writing, or you write as you go, but then have to go
> back at the end, and rewrite the indexes to include the correct revision id.
I prototyped the idea. The logic should be:
Compute the checksum (whichever you want: md5/sha1/foo-bar) on the basis of
the following information:
1) committer
2) comment
3) timestamp/timezone
4) revision properties
5) parent revision-id [*]
6) a list of
- file ( path, file-id, checksum, executable and/or other information )
- directory ( path, file-id, executable and/or other information )
- link ( path, file-id, target and/or other information )
The checksum computation is the same as that of the testament. The trick is
that you have to compute the list not from the inventory (which isn't
available yet) but from the working tree ( bazaar knows which files will be
in the revision ).
After you have collected all the information you can compute the checksum,
then create the inventory/revision.
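A minimal sketch of the computation described above, in Python. This is not the actual Testament format: the function names, the field labels, and the length-prefixed serialization are assumptions made for illustration, and the per-entry information is reduced to one `detail` value (content sha1 for a file, target for a link, empty for a directory).

```python
import hashlib


def _field(name, value):
    # Length-prefix every value so field boundaries cannot be confused.
    raw = str(value).encode("utf-8")
    return b"%s %d\n%s\n" % (name.encode("ascii"), len(raw), raw)


def compute_revision_checksum(committer, comment, timestamp, timezone,
                              properties, parent_ids, entries):
    """Hash the commit metadata plus a listing taken from the working tree.

    entries: iterable of (kind, path, file_id, detail) tuples
    (hypothetical layout, see the note above).
    """
    h = hashlib.sha1()
    h.update(_field("committer", committer))
    h.update(_field("comment", comment))
    h.update(_field("timestamp", timestamp))
    h.update(_field("timezone", timezone))
    # Sort properties so the result does not depend on dict ordering.
    for key in sorted(properties):
        h.update(_field("property", key))
        h.update(_field("value", properties[key]))
    # Parent order is kept as given: it is part of the revision.
    for parent in parent_ids:
        h.update(_field("parent", parent))
    # Sort entries by path so the listing order is canonical.
    for kind, path, file_id, detail in sorted(entries, key=lambda e: e[1]):
        h.update(_field("kind", kind))
        h.update(_field("path", path))
        h.update(_field("file-id", file_id))
        h.update(_field("detail", detail))
    return h.hexdigest()
```

The returned hex digest could then be used as the revision id when the inventory and revision are created.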
The main issue is that you have to read every changed file twice:
1) the first time to compute the checksum
2) the second time to insert it into the knit/weave ( which I think is the
operation that takes more time )
>
> And since the filesystem *might* be changing as you go along, you have
> to store a pristine copy of any files you would be adding to the repository.
Sorry, I don't understand the concern about files changing: yes, I know that
a file *can* change during a commit that takes some time; but that can happen
even with other kinds of revision id (for example during the checksum
computation), and if that happens the commit has to fail in any case.
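As an illustration of "the commit has to fail", here is a hedged Python sketch of a two-pass commit that aborts when a file changes between the checksum pass and the storage pass. The names `first_pass`, `second_pass`, and `FileChangedDuringCommit` are hypothetical, and a plain dict stands in for the knit/weave store.

```python
import hashlib


class FileChangedDuringCommit(Exception):
    """Raised when a file's content differs between the two passes."""


def first_pass(paths):
    # Pass 1: read every file once, only to compute its checksum.
    checksums = {}
    for path in paths:
        with open(path, "rb") as f:
            checksums[path] = hashlib.sha1(f.read()).hexdigest()
    return checksums


def second_pass(paths, checksums, store):
    # Pass 2: read every file again to store it, and check that the
    # content still matches the checksum taken in pass 1.
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        if hashlib.sha1(data).hexdigest() != checksums[path]:
            raise FileChangedDuringCommit(path)
        store[path] = data
```

Between the two calls the revision id would be computed from the checksums; if any file is modified in that window, `second_pass` raises and nothing inconsistent is recorded under that id.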
> I'm not sure if git stores any back pointers (pointer from the file text
> up into what manifest it would be associated with).
>
> hg stores them by just storing an index, and it always just stores the
> *next* index because of how it is laid out. Which works most of the
> time, but causes problems if you ever have a repeated hash, because now
> you have 1 thing that should be pointing at 2 different revisions.
>
> My best guess is that git always has a top down approach. So to do the
> log of changes for a file you have to unpack the manifest and see if
> that file is listed. Rather than reading the index for the file changes,
> and going back to the commit information from there.
>
> Having arbitrary revision ids means you can have a handle before you
> start doing any committing, and have it apply cleanly all the way out.
>
> So unfortunately, it isn't a simple drop-in replacement. We could
> possibly have a look-aside naming scheme. So that after a revision has
> been committed, we compute the hash, and then it can be accessed by
> either name.
I know that every entry in the weave/knit/inventory is based on the
revision-id; but nothing prevents the creation of those structures *after*
the creation of the revision ID.
> Also, git hashes include the hashes for their parents, which means that
> you need an unbroken chain back to the NULL revision. In other words,
> you can't have ghosts. Or at least, no ghosts whose hash you don't know.
> (Though you wouldn't know what handle to give them if you didn't know
> their hash).
True, if you want to trust the full history you have to use the hashes from
the beginning. But even when that is not the case, a hash signs the history
back to the last ( or first ) revision-id without a hash.
For example
A->B->C->D
where
A,B -> the revision id(s) are pseudo-random
C,D -> the revision id(s) are a hash of the tree content
If you know the revision-id of D, then starting from the revision id of B you
can compute the chain of revision id(s) up to D, and compare the computed
revision-id of D with your own ( received by email, for example ).
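The check on the A->B->C->D example can be sketched as follows. This is a simplified model, not Bazaar code: each hashed revision id is assumed to be the sha1 of its parent id plus a digest of its tree content, and `hashed_id` and `verify_chain` are hypothetical names.

```python
import hashlib


def hashed_id(parent_id, tree_digest):
    # Simplified model: the id covers the parent id and the tree content.
    return hashlib.sha1(f"{parent_id}\n{tree_digest}".encode("utf-8")).hexdigest()


def verify_chain(start_id, tree_digests, expected_tip_id):
    """Recompute ids from a starting revision and compare the tip.

    start_id: id of the last revision without a hash (B in the example).
    tree_digests: digests of the following revisions, oldest first (C, D).
    expected_tip_id: the id of D obtained out of band, e.g. by email.
    """
    current = start_id
    for digest in tree_digests:
        current = hashed_id(current, digest)
    return current == expected_tip_id
```

Under this model, a change to the content of C or D, or a different starting revision B, gives a different id for D, so the comparison fails. Nothing before B is covered by the check.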
>
> I realize ghost support isn't something super critical, and it may be
> something worth getting rid of, in exchange for the hash security. But
> this came up a long time ago when we were being strict about the sha1
> value in the Revision texts, and we were discussing whether they should
> include the parent references or not.
>
> We got rid of them because you cannot change the serialization without
> affecting the final values. Which is why we went for Testaments.
>
> So in 'git', if you ever tried a different algorithm for laying out your
> meta information (manifest, inventory, what have you), suddenly all of
> your "revision ids" change. And my new format git branch can't talk
> (well) to your git branch. There are some compatibility possibilities,
> but with git, it has to be 100% correct from the start, because an
> upgrade is going to potentially break lots of stuff.
I think that it should be sufficient to use a different prefix for the hash,
so that a new version of the software can apply the new verify function.
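A hedged sketch of the prefix idea in Python: the revision id carries a scheme prefix, and the verifier is chosen from it. The scheme names `sha1v1` and `sha256v2`, the `:` separator, and the function names are all assumptions for illustration.

```python
import hashlib

# Hypothetical scheme names mapped to the function that recomputes the digest.
VERIFIERS = {
    "sha1v1": lambda payload: hashlib.sha1(payload).hexdigest(),
    "sha256v2": lambda payload: hashlib.sha256(payload).hexdigest(),
}


def make_revision_id(scheme, payload):
    # The prefix records which verify function applies to this id.
    return f"{scheme}:{VERIFIERS[scheme](payload)}"


def verify_revision_id(revision_id, payload):
    scheme, sep, digest = revision_id.partition(":")
    if not sep or scheme not in VERIFIERS:
        # Unknown or missing prefix (e.g. an old pseudo-random id):
        # the id cannot be verified, which is different from "invalid".
        return None
    return VERIFIERS[scheme](payload) == digest
```

Old ids keep verifying with the function named by their prefix, while newer software can add an entry to the table for a new serialization or hash.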
>
> Now maybe git did get it all right. It is possible. Though I'm wondering
> if there are people wishing for feature X, but it just isn't possible
> without breaking stuff. (And further, it isn't something Linus needs, so
> it won't go into *his* workflow...)
>
> Anyway, content addressable namespaces do have some neat stuff, but I'm
> not convinced that they are the perfect solution.
>
> John
> =:->
Goffredo
--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack at inwind.it>
Key fingerprint = CE3C 7E01 6782 30A3 5B87 87C0 BB86 505C 6B2A CFF9