What constitutes the "identity" of a changeset?

Fri Mar 28 21:12:23 GMT 2008

On Fri, 2008-03-28 at 20:41 +0000, Paul Moore wrote:
> >  The revision then ties this together with the tree state, and the
> >  revision metadata, including the parents, just as in Mercurial.
> 
> "Ties this together" how? In the case of Mercurial, the ID is a hash
> of the state/metadata - so I know that if the state/metadata are the
> same, the revision ID will be (and vice versa).

Essentially, in the Revision object. If you give me a revision
id and a branch (a repository, really), I can look up the
associated revision object and give you all of the information.
This is exactly the same as with Mercurial. Mercurial names the
revision with a hash, which is a one-way operation: given just
the name of a Mercurial revision, you can't tell me anything
else about it.

However, you are right that Mercurial does have the property
that two identical revisions will be named the same, which
isn't the case in bzr.

> 
> >  The difference is that mercurial derives the name from the data,
> >  bzr uses an arbitrary name and just associates it with the data.
> 
> There's the point, though. If revision identity is encapsulated in the
> ID, and the ID is arbitrary, how can I say if 2 revisions are
> identical. In reality, the ID *isn't* arbitrary, precisely because you
> can reason sanely about revision identity, but the rules aren't
> written up anywhere.
> 

The revisions are identical from bzr's point of view if, and
only if, they have the same id.

The contents of the revision could also be compared for equality
if you wanted to know more than that.

However, there is a difficulty in doing this: the revision
includes the parent ids, which are again just arbitrary names,
so you have to compare all the way back to the root.

Mercurial uses the properties of a hash to encapsulate all that
in the name of the revision.
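
To make that concrete, here is a minimal sketch of a content-derived
naming scheme. It is not Mercurial's actual on-disk format; the
hashed_id function and the "tree ..." byte strings are made up for
illustration. The point is only that hashing the parents' names along
with the content folds the whole ancestry into one name.

```python
import hashlib
from typing import Sequence


def hashed_id(content: bytes, parent_ids: Sequence[str]) -> str:
    """Hypothetical content-derived name: hash the parent names plus the content."""
    h = hashlib.sha1()
    for parent in sorted(parent_ids):
        h.update(parent.encode("ascii"))
        h.update(b"\n")
    h.update(content)
    return h.hexdigest()


# Build the history a -> b -> c -> d.
a = hashed_id(b"tree a", [])
b = hashed_id(b"tree b", [a])
c = hashed_id(b"tree c", [b])
d = hashed_id(b"tree d", [c])

# Same content and same ancestry always produce the same name.
assert hashed_id(b"tree c", [b]) == c

# Amending c yields c', and d' differs from d even though the content
# of d itself is unchanged, because its parent's name changed.
c2 = hashed_id(b"tree c, amended", [b])
d2 = hashed_id(b"tree d", [c2])
assert c2 != c and d2 != d
```

Comparing two names is then enough to compare two whole histories,
which is the property bzr's arbitrary names do not give you.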

> Let me give a concrete example, from a discussion that came up on the
> Mercurial list.
> 
> Take a branch, with a simple revision tree a -> b -> c -> d. Now, I
> want to modify revision c in the history (yes, this isn't possible as
> such, bear with me). Suppose I roll back to b, then reapply c with
> changes. The new revision *is not c*, precisely because of the changes
> - call it c'. Reapplying d gives a *new* revision d' - precisely
> because revision identity incorporates the parent, and the parent of
> d' is c' where the parent of d is c (assume there is no other change
> in d/d').
> 
> Now, if someone cloned my branch before I did this, they would have a
> -> b -> c -> d. If they pull from me, they get
> 
> a -> b -> c -> d
>        \> c' -> d'
> 
> and have to merge.
> 
> This isn't complicated logic, but the point is that I can reason like
> this, precisely because I know what affects the revision ID (and hence
> revision identity).
> 
> What I'm asking for is an explanation of how Bazaar handles revision
> IDs, so that I can make deductions like this.

The behaviour is exactly the same within bzr.

The part I think you are missing in your interpretation of bzr's
handling of this is that when you go back, change c and commit,
you will get a new revision, which will have a different revision
id from the original c. As they have different names, bzr will
consider them different, in the same way that Mercurial considers
them different due to their different names.

The difference is that the name changes in mercurial because the
data has changed, which is then reflected in the hash. In bzr the
name changes because bzr will generate a new arbitrary name for
the revision.

In bzrlib you can call WorkingTree.commit with a rev_id parameter
to assign that revision id to the generated commit. This means
that you could patch bzr to give your new c' revision the same
revision id as the original c revision. That would cause a
problem in the above scenario, because bzr would treat two
different revisions as identical.

Mercurial would have the same problem if you were really unlucky
and your new c' revision caused a hash collision in the hash function
they used with the old c revision, and so was given the same name.

The contract therefore is that revisions are immutable, or put another
way, if you have two revisions that are different they must be
given different names.

Mercurial fulfils this contract by using the hash naming scheme and
assumes there will be no collisions (which is a very safe assumption
to make).

Bazaar fulfils this contract by assigning revision ids based on the
committer, the date and a random number. Assuming that all developers
have a sensible system clock, this means the probability of a
collision remains constant over time, which is not a property of the
Mercurial scheme.
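
As a rough illustration of that kind of scheme, here is a sketch that
builds a name from the committer, the commit time and some random
bytes. The arbitrary_id function and its output format are assumptions
made for this example, not bzrlib's actual id generator.

```python
import os
import time


def arbitrary_id(committer_email: str, timestamp: float) -> str:
    """Hypothetical bzr-style name: committer, commit time and a random suffix."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(timestamp))
    suffix = os.urandom(8).hex()  # random part, independent of the content
    return f"{committer_email}-{stamp}-{suffix}"


now = 1206738743  # 2008-03-28 21:12:23 UTC
first = arbitrary_id("dev@example.com", now)
second = arbitrary_id("dev@example.com", now)

# The committer and time are identical, but the random suffix makes
# the two names differ (barring a 64-bit random collision).
assert first != second
```

Nothing about the revision's content goes into the name, so two
commits of identical data still end up with different ids.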

The difference is that when two revisions have the same data mercurial
will give them the same name, and bzr will perhaps give them the same
name, but is not guaranteed to.

It would be possible to make bzr use the hashed naming scheme, as
the arbitrary name that you give the revision could just happen
to be its hash for every revision.

> 
> It's not clear to me, for example, how bzr-svn would assign revision
> IDs for changes in (a) a remote Subversion repository and (b) a
> svncloned local mirror of the same repository. If bzr-svn can know
> that these 2 are "the same", then it can assign the same revision ID
> to subversion revision NNN in each. And that means that I can bzr
> branch from the local repository (for speed of the initial conversion)
> and then change to bzr pull-ing from the remote repository.
> (Experimentally, it appears that bzr-svn might not be able to match up
> like that).
> 

I believe Jelmer implied that it does this based on the repository's
UUID. As this is a UUID it can assume that if it is the same then
it is dealing with the same repository.

The problem with bzr's normal scheme for assigning names to revisions
is known as the "parallel imports" problem. If I were to take
all the tarballs ever released of the Linux kernel, import
each into bzr and commit it, I would have a representation of
the history of the kernel. Then if you were to do exactly the same
steps you would also have a representation of the same history.

However, using bzr's default scheme bzr would not consider these
two branches identical, i.e. they would have different names for
the last revision.
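
The situation can be sketched as follows. This is a toy model with a
linear history only; the Revision class, import_tarballs and
same_history are hypothetical names, and uuid4 stands in for whatever
arbitrary id the tool would assign.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Revision:
    content: bytes
    parent_id: Optional[str]  # linear history only, to keep the sketch short


def import_tarballs(tarballs):
    """Commit each tarball in order under a fresh arbitrary name."""
    repo: Dict[str, Revision] = {}
    tip = None
    for content in tarballs:
        rev_id = uuid.uuid4().hex  # stand-in for an arbitrary revision id
        repo[rev_id] = Revision(content, tip)
        tip = rev_id
    return repo, tip


def same_history(repo_a, id_a, repo_b, id_b) -> bool:
    """Compare two linear histories by content, walking back to the root."""
    while id_a is not None and id_b is not None:
        rev_a, rev_b = repo_a[id_a], repo_b[id_b]
        if rev_a.content != rev_b.content:
            return False
        id_a, id_b = rev_a.parent_id, rev_b.parent_id
    return id_a is None and id_b is None


releases = [b"linux-1.0", b"linux-1.2", b"linux-2.0"]
mine, my_tip = import_tarballs(releases)
yours, your_tip = import_tarballs(releases)

# The tips have different names, so a tool that compares names only
# sees two unrelated branches...
assert my_tip != your_tip
# ...even though a content comparison back to the root finds the
# histories identical.
assert same_history(mine, my_tip, yours, your_tip)
```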

If you did the same thing in Mercurial then, assuming you followed
certain rules along the way,