Ancestry Graphs (fast-forward versus merge)

Thu Oct 26 19:50:26 BST 2006

I'm actually going to try and split up my responses, because we have
several threads going on at once. :)

I'm also bringing it onto the list. I realize it is sort of in the
middle of the conversation, but I've had some insight, that I think is
worthy of sharing.

Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
> 
> <snip/>
> 
>     >> And if you think that dereferencing one more time to access the
>     >> content may have associated costs, I'm ready to bet that the
>     >> other gains will compensate that several times.
>     >> 
> 
>     jam> I have no doubt that having some sort of recursive 'has
>     jam> anything changed underneath here' will be a large
>     jam> performance gain. I'm not sure if it does anything to
>     jam> solve 'ping-pong' merges.
> 
> Huh ?
> 
> I merge from you, I get a different tree *content*, I commit.
> 
> You merge from me, you get the *same* tree content, what do you
> want to commit ? I gave you nothing.
> 
> You may create a *commit* recording the fact that "at this point
> in time vila and I were in synch" and associate a content (that
> same content with nothing in it) to that commit, but that's
> different, that's information in the commit.
> 
> But hey ! *I* may be interested in that bit of info.
> 
> And I may want to make a commit for it. Not. The ping-pong ends
> here.
> 
>     jam> It is actually fairly simple to change bzr to have pull
>     jam> allow new tips, and work in "git" mode.
>     >> 
>     >> I think it's more than that, I think there are many areas where
>     >> pruning a tree process (or a multi-tree process with synchronized
>     >> walking of the trees, or... you get the point) as soon as you can
>     >> determine that the sub trees at hand are identical without any
>     >> need to verify further.
> 
>     jam> I was discussing something different here, about how to
>     jam> record the ancestry of a branch.
> 
> Oh, sorry.
> 
>     jam> git maintains a 'reflog' which is some sort of 'these
>     jam> are all the revision ids that have been the tip for this
>     jam> branch'. And somehow it preserves the idea that merging
>     jam> you is different than committing all of the merged
>     jam> revisions along the branch. Even though the ancestry dag
>     jam> doesn't contain any of that information.
> 
> I didn't understand it that way. A commit have pointers to its
> parents and benefit of the same property: if two commits have the
> same ancestry, they are equal. No need to explore all the
> ancestry. So if you merge me, you register my tip commit as a
> parent of your commit and get the associated content.
> 
> Now, I don't know if they transfer all the parents of my commit
> which is necessary to explore how I get there. But that also may
> be a short-cut to merge the only 'content' needed to bring you
> up-to-date with me. Not a obligation, though.
> 
>     >> 
>     jam> We actually used to do this. (Hence why
>     jam> Branch.revision_history is a list, rather than a single
>     jam> tip).  
>     >> 
>     >> Here you mean that a revision can have several parents ? Or is it
>     >> something deeper ?
> 
>     jam> When you have a DAG, any line through it can be seen as a
>     jam> "line-of-development" or what we call a "branch".
> 
>     jam> At one point, bzr kept track of which line through the
>     jam> dag a given branch used. And we allowed you to put
>     jam> anything as long as it was valid line. We changed it so
>     jam> that following the primary parent became the "official"
>     jam> line, and revision-history just became a cache of that
>     jam> history.
> 
>     jam> We could change back, but it starts making revision-history
>     jam> semi-precious again, rather than having it just be a cache.
> 
> What forbids marking one of the parents of a commit the official
> one ?
> 
> And what forbids rebuilding the revision-history on demand  ?
> 

You can't mark one of the parents as the 'official' one, if you and I
aren't going to agree on 'official'.

Let's draw a diagram, time flows downwards:

Me  You
A	# I start a project
| \
|   B	# You see it, and decide to add a feature
C   |	# I'm not finished, so I keep working
| / |
D   |	# I see your feature, and I merge your changes
|   E	# You keep working, perhaps on something else
  \ |
    F	# You see I merged you, and added feature C, so you merge back

At this point, my linear history might look something like:
  A C D
And yours looks like:
  A B E F

Now I want the changes from E, so if I merge you, should it, or should
it not create a new revision? Technically, your graph is a superset of
mine, so I could just put F as my tip, and go for a history that looks like:
  A C D F

Alternatively, I could create a new node, which is the merge of D and F
(we'll call it G). Then my linear history looks like:
  A C D G

The content hash for G might be identical to F, but it has new parents
(D & F) and a new committer, as well as probably a new timestamp, and
maybe some other meta information.

Let's say that we go with 'G', and then you blindly merge me back. You
can detect that the tree content is the same for F and G, and chose not
to create a new node (bzr can detect this already by noticing saying
there is no content change from the current revision versus the one that
we just merged). A recursive content hash makes it a bit cheaper to
determine, but it is an optimization we need for other cases anyway. I'm
not talking about that, I'm talking about how to handle the revision DAG.

Anyway, there exists a state, now, with the same tree content as both F
and G, but which has new parent ids, and potentially a different
committer name than G (you), what to do at this point? At present, bzr
lets you commit, and create a new node, which is different from other
nodes (in some fashion).

Lets jump back a little bit. If we create node 'G', then we have the
property that each node has 2 parents, and if we follow the first
parent, then we can regenerate the linear history:
  A B E F
and
  A C D G

from only the revision graph. If we don't create 'G', there is no way to
tell whether:
  A B E F   or   A C D F

Is the 'correct' linear history, without recording that information
somewhere else (git creates a reflog to do this, bzr used to use
revision-history to do the same).

Now, one thing this has revealed to me (and might have been what you
were already thinking), is that there is a trick we can pull. Which is
that if we store a content hash, which is just a handle to the tree
state, and doesn't reference the ancestry DAG at all. Then when I try to
merge D & F to get G, bzr can happily copy the contents of F, and create
a node G with the new information.
But if you came back and tried to do a merge of F & G, bzr could tell
you 'tree contents identical, nothing to merge'. That would stop
ping-ponging, and it would allow each node to have a concept of the
linear history that led up to this point.

I think this could be very good. I don't think preventing 'G' from
existing is the right solution, because I think there is content in
merging D & F. It is arguably small, but it is there. There is also
potential content in H == F & G, but I would be willing to draw a line
that the content is not useful.

Now, there is another case that I would like to propose, which is where
it is very useful (IMO) to have a node F, so that you can review what
has been going on:

v- public tree
    v- Feature branch
        v- Commit message
A	# Release 0.1
| \
|   B   # Add function foobar to create feature foo
|   |
|   C   # Crush the bugs and hear the lamentations of the evil
|   |
|   D	# Ooops, missed one
|   |
|   E	# use foobar in solve_all_problems()
| /
F	# [merge] foobar solves all problems

A fairly involved feature branch can be summarized in the commit F. So
that looking at the A, F history, you can see the release, and then that
feature "foobar" was added.

If you used fast-forward, you wouldn't get the opportunity to create the
summary information, and if you were answering the question:

  When was "foobar" added to the public branch

You would probably search through the logs and see the log for B, when
it was created, but it may not be clear that it wasn't ready until E to
actually be used.

Also, you can have the nice property that all of the commits on 'public
tree' pass the test suite, while things elsewhere may or may not.

I believe the kernel tree has the policy that all of B-E must not have
any (known) bugs in them, which means you have to rewrite the history to:

A
|
D # Bugless foobar
|
E # foobar used in solve_all_problems()

Basically, you have to hide that B and C existed, because there are bugs
in them, or maybe they wouldn't apply cleanly to a new tip.

Now, I can agree that in a project the size of the kernel, with as many
developers as it has. The ancestry becomes a little unwieldy. And you
can scale clean commits, because each person trying to get something
done can do the work to make it clean. While there is only 1 person
doing the merging. But in smaller projects, the time spent cleaning up
your patches may become a large percent of the time actually spent doing
work.

Hopefully I've answered your direct questions about why you have to
either add a new node, or keep a separate file indicating which nodes
are the linear history. And why there may be value in an "empty" merge.
(It is a way to collapse history without throwing it away).

Also, I've gained the insight that checking the tree content may be very
worthwhile to avoid "more empty" (truly empty??, very empty???) merges,
where both parents have the same content.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20061026/c28ec9f8/attachment.pgp