Ancestry Graphs (fast-forward versus merge)

Fri Oct 27 09:40:15 BST 2006

>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:

    jam> I'm actually going to try and split up my responses,
    jam> because we have several threads going on at once. :)

Indeed. And I realize that I may have merged several ideas into
what seems to be only one, I'll try to clarify below.

    jam> I'm also bringing it onto the list. I realize it is sort
    jam> of in the middle of the conversation, but I've had some
    jam> insight, that I think is worthy of sharing.

Fine with me.

    jam> Vincent Ladeuil wrote:
    >>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
    >> 
    >> <snip/>
    >> 
    >> >> And if you think that dereferencing one more time to
    >> >> access the content may have associated costs, I'm ready
    >> >> to bet that the other gains will compensate that
    >> >> several times.
    >> >> 
    >> 
    jam> I have no doubt that having some sort of recursive 'has
    jam> anything changed underneath here' will be a large
    jam> performance gain.

That was one part: with a clear definition of the 'content'
separated from 'SCM info' some problems become easier to
solve. So I will not talk more of that here.

Now, what do I call 'content' ?

Linus said: "What the user cares for". Hmmm. May as well be
defined as: "What the SCM handle and can provide through an
export', i.e. a distribution of a software for example (let's
forget about whether or not that should contain a 'configure'
script or just a 'configure.ac')".

This is still vague and ambiguous.

Let's say: "What the user have in its working tree, minus the
'./bzr' directory". A bit abstract, but may be easier to
grasp. That leave only things like executable bits, mode bits,
acls and some such as potentially unclear parts, so for the sake
of the discussion I will just ignore they exist.

<snip/>

    jam> When you have a DAG, any line through it can be seen as
    jam> a "line-of-development" or what we call a "branch".
    >> 
    jam> At one point, bzr kept track of which line through the
    jam> dag a given branch used. And we allowed you to put
    jam> anything as long as it was valid line. We changed it so
    jam> that following the primary parent became the "official"
    jam> line, and revision-history just became a cache of that
    jam> history.
    >> 
    jam> We could change back, but it starts making revision-history
    jam> semi-precious again, rather than having it just be a cache.
    >> 
    >> What forbids marking one of the parents of a commit the official
    >> one ?
    >> 
    >> And what forbids rebuilding the revision-history on demand  ?
    >> 

    jam> You can't mark one of the parents as the 'official' one,
    jam> if you and I aren't going to agree on 'official'.

If I understand it correctly you and I have a different
'official' one. And we should be able to handle that as we see
fit.

    jam> Let's draw a diagram, time flows downwards:

Me  You
A	# I start a project
| \
|   B	# You see it, and decide to add a feature
C   |	# I'm not finished, so I keep working
| / |
D   |	# I see your feature, and I merge your changes
|   E	# You keep working, perhaps on something else
  \ |
    F	# You see I merged you, and added feature C, so you merge back

    jam> At this point, my linear history might look something like:
    jam>   A C D
    jam> And yours looks like:
    jam>   A B E F

    jam> Now I want the changes from E, so if I merge you, should
    jam> it, or should it not create a new revision? Technically,
    jam> your graph is a superset of mine, so I could just put F
    jam> as my tip, and go for a history that looks like:

    jam>   A C D F

That's one choice. You consider that you're work is finished and
that you will now continue from there.

    jam> Alternatively, I could create a new node, which is the
    jam> merge of D and F (we'll call it G). Then my linear
    jam> history looks like:

    jam>   A C D G

    jam> The content hash for G might be identical to F, but it
    jam> has new parents (D & F) and a new committer, as well as
    jam> probably a new timestamp, and maybe some other meta
    jam> information.

That's another one. You consider that your work is not finished,
but you want to record that, even if no 'content' change was
introduced, this is a point of interest from a historical point
of view. You create a commit to record that information.

Or you make a release at that point and this information is
clearly even more important.

    jam> Let's say that we go with 'G', and then you blindly
    jam> merge me back. You can detect that the tree content is
    jam> the same for F and G, and chose not to create a new node
    jam> (bzr can detect this already by noticing saying there is
    jam> no content change from the current revision versus the
    jam> one that we just merged).

An important information (the content is the same), IMO.

And from there I may want to do a new commit if you have made a
release, to mark where in *my* history I was in synch with the
release.

Or I may decide to ignore it because, well, that's not a release
:)

    jam> A recursive content hash makes it a bit cheaper to
    jam> determine, but it is an optimization we need for other
    jam> cases anyway. I'm not talking about that,

Ok. This is where I decided to stop discussing that part*.

    jam> I'm talking about how to handle the revision DAG.

Let's talk about that and that only.

    jam> Anyway, there exists a state, now, with the same tree
    jam> content as both F and G, but which has new parent ids,
    jam> and potentially a different committer name than G (you),
    jam> what to do at this point? At present, bzr lets you

It forces me.

    jam> commit, and create a new node, which is different from
    jam> other nodes (in some fashion).

It is different. It carries information. I care about that
information. I want a commit to record that information. But that
information is not about the content (except that the content is
the same in F and G). And this is why I want to separate what I
call 'content' from the 'rest'.

    jam> Lets jump back a little bit. If we create node 'G', then
    jam> we have the property that each node has 2 parents, and
    jam> if we follow the first parent, then we can regenerate
    jam> the linear history:

    jam>   A B E F
    jam> and
    jam>   A C D G

    jam> from only the revision graph. If we don't create 'G',
    jam> there is no way to tell whether:

    jam> 
    jam>   A B E F or A C D F

Yes. I agree. And some people may be interested by that property
because they care only about the content.

I, for one, will probably be not part of them (I tend to put a
lot of information in my commit associated comments).

    jam> Is the 'correct' linear history, without recording that
    jam> information somewhere else (git creates a reflog to do
    jam> this, bzr used to use revision-history to do the same).

My current knowledge is not sufficient to understand how this is
done in either git or bzr. That may be why I have problems to
explain what I want.

Let's try again.

I think that recording the 'correct' linear history is good, I
want that for me. But in the same time, I don't' want to force
other people to keep it if they have no interest in that.

I think that it should be possible to give more control to the
user in how he wants to record his history.

    jam> Now, one thing this has revealed to me (and might have
    jam> been what you were already thinking), is that there is a
    jam> trick we can pull. 

More than one :-) That's the point. And I think you get it.

    jam> Which is that if we store a content hash, which is just
    jam> a handle to the tree state, and doesn't reference the
    jam> ancestry DAG at all. Then when I try to merge D & F to
    jam> get G, bzr can happily copy the contents of F, and
    jam> create a node G with the new information.  

E-x-a-c-t-l-y. That's what I call a 'commit'. And at that point
you may create it to say: F becomes my tip, I follow vila *or* G
becomes my tip, I have other things to add (by content or by
commit).

    jam> But if you came back and tried to do a merge of F & G,
    jam> bzr could tell you 'tree contents identical, nothing to
    jam> merge'. 

Yes.

    jam> That would stop ping-ponging, and it would allow each
    jam> node to have a concept of the linear history that led up
    jam> to this point.

Indeed.

    jam> I think this could be very good. I don't think
    jam> preventing 'G' from existing is the right solution,
    jam> because I think there is content in merging D & F. It is
    jam> arguably small, but it is there. There is also potential
    jam> content in H == F & G, but I would be willing to draw a
    jam> line that the content is not useful.

I agree with that too. But let's illustrate: I may even be
interested to record that at that point *you* merged from me and
create a 'H' commit, just because I'm a bit anal about who do
what when, there is still less information that in your 'G'
commit, but some bureaucrats can have very good reasons to record
that.

And yes, that may continue, with information less and less
important added at each ping-pong round.

My point is that should be possible but not mandatory.

    jam> Now, there is another case that I would like to propose, which is where
    jam> it is very useful (IMO) to have a node F, so that you can review what
    jam> has been going on:

v- public tree
    v- Feature branch
        v- Commit message
A	# Release 0.1
| \
|   B   # Add function foobar to create feature foo
|   |
|   C   # Crush the bugs and hear the lamentations of the evil
|   |
|   D	# Ooops, missed one
|   |
|   E	# use foobar in solve_all_problems()
| /
F	# [merge] foobar solves all problems

    jam> A fairly involved feature branch can be summarized in
    jam> the commit F. So that looking at the A, F history, you
    jam> can see the release, and then that feature "foobar" was
    jam> added.

    jam> If you used fast-forward, you wouldn't get the
    jam> opportunity to create the summary information, and if
    jam> you were answering the question:

    jam>   When was "foobar" added to the public branch

    jam> You would probably search through the logs and see the
    jam> log for B, when it was created, but it may not be clear
    jam> that it wasn't ready until E to actually be used.

Yes. *I* will not do that and I think this information should be
recorded, and the user should have the choice to "import" the
commits B-E *or* just the associated final 'content'
associated with E.

OR

It should be able to look at the history and view B-E or just B,
E or whatever.

    jam> Also, you can have the nice property that all of the
    jam> commits on 'public tree' pass the test suite, while
    jam> things elsewhere may or may not.

And that's still another kind of information that I don't want to
mix with the 'content' one. But yes, that's very valuable
information ! Let's create a commit to register that ! 

    jam> I believe the kernel tree has the policy that all of B-E
    jam> must not have any (known) bugs in them, which means you
    jam> have to rewrite the history to:

    jam> A
    >> 
    jam> D # Bugless foobar
    >> 
    jam> E # foobar used in solve_all_problems()

    jam> Basically, you have to hide that B and C existed,
    jam> because there are bugs in them, or maybe they wouldn't
    jam> apply cleanly to a new tip.

And there it appears that different users will want to look at
the history in different ways, so we should be able to present
these different views in different ways from the same set of
informations.

That may well be the role of one GUI tool and not one proposed by
the bzr core and at that point I think that the git operators
that allows to select different part of a DAG have sense**.

    jam> Now, I can agree that in a project the size of the
    jam> kernel, with as many developers as it has. The ancestry
    jam> becomes a little unwieldy. And you can scale clean
    jam> commits, because each person trying to get something
    jam> done can do the work to make it clean. While there is
    jam> only 1 person doing the merging. But in smaller
    jam> projects, the time spent cleaning up your patches may
    jam> become a large percent of the time actually spent doing
    jam> work.

I think that presenting history is a very hard
problem. Preserving history may be a simpler one, but let's try
to not be rigid in solving it.

    jam> Hopefully I've answered your direct questions about why
    jam> you have to either add a new node, or keep a separate
    jam> file indicating which nodes are the linear history. And
    jam> why there may be value in an "empty" merge.  (It is a
    jam> way to collapse history without throwing it away).

Yes you have.

I don't think 'G' is an empty merge, 'H' may be for some people.

    jam> Also, I've gained the insight that checking the tree
    jam> content may be very worthwhile to avoid "more empty"
    jam> (truly empty??, very empty???) merges, where both
    jam> parents have the same content.

Great :)

Thanks a lot to have made the whole picture clearer for me,

       Vincent

*: I first read your mail very very quickly and get the impression
 that we disagree. Then I re-read slower and realized that we
 agree far more than I thought and that in fact I had mixed two
 things: one is the point that by separating 'content' from
 'commit' we get a mine of optimizations. We agree on that. The
 other is that there are several policies to chain the commits
 (once they are separated from the content).

**: Please note that I endorse *nothing* by saying that, I have
  no idea about what are the *right* operators to propose. But
  restricting bzr to manipulate the DAG with only a range
  operator appears really far too limited. I will even go further
  and say that limiting the operations to only one DAG
  may... (ok, can of worms, never mind :)