[PATCH] more work on merge

John A Meinel john at arbash-meinel.com
Mon Jul 11 17:08:07 BST 2005


Aaron Bentley wrote:
> John A Meinel wrote:
>
>>>Robert Collins wrote:
>>>
>>>>A better one (but not one cheaply accessible to baz today - tho it may
>>>>be cheap for bzr today) might be lines modified + new files + deleted
>>>>files.
>
>
> Ultimately, the criteria for a good merge base is "whatever produces the
> fewest conflicts".  A good merge base will
> - not have any lines that differ from THIS and OTHER when THIS and OTHER
> also differ
> - have all files that differ in THIS and OTHER
>
> There are really only a few good candidates for merge bases anyhow, as
> any merge candidate that is an ancestor of another merge candidate is
> unlikely to be useful.  (This is a heuristic.  It assumes people don't
> revert to their previous state often, while their branches also revert
> to that previous state without merging.)

I agree, the ancestor of a candidate is rarely a better candidate.
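In sketch form, that pruning could look like this (hypothetical names;
ancestry is assumed to map each revision id to the set of all of its
ancestors):

def prune_candidates(candidates, ancestry):
    # Drop any candidate that is an ancestor of another candidate,
    # since the descendant is almost always at least as good a base.
    keep = []
    for rev in candidates:
        others = (ancestry[other] for other in candidates
                  if other != rev)
        if not any(rev in ancestors for ancestors in others):
            keep.append(rev)
    return keep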

>
> So it might be worthwhile to try 3-way merging all good candidates.  We
> aren't usually talking about huge numbers of files anyway.
>

Maybe. It seems like using a sledgehammer to drive in a nail, but it
might be the best solution.
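If we did it, it would be something like the following
(merge_and_count is just a placeholder for running the real 3-way
merge and counting the conflicts):

def best_base(candidates, this, other, merge_and_count):
    # Try every surviving candidate as the base and keep whichever
    # produces the fewest conflicts, per Aaron's criterion above.
    best, best_conflicts = None, None
    for base in candidates:
        conflicts = merge_and_count(base, this, other)
        if best_conflicts is None or conflicts < best_conflicts:
            best, best_conflicts = base, conflicts
    return best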

>
>>>>This neatly fixes the history shortcut problem. (The late merge produces
>>>>a very large LOC change from the merge source to the branch, making that
>>>>node an expensive one to take - though it may well be taken once the
>>>>mainlines overlapping changes to single lines/files start to dominate).
>
>
> Wouldn't another way of solving the history shortcut problem be to
> always record the longest path from CANDIDATE to THIS/OTHER, and then
> take the candidate with the shortest longest path?
>

Yes and no. The problem is that with multiple lines of development, you
need some way to stop iterating over all the permutations of paths. See
my earlier post where, with a simple 13 revisions, I generated 60
entries that needed to be searched.
If I find a longer route to a revision, then I need to update all the
routes out of that revision, since they now have longer routes as well.
With the min search, the first time I see a revision it is already at
its shortest distance (if you are doing a breadth-first search).
Now if you could stop a future wavefront because you found a longer
route to it, that might be worth something.
Example:

# A - B - C - D - E - F - G
#               \       /
#                 H - -

In this case, an exhaustive BFS gives you:
G-0, (F-1, H-1), (E-2, D-2), (D-3, C-3), (C-4, B-4), (B-5, A-5), A-6
If there were some way to notice, when you see D-3, that D-2 should be
stopped from expanding and giving you C-3, that would be nice.
The only thing I can think of is that you could use BFS, and then if
you encounter a node a second time, you immediately update any child
entries that already exist in the revision => distance dictionary.
Then, as you iterate, you make sure you are using the longest distance
recorded for each child.
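A rough sketch of that scheme (plain dicts and parent lists, not the
real bzr data structures; it just shows the shape of it):

from collections import deque

def longest_distances(graph, start):
    # graph maps revision -> list of parent revisions.
    # Returns the length of the longest path from start to each
    # reachable revision. Whenever a longer route to a node is
    # found, the node is re-queued, which is the "update the
    # children" step; on an ancestry DAG this terminates, but a
    # node can be visited many times.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        rev = queue.popleft()
        for parent in graph.get(rev, []):
            new_dist = distance[rev] + 1
            if new_dist > distance.get(parent, -1):
                distance[parent] = new_dist
                queue.append(parent)
    return distance

# The example above:
graph = {'G': ['F', 'H'], 'F': ['E'], 'H': ['D'], 'E': ['D'],
         'D': ['C'], 'C': ['B'], 'B': ['A'], 'A': []}
# longest_distances(graph, 'G') ends with D-3, C-4, B-5, A-6,
# matching the exhaustive trace.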

>
>>>There are a couple of issues, though. Specifically, now you need the
>>>inventory as well as the revision XML, because otherwise you can't tell
>>>what has changed. This isn't a big issue, but the currently proposed
>>>cset format only supplies the revision XML.
>
>
> Oh, but you can generate the inventory from the changeset and its base,
> right?  Or just install the changeset's revision first?

I can generate the *last* inventory XML in a multi-revision rollup, and
I can generate all of the intermediate revision XML entries.
But if you want to generate the intermediate inventory XML, then you
need to include the intermediate patches, which negates the advantage
of a rollup. If you have to have the inventory XML for all of the
ancestry, then you might as well just send a series of changesets.

>
>
>>>I think it is still cheaper for bzr to use "num_modified + .5*num_new +
>>>.5*num_deleted", because bzr would have to extract the actual texts and
>>>compare them. (Of course, as weave or revlib starts to become more
>>>common, you might only have to extract the delta, rather than get 2
>>>texts and compute a diff).
>
>
> Extracting texts from the current storage format doesn't seem very
> expensive to me.  We're only talking about modified files, and heck, we
> can optimize by only grabbing one copy of each text.

It isn't expensive to get the text, but you need to get the difference
from its previous version, so it means running diff for each modified
file in the history.
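For comparison, the weighting quoted above is trivially cheap once you
have the per-changeset file counts:

def merge_base_cost(num_modified, num_new, num_deleted):
    # The heuristic quoted above: modified files count full weight,
    # new and deleted files count half.
    return num_modified + 0.5 * num_new + 0.5 * num_deleted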

And since currently we require gathering the entire distance map (the
distance from each revision to the target), it means computing the
distance for every revision in the tree, and thus re-computing a whole
lot of diffs. Now if we change the algorithm to just two
shortest-path-first searches, from TARGET and from OTHER, until they
happen to overlap, at least you don't have to walk all of the ancestry.
If we did go with all of the ancestry, there would be a very strong
argument for memoizing the differences, perhaps a 'diff-store' indexed
by text-id => text-id.
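To make the two-search idea concrete, here is a minimal sketch (same
toy graph representation as above; a real implementation would need to
handle ghosts and multiple meeting points):

from collections import deque

def common_ancestor(graph, this, other):
    # Run two breadth-first (shortest-path-first) searches, one
    # from each tip, expanding them alternately until one search
    # reaches a revision the other has already seen. Only the
    # revisions actually visited need any work done on them.
    seen = {this: {this}, other: {other}}
    queues = {this: deque([this]), other: deque([other])}
    while queues[this] or queues[other]:
        for tip, rival in ((this, other), (other, this)):
            if not queues[tip]:
                continue
            rev = queues[tip].popleft()
            if rev in seen[rival]:
                return rev  # the wavefronts overlap here
            for parent in graph.get(rev, []):
                if parent not in seen[tip]:
                    seen[tip].add(parent)
                    queues[tip].append(parent)
    return None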
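And the diff-store idea, again as a sketch (diff_texts and get_text
stand in for whatever diff routine and text storage we actually use):

_diff_store = {}

def cached_diff(old_text_id, new_text_id, get_text, diff_texts):
    # Memoize on the (text-id, text-id) pair, so that each delta
    # is computed at most once no matter how many candidate bases
    # make us ask for it.
    key = (old_text_id, new_text_id)
    if key not in _diff_store:
        _diff_store[key] = diff_texts(get_text(old_text_id),
                                      get_text(new_text_id))
    return _diff_store[key]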

>
> Aaron

John
=:->