Some questions on internals

Mon Feb 22 18:07:34 GMT 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> 2. I am afraid the answer is no- but.... Is there any way to relate
> the
>> chunks of the delta caused in 3.2 with the related chunks in the delta
>> to 3.1 (the initial fix merge)? Especially considering the possibility
>> that merge conflicts on 3.2 would cause (small) changes in the merge
>> there?
> 
> Annotate was mentioned 2ce as a method to do this, but I might be
> missing something because I do not see the advantage of doing annotate?
> If I commit the 3.1 merge on the 3.2 branch just doing the diff on that
> latest revision would give me the changes merged at their location in
> the 3.2 sources. I would need to find out however which diff chunk in
> the /original merge/ moved /where/ in the 3.2 version. I can of course
> try to somehow recognise diffs (using some kind of fuzzy canonical hash
> or something) over those versions but that can easily fail if minor
> changes are made during a merge (or when chunks look alike).

Guessing at where content moved is generally going to run into troubles
when "minor changes are made or when chunks look alike". You are making
an inference, rather than having someone tell you what happened. You can
work towards a 'best guess' but always you'll get something wrong.

> 
> Andrew, with tracking versions only, can you give me an idea on how a
> merge at file level is calculated? Say you have (wildly) diverged
> branches and a newer one has added heaps of code. Now you add one line
> in the "old" version at line 100, then merge it upward - where it
> appears at line 987 because lots of other commits changed code before it
> in the new version.

The most common merge algorithm is 3-way merge. Which checks the
ancestry and sees that "old" version is the most-recent common ancestor
of the two current versions of the file. We end up with 3 texts, BASE is
the "old" version, THIS is the version with lots of changes (~987), and
OTHER is the modified "old" version (change at line 100).

At this point, you detect what bits of OTHER should be added to THIS, by
comparing it with BASE. That gives you context.

You then compare THIS with BASE, to also get context. Regions in THIS
which are the same as BASE can be used to line up the matching regions
in OTHER. If the new change in OTHER lies cleanly between matching
regions, then it gets applied cleanly, otherwise it gets marked as a
conflict.

As an example, (I'm writing everything on one line, but imagine each
character represents a line of text.)
BASE = ABC
OTHER = ABDC
THIS = AXYZ...MNOPBC

Doing the matching, you get:

BASE   A          B C
OTHER  A          BDC
THIS   AXYZ...MNOPB C

Since 'B' and 'C' are still in THIS, and line up with BASE and OTHER's
content, it is clear where D should be put. Conflicts happen with stuff
like "removing B in THIS":

THIS = AXYZ...MNOPC

Now D is trying to slot between content that is no longer present, so it
just marks a conflict. (In this case all of XYZ...MNOP conflict with 'D'
in the merge.)

> 
> You can "easily" extract the change itself (what diff to apply to
> target) by looking at the shared history so that part I get. But how do
> you determine a new location in target if you have no "move history"
> there? Because the only way I can think of to do that is to conceptually
> merge the change onto the revision closest to the source revision, and
> then reapplying /all/ diffs on that file upwards to the latest version
> keeping track of changes lines due to inserts and deletes.
> 

You use the delta between THIS and BASE to find matching chunks, and
then line up by matching that with BASE and OTHER. (Conceptually the
*matches* with BASE are used to line up, while the *differences* with
BASE are used to determine what needs to be added.)

> I ask because the mechanism used might also tell me how to determine a
> new location for a diff chunk.
> 
> Thanks,
> 
> Frits

The problem here is that you don't really have a logical BASE to compare
against. You have 2 deltas (3.1 vs its parent, and 3.2 vs its parent).
What I would probably do is:

1) compute the exact diff for both branches
2) Find exact matching lines in the diff.
3) (optional) for lines that are remaining, see if there is high enough
similarity to consider it a 'slightly modified' version of the same line.

Again, you can use the regions in (2) to narrow down your search in (3).
Given:

 a
+new content
- -removed content
 b

In one diff, and

 a
+mod content
- -removed content
 b

You know that the first and last two lines are identical. And then you
start comparing the lines inbetween to see if they are 'close enough'.
You could use something like 75% of the line is identical on both sides,
or something like that.

Basically, just use 'diff' on the unified diff text. Python's difflib,
or bzrlib's patiencediff can be used here. Also, I believe both of them
would let you do a per-character diff on a single line (worst case you
just expand the line into a line per character, and run it there.)

I'm pretty sure that is what stuff like vim does when giving you
sub-line highlighting.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuCx+YACgkQJdeBCYSNAANgFQCeMtZ8NVTSJHY1scg0UIHD4Iz+
FTsAn1dRE2SlFftRMQRIUlXxZ3AxlKpT
=87rB
-----END PGP SIGNATURE-----