Interesting merge optimization

Martin Pool mbp at sourcefrog.net
Thu Dec 8 01:42:37 GMT 2005


On  4 Dec 2005, John A Meinel <john at arbash-meinel.com> wrote:
> Robert Collins wrote:
> > On Fri, 2005-12-02 at 23:37 -0600, John Arbash Meinel wrote:
> > 
> >> Speaking of knits, I was thinking that we might be trying for too much
> >> by setting the requirement that we don't have to load the whole thing
> >> into memory. I realize we would like to see a version which gets around
> >> the O(n_lines) behavior by not having to load all lines.
> > 
> > The key thing about not loading the entire thing is that we can copy
> > remote revisions across without scaling per commit.
> > 
> > I.e. if you have 60000 commits in a repo, and I have 59999 of those, how
> > much data do I need to read to end up with 60000?
> > 
> > Rob
> 
> That is a different part of the problem.
> I'm saying that you still need to read all 60,000. You can just choose to
> read the local ones rather than all from remote.
> 
> One of the things we were going to try to do with knits was to say:
> 	I want to reproduce version 6123 in memory, to do so,
> 	I only need to read hunks 52, 55, 1123, 4432, 5524, and 6123
> 
> I don't know how easy that is going to be. And for Codeville weaves, it
> isn't possible (they explicitly require loading of all previous lines).
> 
> Having each revision be a diff against others, and having those
> locations recorded in an index, gives us an append-only file and the
> ability to read only the remote chunks that we don't have locally.

A file which is append-only is good for robustness.  Having an index and
reading chunks is not an ideal solution for dumb-protocol remote access,
because we would probably still need to read the whole index, and for
something like the inventory that index may get quite large.  It would
be unfortunate to have to read, say, 100 bytes * 10,000 records == 1MB of
inventory index for every operation on a remote branch.
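
A rough sketch of the kind of access pattern this would give us, with
made-up names rather than real bzrlib API, might look like:

    # Hypothetical sketch: pull only the hunks we are missing from an
    # append-only knit-like file, given an index of
    # (version_id, offset, length) records.  read_range() is an assumed
    # dumb-transport primitive, not a real bzr call.

    def missing_hunks(index, local_versions):
        """Yield (offset, length) for every version we lack locally."""
        for version_id, offset, length in index:
            if version_id not in local_versions:
                yield offset, length

    def fetch_missing(read_range, path, index, local_versions):
        """Read just the byte ranges for the missing versions."""
        hunks = []
        for offset, length in missing_hunks(index, local_versions):
            hunks.append(read_range(path, offset, length))
        return hunks

The hunk reads scale with what we are missing, but the index read does
not: over a dumb protocol the whole index still has to come across,
which is exactly the ~1MB cost above.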

> I think that is actually more important than being able to recreate the
> in-memory text by only reading a selected number of hunks.

Perhaps the best of both worlds would be for building the full weave to
take time proportional to the number of line changes, while extracting a
single version of the text takes less.

> lose any ability to have deleted lines affect where things go. Which is
> sometimes useful. Certainly for weave merge you need to know what lines
> have been deleted.

Let's try to say exactly what would be different.

There are two possible outcomes in a merge for deleted text: either it
is deleted in the result, or it conflicts with another change in the
same area.

Suppose we do a cdv-like merge, where we first find regions of
disagreement between the current texts.  Within those regions, we
conflict unless exactly one side has modified the text since the common
version.  If we change this to not take deletions into account, then the
condition becomes "conflict unless exactly one side has inserted new
lines since the common version."

So on each side any of the following can occur:

 - no changes
 - deletions only
 - new insertions only
 - new insertions and new deletions

Not taking deletions into account in the merge conflates the first and
second cases, and the third and fourth.
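
To make that conflation concrete, here is a sketch (hypothetical
helpers, nothing from bzrlib) of classifying one side of a disagreement
region; dropping the deletion check collapses the four cases into two:

    # Hypothetical sketch.  region_lines is what this side has in the
    # region, as (origin, text) pairs with origin 'new' or 'old';
    # base_lines is what the common ancestor had there.

    def classify_side(region_lines, base_lines):
        inserted = any(origin == 'new' for origin, _ in region_lines)
        kept_old = [t for origin, t in region_lines if origin == 'old']
        deleted = len(kept_old) < len(base_lines)
        if not inserted and not deleted:
            return 'no changes'
        if deleted and not inserted:
            return 'deletions only'
        if inserted and not deleted:
            return 'insertions only'
        return 'insertions and deletions'

    # Ignoring deletions means testing only 'inserted', which cannot
    # tell 'no changes' from 'deletions only', nor 'insertions only'
    # from 'insertions and deletions'.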

If there is a region where one side has made only deletions, and the
other side has made insertions, weave merge will show a conflict, but
annotation-merge-without-deletions will let the inserting side win.
That is a problem when the deletions removed lines which were preserved
on the other side: the deletion is silently lost.
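
As a small made-up example of that bad case:

    BASE  = ['a\n', 'b\n', 'c\n']
    THIS  = ['a\n', 'c\n']                  # deletions only: 'b' removed
    OTHER = ['a\n', 'b\n', 'x\n', 'c\n']    # insertions only: 'x' added

    # Weave merge sees unmerged changes on both sides and conflicts.
    # A merge that ignores deletions sees new lines only on OTHER's
    # side, lets OTHER win, and quietly produces
    # ['a\n', 'b\n', 'x\n', 'c\n'], discarding THIS's deletion of 'b'.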

One way to handle deletions is to note that a deletion on one side
corresponds to an old line still present on the other side.  If the
two-way diff picks out all the lines common to both sides, then every
line in the disagreement region must be either

 - new on this side (not originating in a common ancestor)
 - old, but not present on the other side (therefore deleted on the
   other side)

We want to conflict on regions where there are unmerged changes on both
sides.  We can do that using these annotations if we say that having old
lines present on one side means there is a new deletion on the *other*
side.

So, for each side, in each disagreement region:

 - if there are new lines, this side has new information
 - if there are old lines, the *other* side has new information

If exactly one side has new information, it is used; otherwise there is
a conflict.
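
A sketch of that rule (hypothetical names again, not code from bzr):

    # Each side's region is a list of (origin, text) pairs, origin being
    # 'new' (not from a common ancestor) or 'old' (from one).  Old lines
    # surviving on one side imply the other side deleted them, so the
    # other side is credited with new information.

    def merge_region(this_lines, other_lines):
        def has_new(lines):
            return any(origin == 'new' for origin, _ in lines)

        def has_old(lines):
            return any(origin == 'old' for origin, _ in lines)

        this_has_info = has_new(this_lines) or has_old(other_lines)
        other_has_info = has_new(other_lines) or has_old(this_lines)

        if this_has_info and not other_has_info:
            return [text for _, text in this_lines]
        if other_has_info and not this_has_info:
            return [text for _, text in other_lines]
        return 'conflict'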

One problem with both annotation-based merge and weaves is a tension as
to the purpose of annotations.  When humans are looking at the code they
generally want to know the true origin of the line - i.e. when it was
originally written.  But for merging, we also want to use the
annotations as an indication of the priority of a line - when it needs
to be considered newly important.

Perhaps we should in fact be tracking up to two annotations for each
line: when it really originated, and when it was last explicitly chosen.
This would get set when the line is chosen as a conflict resolution.
This is similar to *-merge, but I don't remember it being implemented
anywhere for text merges yet.
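
A sketch of what carrying both annotations on a line might look like
(names made up for illustration):

    class AnnotatedLine:
        def __init__(self, text, origin_revision):
            self.text = text
            self.origin = origin_revision       # true origin, for annotate
            self.last_chosen = origin_revision  # priority used by merge

        def chosen_in(self, revision_id):
            # Called when a conflict resolution explicitly keeps this
            # line; the true origin stays, only the merge priority moves.
            self.last_chosen = revision_id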

-- 
Martin