Interesting merge optimization

Thu Dec 8 13:49:01 GMT 2005

Martin Pool wrote:
> On  4 Dec 2005, John A Meinel <john at arbash-meinel.com> wrote:
>> Robert Collins wrote:
>>> On Fri, 2005-12-02 at 23:37 -0600, John Arbash Meinel wrote:
>>>
>>>> Speaking of knits, I was thinking that we might be trying for too much
>>>> by setting the requirement that we don't have to load the whole thing
>>>> into memory. I realize we would like to see a version which gets around
>>>> the O(n_lines) behavior by not having to load all lines.
>>> The key thing about not loading the entire thing is that we can copy
>>> remote revisions across without scaling per commit.
>>>
>>> I.e. if you have 60000 commits in a repo, and I hav 59999 of those, how
>>> much data do I need to read to end up with 60000
>>>
>>> Rob
>> That is a different portion.
>> I'm saying that you still need to read all 60,000. You can just chose to
>> read the local ones rather than all from remote.
>>
>> On of the things we were going to try to do with knits was say:
>> 	I want to reproduce version 6123 in memory, to do so,
>> 	I only need to read hunks 52, 55, 1123, 4432, 5524, and 6123
>>
>> I don't know how easy that is going to be. And for Codeville weaves, it
>> isn't possible (they explicitly require loading of all previous lines).
>>
>> Having each revision be a diff against others, and having these
>> locations put into an index lets us get append-only, and the ability to
>> only read the remote chunks that we don't have locally.
> 
> A file which is append-only is good for robustness.  Having an index and
> reading chunks is not an ideal solution for dumb-protocol remote access,
> because we would probably still need to read out the whole index, and
> and things like the inventory that index may get quite large.  It would
> be unfortunate to have to read say 100bytes*10000 records == 1MB of
> inventory index for every operation on a remote branch.
> 
>> I think that is actually more important than being able to recreate the
>> in-memory text by only reading a selected number of hunks.
> 
> Perhaps the best of both worlds would be that it should be possible to
> get a weave in time proportional to the number of line changes, but
> getting one version of the text should be less.
> 
>> lose any ability to have deleted lines affect where things go. Which is
>> sometimes useful. Certainly for weave merge you need to know what lines
>> have been deleted.
> 
> Let's try to say exactly what would be different.
> 
> There are two possible outcomes in a merge for deleted text: either it
> is deleted in the result, or it conflicts with another change in the
> same area.
> 
> Suppose we do a cdv-like merge, where we first find regions of
> disagreement between the current text.  Within those regions, we will
> conflict unless exactly one side has modified the text since the common
> versions.  If we change this to not take deletions into account, then
> the condition becomes "conflict unless exactly one side has inserted new
> lines since the common version."
> 
> So on each side any of the following can occur:
> 
>  - no changes
>  - deletions only
>  - new insertions only
>  - new insertions and new deletions
> 
> Not taking deletions into account in merge conflates the first and
> second, and third and forth cases.
> 
> If there is a region where one side has made only deletions, and the
> other side has made insertions, weave merge will show a conflict, but
> annotation-merge-without-deletions will let the new side win.  That
> would be a problem if the deletions removed lines which were preserved on
> the other side.  This is not so good.
> 
> One way to handle deletions is to note that a deletion on one side
> corresponds to an old line still present on the other side.  If the
> two-way diff picks out all the lines common to both sides, then every
> line in the disagreement region must be either
> 
>  - new on this side (not originating in a common ancestor)
>  - old, but not present on the other side (therefore deleted on the
>    other side)
> 
> We want to conflict on regions where there are unmerged changes on both
> sides.  We can do that using these annotations if we say that having old
> lines present on one side means there is a new deletion on the *other*
> side.
> 
> So for each side, on each disagreement region
> 
>  if there are new lines, this side has new information
>  if there are old lines, the *other* side has new information
> 
> If exactly one side has new information, it is used; otherwise there is
> a conflict.
> 
> One problem with both annotation-based merge and weaves is a tension as
> to the purpose of annotations.  When humans are looking at the code they
> generally want to know the true origin of the line - i.e. when it was
> originally written.  But for merging, we also want to use the
> annotations as an indication of the priority of a line - when it needs
> to be considered newly important.
> 
> Perhaps we should in fact be tracking up to two annotations for each
> line: when it really originated, and when it was last explicitly chosen.
> This would get set when the line is chosen as a conflict resolution.
> This is similar to *-merge, but I don't remember it being implemented
> anywhere for text merges yet.
> 

Isn't that what codeville merge does by keeping a numbered state? When a
line is created it gets the value 1 indicating it exists. When it is
deleted that is incremented to 2, if it is brought back later it is now
3, etc.

Though I guess it doesn't mark the revision that it was revived in
(though the information is there).

I guess the other problem is that you would have to know that there was
a conflict in this region, and it was resolved by selecting this line.
(Hence the line was actively selected).

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 249 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051208/6f38a350/attachment.pgp