Defining semantics for copying and combining files/directories/symlinks.

Mon Mar 19 08:17:14 GMT 2007

On Mon, 2007-03-19 at 00:52 -0400, Aaron Bentley wrote:
> Robert Collins wrote:
> > On Sun, 2007-03-18 at 22:21 -0400, Aaron Bentley wrote:
> >> - This is just file splitting, which is not controversial.  Copies are.
> >>   If all you meant was file splitting, you could have saved me a lot of
> >>   concern.
> >> - On the other hand, file splitting does not allow "us to support copies
> >>   as first-class operations" as you previously described.
> > 
> > I perceive the operation I describe above as being one of copying,
> > followed by deletion of some of the content from each side. It's not
> > strictly splitting because the file's content is not partitioned; e.g.
> > the copyright header is preserved.
> 
> I perceive the operation you describe as being splitting, plus adding a
> copyright header to one side.  You described it as splitting quite
> frequently.

Let's assume that splitting and copying are sufficiently different that
they need separate definitions, and let's make the line a hard one:
 * splitting A into A and B gives an A' and a B that share a common
heritage, but no operations from that point on will consider them
linked. Merges from an unsplit branch will ???. Merging from the split
branch will ???. I don't have good answers to these two '???'s because
I'm assuming that we want something different to the copy case. I'll try
though: merges from an unsplit branch will apply to both A' and B, but
each hunk of difference may only apply to one of A' and B; if it applies
successfully to both A' and B it is marked as conflicted in both. Merges
from a split branch to an unsplit branch will split the file at the
point it was split in the source branch and apply the changes from the
target branch's A to both A' and B as per the reverse operation.
 * copying A to B gives an A' and a B that share a common heritage, and
operations which were defined as just affecting A will affect A' and B;
operations that affect A' and B will both affect A when merged in the
opposite direction, to preserve the symmetry of merge.
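
To make the two definitions above concrete, here is a minimal sketch in
Python. It is not bzr code: the names (apply_hunk, merge_copy,
merge_split) are hypothetical, a hunk is simplified to an exact
"find these lines, replace with those" pair, and the "applies to
neither side" case under split semantics is an assumption, since it is
not covered above.

```python
from typing import List, Optional, Tuple

Hunk = Tuple[List[str], List[str]]  # (lines to find, replacement lines)


def apply_hunk(lines: List[str], hunk: Hunk) -> Optional[List[str]]:
    """Replace the first exact occurrence of the hunk's old lines, else None."""
    old, new = hunk
    for i in range(len(lines) - len(old) + 1):
        if lines[i:i + len(old)] == old:
            return lines[:i] + new + lines[i + len(old):]
    return None


def merge_copy(a_prime: List[str], b: List[str], hunk: Hunk):
    """Copy semantics: a change made to A targets both A' and B."""
    conflicts = []
    results = []
    for name, lines in (("A'", a_prime), ("B", b)):
        merged = apply_hunk(lines, hunk)
        if merged is None:
            # The change was meant for this file too, but does not apply.
            conflicts.append(name)
            merged = list(lines)
        results.append(merged)
    return results[0], results[1], conflicts


def merge_split(a_prime: List[str], b: List[str], hunk: Hunk):
    """Split semantics: each hunk may land in only one of A' and B."""
    new_a = apply_hunk(a_prime, hunk)
    new_b = apply_hunk(b, hunk)
    if (new_a is None) == (new_b is None):
        # Applies to both: conflicted in both, per the definition above.
        # Applies to neither: also reported as a conflict (an assumption).
        return list(a_prime), list(b), ["A'", "B"]
    return (new_a if new_a is not None else list(a_prime),
            new_b if new_b is not None else list(b),
            [])
```

With a shared copyright header in both files, a header change from the
unsplit branch lands in both files under merge_copy but is flagged as a
conflict in both under merge_split; a change to a function that lives
only in A' applies cleanly under merge_split and is a conflict on B
under merge_copy.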

Now to me, these are clearly different, but I still don't think they are
different *enough* to justify having two separate concepts in the
system. I may be wrong :).

> >> This is "copy" semantics,
> >> not "split" semantics.  Split semantics would apply some of the changes
> >> to one file, and some of the changes to the other file, and there would
> >> be no changes applied to both.  This would avoid conflicts.
> > 
> > Right. It's copy semantics because I set out to define copy semantics :).
> 
> Well, that's the problem.  You're describing copy semantics even though
> file splitting would describe your use case better.

Well, only if you accept that copy and split are concretely separate
things. At the moment I don't, but one way out of this subdebate is for
me to rewrite the copying side with a use case that is clearly not a
candidate for splitting. Would that help?

> > Actually no, I really did sit down with just the use cases I presented,
> > and tried to design a straightforward, no bells and whistles set of
> > semantics that will meet the criteria of being sane and explainable. In
> > particular, I think split vs copy is an artificial distinction that few
> > users will actually care about
> 
> I think a logical representation would be to represent the entire
> contents of all files as a set of edges.  The beginnings and ends of
> files would just be special edges.
> 
> Merge would then be an edge merge, applied to an optimized variant of
> the entire contents of the tree.
> 
> This representation works well for code movement and file splitting, but
> does an abysmal job of representing copies.

Well, it has no copy semantics defined at all, but surely we can define
them in much the same way at the granularity of lines as at the
granularity of files. We'd want to consider what to do for directory
copies too though: in the proposal I put above, copying a directory, and
merging from an uncopied one that adds a file would copy the file into
the new directory, and I think we'd want to keep that.
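
As a rough illustration of that directory behaviour (again not bzr
code), the sketch below models a tree as a dict of path to content and
records directory copies in a hypothetical dir_copies mapping. When a
merge adds a file under a copied directory, the file is also added under
each copy. What to do if the target path already exists is left out
here.

```python
from typing import Dict, List


def merge_added_file(tree: Dict[str, str],
                     dir_copies: Dict[str, List[str]],
                     path: str,
                     content: str) -> Dict[str, str]:
    """Add `path` to the tree and mirror it into copies of its directory."""
    result = dict(tree)
    result[path] = content
    for source_dir, copies in dir_copies.items():
        prefix = source_dir.rstrip("/") + "/"
        if path.startswith(prefix):
            relative = path[len(prefix):]
            for copy_dir in copies:
                # The copied directory receives the new file as well.
                result[copy_dir.rstrip("/") + "/" + relative] = content
    return result
```

For example, with 'lib' copied to 'lib2', merging a branch that adds
'lib/b.py' yields both 'lib/b.py' and 'lib2/b.py' in the merged tree.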

> I think that if split can give much better behavior, then the
> distinction between splitting and copying is not artificial.  It is real
> and gives real benefits.
> 
> So according to the use case you've supplied, I think you've chosen the
> wrong solution.  We should support file splits, but not file copying.

To make sure I've been paying attention, here is my summary of the list
of better behaviours:
 * repeated merges from before-a-split into after-a-split should not
show conflicts on the portion of the file partitioned into the other part
of the split.

As far as I can tell, that's the only difference?

> >> If you are convinced that template copies are not likely to be common, I
> >> would like to understand why.  (But since you gave copying COPYING as an
> >> example, I am not hopeful.)  Otherwise, I can go into much greater
> >> detail about the potential problems I foresee with template copying.
> > 
> > Please do, I'd like to get all the issues up on the table before we
> > start triaging and making tradeoffs on complexity vs ui etc etc.
> 
> So I hold that some copies are copies, and some copies are not.
> Sometimes when people copy a template, they are making a new template,
> maybe with a few changes.  In that case, a merge should target both
> copies.  But frequently when copying a template, the copies will diverge
> almost instantly.

Do you mean here that there should be three operations? copy, split,
copy-for-diverge? Or are you saying that when you copy a template and
diverge immediately, that that is a form of split?

My position on this at the moment is that it doesn't matter: if you copy
a template to make a new template, changes made from before the copy
should affect both, because bzr cannot know whether they are relevant to
both copies or not; a template that diverges a lot will conflict a lot
when it changes on both sides. That said, changes to templates should be
rare, so I think it should work nicely.

> > I'm averse to biting off too much here though: VCS is a wicked problem,
> > it's not going away anytime soon, and I'd really like to get a good
> > answer to the 'where is bzr cp' question
> 
> See, this is what makes me think you have additional criteria that
> you're admitting.  If we *never* had support for copies, but supported
> file splitting really, really well, would you be happy?

No, because 'splitting' is not the inverse of combine, and the combine
operation which seems to be otherwise non-contentious should be
something users can undo easily post-hoc, just like they can move files
and directories back after a rename.

> > here I would expect: the merged tree has foo, bar, baz, where 'bar' is
> > the bar in A plus the changes made in B to foo before it was copied;
> > 'baz' is the baz in B plus the changes made in A to foo before it was
> > copied; 'foo' contains the changes made in both A and B to foo, and
> > notes that both 'bar' and 'baz' were copied from it when you examine
> > 'bzr log'.
> 
> I'm not sure what I think should happen here.
> 
> What's interesting is that the foo in A has no special relationship to
> the foo in B.

I find that interesting too. It seems right to me though.

> > Well I do note immediately later what we could do as a more advanced
> > implementation, to remove the repetition there.
> 
> True, but you put it off for later, and I don't think that the
> heuristics you're proposing are adequate to replicate the behavior of
> file splits.

Why not? Is there a case where something that knows a file has been
split can do better than the heuristic I proposed?

> >> I am also not convinced that eliding a "deletion conflict" would ever be
> >> a correct choice when dealing with file copies.  Deletion conflicts do
> >> happen with unsplit files, after all.
> > 
> > Let me be more precise. When the deleted region is in THIS, and the
> > altered region in OTHER applied to a different copy which has this
> > region in a non-deleted form, we might consider not showing a conflict
> > on this file, even if it conflicts in the other file in some regard,
> > because we can infer a move of code occurred. This should probably be
> > done in conjunction with greater code move support though, and not as
> > part of the copy implementation, because it is a more general problem.
> 
> The problem is that if the file was actually copied, rather than split,
> you will fail to emit a necessary conflict.

Are you saying that for a, let's call it 'real copy', *every change*
made before the copy must apply to *all copies*, and that the heuristic
I'm proposing, of not complaining about a conflict where in one file the
lines are present (in some form) and in the other they are completely
missing, does not honour this? I can't think of a case where this is
desirable except in a purely hypothetical sense.
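
A minimal sketch of that heuristic, in Python, with hypothetical names
and big simplifications: "present (in some form)" is reduced to an exact
match of the hunk's old lines, and "completely missing" to none of those
lines appearing in the file. A copy where the region is completely
missing is not reported as a conflict, provided the change applied to
some other copy.

```python
from typing import Dict, List, Optional, Tuple

Hunk = Tuple[List[str], List[str]]  # (lines to find, replacement lines)


def find_block(lines: List[str], block: List[str]) -> Optional[int]:
    """Index of the first exact occurrence of `block` in `lines`, else None."""
    for i in range(len(lines) - len(block) + 1):
        if lines[i:i + len(block)] == block:
            return i
    return None


def merge_hunk_across_copies(copies: Dict[str, List[str]], hunk: Hunk):
    """Apply one hunk to every copy, eliding conflicts on copies that
    have dropped the region entirely."""
    old, new = hunk
    merged: Dict[str, List[str]] = {}
    conflicts: List[str] = []
    missing: List[str] = []
    applied_somewhere = False
    for name, lines in copies.items():
        index = find_block(lines, old)
        if index is not None:
            merged[name] = lines[:index] + new + lines[index + len(old):]
            applied_somewhere = True
            continue
        merged[name] = list(lines)
        if any(line in lines for line in old):
            conflicts.append(name)  # region partly present: real conflict
        else:
            missing.append(name)    # region completely missing here
    if not applied_somewhere:
        # Nobody took the change, so staying silent would lose it.
        conflicts.extend(missing)
    return merged, sorted(conflicts)
```

Aaron's objection corresponds to the `missing` branch: for a 'real
copy' the lines may have been deleted deliberately, and this sketch
would then stay silent where a conflict was wanted.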

> > As for the long lived fork, I disagree.
> > There is a profound difference between 'a branch I started a long time
> > ago', and 'a branch that is hostile and does not merge the mainline'.
> 
> I am talking about a case where the fork is the one that merges the
> mainline frequently.  The fork has two copies of COPYING, and they
> become distinct.  The mainline has one.  The mainline changes its copy.
>  The changes affect both copies in the fork.  This is not ideal
> behavior.  It may be the best that's achievable with copies, though.

I see...

> Consider the contents of bzrlib/util.  Would you consider us a hostile
> fork of configobj?  Say we merge changes from the mainline.  If Fuzzyman
> updates his __init__.py (which is currently blank), it's conceivable
> that this would affect other blank copies of __init__.py.

Well, if we've been copying __init__.py all around, sure, I'd expect it
to do that, and *so would we*. It seems hard to have a copy which is a
copy but not a copy. That is, if people use 'bzr cp', they should
*expect* it to propagate, rather than be surprised when it does.

> >>> I don't think we need to know what the copy means. Users are very capable
> >>> of getting what they want given reasonable primitives.
> >> To cite just one problem with this, SVN users whose data we import will
> >> have never read our instructions.
> > 
> > If we operate in a reasonably sane manner, it shouldn't matter. In point
> > of fact though, svn defines copy pretty much how I have, though in less
> > detail w.r.t. merges because they expect users to detail every single
> > merge, every time. That is, they treat merges across branches
> > identically to merges across file copies: you calculate and run an svn
> > merge command to perform the merge.
> > 
> > So a SVN user gets a split by never running 'svn merge -r.... filea
> > fileb', and a copy by running that same command. 
> 
> I don't really follow this.

Uhm. Rephrasing: svn copies don't take merging into consideration
because svn does not perform merge base selection for the user.

So the aspects of advanced merging for copy/split of files that relate
to what to merge into where, and with what bases, are not addressed in
svn. What is there is the ability to do it any which way.

More importantly, I don't think our behaviour post-conversion matters too
much if we can represent svn properly. But I haven't explicitly tried to
accommodate svn at this point; it's obviously in the back of my head, but
I don't use it enough, nor do I think the choices of svn should influence
us too much, for it to be a significant factor in this design decision
(at this point; maybe once we've a proposal we're happy with we can go
back and assess svn conversions in detail, to see what we might want to
tweak or special case for that).

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.