Defining semantics for copying and combining files/directories/symlinks.

Mon Mar 19 13:30:46 GMT 2007

Robert Collins wrote:
>> Robert Collins wrote:
> 
> let's assume that splitting and copying are sufficiently different that
> they need separate definitions, and let's make the line a hard one:
>  * splitting A into A and B gives an A' and a B that share a common
> heritage, but no operations from that point on will consider them
> linked. Merges from an unsplit branch will ???. Merging to the split
> branch will ???. I don't have good answers to these two '???'s because
> I'm assuming that we want something different to the copy case. I'll try
> though: Merges from an unsplit branch will apply to both A' and B, but
> each hunk of difference may only apply to one of A' and B, if it applies
> successfully to both A' and B it is marked as conflicted in both.

What I have in mind for file splits is that they would be represented as
 "Lines X from file A become file A', Lines Y from file A become file B".

This would mean that when we were performing the merge, changes
affecting lines X in the old file would be applied to file A', and
changes affecting lines Y in the old file would be applied to file B.

There would be no changes that we would attempt to apply to both
files; each change would be applied to one or the other.
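
As a rough illustration of that routing (a sketch only: `Region`,
`route_change` and the line numbers are hypothetical, not anything in bzr,
and regions are assumed to be half-open ranges over the pre-split file):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Half-open range [start, end) of 0-based line numbers in the old file A."""
    start: int
    end: int
    target: str  # name of the file this region became after the split


def route_change(regions, changed_line):
    """Return (target file, line within target) for a change to the old file."""
    for region in regions:
        if region.start <= changed_line < region.end:
            return region.target, changed_line - region.start
    raise ValueError("line %d is not covered by the split record" % changed_line)


# Hypothetical split record: lines 0-9 of A became A', lines 10-24 became B.
split_of_a = [Region(0, 10, "A'"), Region(10, 25, "B")]
```

With that record, `route_change(split_of_a, 12)` picks B and only B, so the
change is never attempted against A'.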

> Merges
> from a split branch to an unsplit branch will split the file at the
> point it was split in the source branch and apply the changes from the
> target branches A to both A' and B as per the reverse operation.

I would have described it as splitting A into A' and B per the original
split, then applying the changes from the previously-split branch to the
newly-split one.

> Now to me, these are clearly different, but I still don't think they are
> different *enough* to justify having two separate concepts in the
> system. I may be wrong :).

I'm not proposing that we have both copy and split.  I think that the
distinction would not be clear enough.  In terms of your use case, I
think split has better behavior.  The edge representation that would
work so well for split would also work nicely for moving code between
files, and for moving code within a file.

So based on our criteria so far, I think we should have split and
not copy.

> Well, if you accept that having copy and split be concretely separate
> things. At the moment I don't, but one way out of this subdebate is
> for me to rewrite the copying side with a use case that is clearly not a
> candidate for splitting. Would that help?

Yes.  That would change our criteria, and split would no longer be a
clear winner.

>> I think a logical representation would be to represent the entire
>> contents of all files as a set of edges.  The beginnings and ends of
>> files would just be special edges.
>>
>> Merge would then be an edge merge, applied to an optimized variant of
>> the entire contents of the tree.
>>
>> This representation works well for code movement and file splitting, but
>> does an abysmal job of representing copies.
> 
> Well, it has no copy semantics defined at all, but surely we can define
> them in much the same way at the granularity of lines as at the
> granularity of files.

True.  My point is that this representation can reflect a lot of
desirable operations: split, move-between-files and (though I forgot to
mention it) move-inside-a-file.  But it cannot represent copy by itself.
 This is one of the reasons I consider the difference between copy and
split to be a real difference.
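
One possible reading of the edge idea, as a sketch under loud assumptions:
`tree_edges` is a hypothetical helper, and every line is assumed to be
unique across the tree (real text is not, so a real system would need line
ids).

```python
def tree_edges(files):
    """Return the set of (previous, next) edges for a whole tree.

    `files` maps a file name to its list of lines.  The beginning and end
    of each file are represented by special ("BEGIN", name) and
    ("END", name) nodes.
    """
    edges = set()
    for name, lines in files.items():
        chain = [("BEGIN", name)] + list(lines) + [("END", name)]
        edges.update(zip(chain, chain[1:]))
    return edges


content = ["x1", "x2", "y1", "y2"]
unsplit = tree_edges({"A": content})
split = tree_edges({"A": content[:2], "B": content[2:]})
copied = tree_edges({"A": content, "C": content})
```

In this sketch a split only removes the edge `("x2", "y1")` and touches the
begin/end edges; the interior edges survive.  A copy adds nothing but new
begin/end edges, because the interior edges are already in the set, so the
set alone cannot say that the content now exists twice.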

> We'd want to consider what to do for directory
> copies too though: in the proposal I put above, copying a directory, and
> merging from an uncopied one that adds a file would copy the file into
> the new directory, and I think we'd want to keep that.

Yeesh.  Directory copies aren't handled by the file splitting concept at
all.

>> So according to the use case you've supplied, I think you've chosen the
>> wrong solution.  We should support file splits, but not file copying.
> 
> To summarise the list of better behaviours: to make sure I've been
> paying attention, they are:
>  * repeated merges from before-a-split into after-a-split should not
> show conflicts on the portion of the file partitioned into the other part
> of the split.
> 
> As far as I can tell, that's the only difference?

The only difference is better merge behavior-- changes are only applied
once, and only to the correct portion of the file.

>> So I hold that some copies are copies, and some copies are not.
>> Sometimes when people copy a template, they are making a new template,
>> maybe with a few changes.  In that case, a merge should target both
>> copies.  But frequently when copying a template, the copies will diverge
>> almost instantly.
> 
> Do you mean here that there should be three operations? copy, split,
> copy-for-diverge? Or are you saying that when you copy a template and
> diverge immediately, that that is a form of split?

No, I think that copies may become unrelated either immediately (in
which case, we can berate the user for using "bzr cp" instead of "cp;bzr
add") or later on.

> My position on this at the moment is that it doesn't matter: If you copy
> a template to make a new template, changes made from before the copy
> should affect both, because bzr cannot know whether they are relevant to
> both copies or not; a template where you diverge a lot will conflict
> when they change on both sides a lot. That said, changing of templates
> should be rare, and it should work nicely I think.

Say the user has done a copy of a template, and they now decide that
they want the file to be distinct from the template.  "bzr remove foo;
bzr add foo" isn't a good option, because it damages merging from recent
branches.  So I think if we're supporting true copies, we would want a
way to break the association between "foo" and the old template.

>>> I'd really like to get a good
>>> answer to the 'where is bzr cp' question
>> See, this is what makes me think you have additional criteria that
>> you're admitting.  If we *never* had support for copies, but supported
>> file splitting really, really well, would you be happy?
> 
> No, because 'splitting' is not the inverse of combine

We differ here.  If splits are represented as "Lines X from A become A',
lines Y from A become B", then splits are the inverse: "A' becomes lines
X in A", "B becomes lines Y in A".  It's perfectly possible for combine
to be symmetrical with split.

>, and the combine
> operation which seems to be otherwise non-contentious should be
> something users can undo easily post-hoc, just like they can move files
> and directories back after a rename.

I think the symmetry means that users would be able to undo a
split+combine easily.
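
A minimal sketch of that symmetry, assuming the split record is a mapping
from new file name to a half-open `(start, end)` line range and that the
ranges cover the old file exactly (`split` and `combine` are hypothetical
names, not bzr functions):

```python
def split(lines, regions):
    """Split `lines` into new files according to {name: (start, end)}."""
    return {name: lines[start:end] for name, (start, end) in regions.items()}


def combine(parts, regions):
    """Inverse of split: put each unmodified part back where it came from."""
    total = max(end for _, end in regions.values())
    lines = [None] * total
    for name, (start, end) in regions.items():
        lines[start:end] = parts[name]
    return lines


old_a = ["line 1", "line 2", "line 3", "line 4"]
record = {"A'": (0, 2), "B": (2, 4)}
parts = split(old_a, record)
```

Here `combine(parts, record)` gives back `old_a`, because the same record
is read in the opposite direction.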

>>> Well I do note immediately later what we could do as a more advanced
>>> implementation, to remove the repetition there.
>> True, but you put it off for later, and I don't think that the
>> heuristics you're proposing are adequate to replicate the behavior of
>> file splits.
> 
> Why not? Is there a case where something that knows a file has been
> split can do better than the heuristic I proposed?

Yes.  If splits record what regions went into each file, then they can
apply only the changes that affect that region to that file.

>> The problem is that if the file was actually copied, rather than split,
>> you will fail to emit a necessary conflict.
> 
> Are you saying that for a, lets call it 'real copy', that *every change*
> made before the copy must apply to *all copies*, and the heuristic I'm
> proposing of not complaining about a conflict which on one file the
> lines are present (in some form) and in the other are completely missing
> does not honour this? I can't think of a case where this is desirable
> except in a purely hypothetical sense.

For original file A, containing these lines:

"""import StringIO
f = StringIO.StringIO()
"""

Suppose there are two copies, B and C.

B is unchanged.

C has """import cStringIO as StringIO
f = StringIO.StringIO()
"""

Now suppose a new version of A has

"""from StringIO import StringIO
f = StringIO()
"""

If we apply that change to B and C using your heuristic, C will have

"""import cStringIO as StringIO
f = StringIO()
"""

It would be better to have

"""
<<<<<<< MERGE-SOURCE
from StringIO import StringIO
=======
import cStringIO as StringIO
>>>>>>> TREE

f = StringIO()
"""

The same kind of thing could prevent a bugfix from being applied
everywhere it was relevant.  When you have two copies, it's just not
kosher to silently drop conflicts.
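
The example above can be reproduced with a deliberately naive line-by-line
three-way merge.  This is a sketch, not bzr's merge algorithm:
`merge3_lines` is a hypothetical helper and it assumes all three texts have
the same number of lines.

```python
def merge3_lines(base, source, tree):
    """Line-by-line three-way merge for texts with equal line counts."""
    merged = []
    for b, s, t in zip(base, source, tree):
        if s == t or s == b:
            merged.append(t)   # both sides agree, or only the tree changed
        elif t == b:
            merged.append(s)   # only the merge source changed this line
        else:
            # Both sides changed the line differently: emit a conflict.
            merged.extend(
                ["<<<<<<< MERGE-SOURCE", s, "=======", t, ">>>>>>> TREE"]
            )
    return merged


old_a = ["import StringIO", "f = StringIO.StringIO()"]
new_a = ["from StringIO import StringIO", "f = StringIO()"]
copy_b = list(old_a)  # B is unchanged
copy_c = ["import cStringIO as StringIO", "f = StringIO.StringIO()"]

merged_b = merge3_lines(old_a, new_a, copy_b)
merged_c = merge3_lines(old_a, new_a, copy_c)
```

B takes the new text cleanly, while C gets a conflict on the import line
and the updated second line, which is the outcome argued for above.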

>> Consider the contents of bzrlib/util.  Would you consider us a hostile
>> fork of configobj?  Say we merge changes from the mainline.  If Fuzzyman
>> updates his __init__.py (which is currently blank), it's conceivable
>> that this would affect other blank copies of __init__.py.
> 
> Well, if we've been copying __init__.py all around. Sure, I'd expect it to
> do that, and *so would we*. It seems hard to have a copy which is a copy
> but not a copy. That is, if people use 'bzr cp', they should *expect* it
> to propagate, rather than be surprised when it does.

And I hold that people will intuit that "bzr cp" should be used in
preference to "cp && bzr add", and that when they realize their
mistake, they should be able to fix it without "bzr remove && bzr add".

> More importantly, I don't think our behaviour post-conversion matters too
> much if we can represent svn properly. But I haven't explicitly tried to
> accommodate svn at this point; it's obviously in the back of my head, but
> I don't use it enough, nor do I think the choices of svn should influence
> us too much, for it to be a significant factor in this design decision
> (at this point; maybe once we've a proposal we're happy with we can go
> back and assess svn conversions in detail, to see what we might want to
> tweak or special case for that).

Understanding your position on svn is helpful.

I think we are swimming in possibilities ATM, and it would be really
nice to get some statistics on how people use (non-branching) SVN
copies.  They are the most prominent users of file copying, and I think
if we're going to implement copying, it would help for us to understand
how it's commonly used.

Aaron