Defining semantics for copying and combining files/directories/symlinks.

Mon Mar 19 04:52:06 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Sun, 2007-03-18 at 22:21 -0400, Aaron Bentley wrote:
>> - This is just file splitting, which is not controversial.  Copies are.
>>   If all you meant was file splitting, you could have saved me a lot of
>>   concern.
>> - On the other hand, file splitting does not allow "us to support copies
>>   as first-class operations" as you previously described.
> 
> I perceive the operation I describe above as being one of copying,
> followed by deletion of some of the content from each side. It's not
> strictly splitting because the file's content is not partitioned. e.g.
> the copyright header is preserved.

I perceive the operation you describe as being splitting, plus adding a
copyright header to one side.  You described it as splitting quite
frequently.

>> This is "copy" semantics,
>> not "split" semantics.  Split semantics would apply some of the changes
>> to one file, and some of the changes to the other file, and there would
>> be no changes applied to both.  This would avoid conflicts.
> 
> Right. It's copy semantics because I set out to define copy semantics :).

Well, that's the problem.  You're describing copy semantics even though
file splitting would describe your use case better.

> Actually no, I really did sit down with just the use cases I presented,
> and tried to design a straightforward, no bells and whistles set of
> semantics that will meet the criteria of being sane and explainable. In
> particular, I think split vs copy is an artificial distinction that few
> users will actually care about

I think a logical representation would be to represent the entire
contents of all files as a set of edges.  The beginnings and ends of
files would just be special edges.

Merge would then be an edge merge, applied to an optimized variant of
the entire contents of the tree.

This representation works well for code movement and file splitting, but
does an abysmal job of representing copies.
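
To make the edge idea concrete, here is a minimal sketch. It is only an
illustration under stated assumptions: the function names are
hypothetical, nothing here is bzr code, every line is assumed to be
unique across the tree, and the "optimized variant" is ignored
completely. It represents a tree as a set of (predecessor, successor)
edges, does a plain three-way set merge, and rebuilds the files.

```python
def tree_edges(tree):
    """Return every (predecessor, successor) edge in a tree.

    `tree` maps path -> list of lines. Each line is assumed to be unique
    across the whole tree, so a line can serve as its own node identity.
    """
    edges = set()
    for path, lines in tree.items():
        # The beginning and end of each file are just special nodes.
        nodes = ([("BOF", path)]
                 + [("line", text) for text in lines]
                 + [("EOF", path)])
        edges.update(zip(nodes, nodes[1:]))
    return edges


def merge_edges(base, this, other):
    """Three-way set merge: THIS minus what OTHER removed, plus what OTHER added."""
    return (this - (base - other)) | (other - base)


def rebuild(edges):
    """Turn an edge set back into a tree; fail if a node has two successors."""
    successor = {}
    for node, nxt in edges:
        if node in successor:
            raise ValueError("conflict: %r has two successors" % (node,))
        successor[node] = nxt
    tree = {}
    for kind, path in [n for n in successor if n[0] == "BOF"]:
        lines, cur = [], successor[(kind, path)]
        while cur[0] == "line":
            lines.append(cur[1])
            cur = successor[cur]
        tree[path] = lines
    return tree


base = {"foo": ["a", "b", "c", "d"]}
this = {"foo": ["a", "b"], "bar": ["c", "d"]}   # THIS split foo in two
other = {"foo": ["a", "b", "c", "x", "d"]}      # OTHER inserted x before d

merged = rebuild(
    merge_edges(tree_edges(base), tree_edges(this), tree_edges(other)))
```

In this toy case OTHER's insertion follows line "c" into bar without any
special handling of the split.  A copy, by contrast, would need the same
line to appear in two files at once, which breaks the one-node-per-line
assumption the sketch depends on.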

I think that if split can give much better behavior, then the
distinction between splitting and copying is not artificial.  It is real
and gives real benefits.

So according to the use case you've supplied, I think you've chosen the
wrong solution.  We should support file splits, but not file copying.

>> If you are convinced that template copies are not likely to be common, I
>> would like to understand why.  (But since you gave copying COPYING as an
>> example, I am not hopeful.)  Otherwise, I can go into much greater
>> detail about the potential problems I foresee with template copying.
> 
> Please do, I'd like to get all the issues up on the table before we
> start triaging and making tradeoffs on complexity vs ui etc etc.

So I hold that some copies are copies, and some copies are not.
Sometimes when people copy a template, they are making a new template,
maybe with a few changes.  In that case, a merge should target both
copies.  But frequently when copying a template, the copies will diverge
almost instantly.

> I think they are both good things to support. It's not true that a 'file
> split' can be represented as just new file + code movement, unless code
> movement will also impact log - and if in fact that is what you are
> thinking, then code movement logic possibly has a massive impact on the
> overall model - per file history might be completely obsoleted.

Yes, it's conceivable that using an edge-based representation might
obsolete a lot of our model.  Or we may be able to generate the
edge-based representation from our model plus a small amount of extra data.

> I'm averse to biting off too much here though: VCS is a wicked problem,
> it's not going away anytime soon, and I'd really like to get a good
> answer to the 'where is bzr cp' question

See, this is what makes me think you have additional criteria that
you're not admitting.  If we *never* had support for copies, but supported
file splitting really, really well, would you be happy?

>>>                 There are two basic cases for merge with respect to
>>>                 copies: Either both branches have already done the copy,
>>>                 or only one has.
>> What about the case where the branches have each done different copies?
> 
> Deriving from what I described above...
> Do you mean:
> branch A copies foo to bar
> branch B copies foo to baz
> branch A merges branch B?

Yes.

> here I would expect: the merged tree has foo, bar, baz, where 'bar' is
> the bar in A plus the changes made in B to foo before it was copied;
> 'baz' is the baz in B plus the changes made in A to foo before it was
> copied; 'foo' contains the changes made in both A and B to foo, and
> notes that both 'bar' and 'baz' were copied from it when you examine
> 'bzr log'.

I'm not sure what I think should happen here.

What's interesting is that the foo in A has no special relationship to
the foo in B.

>>>> It's not clear to me that we should use the same primitive to represent
>>>> both those operations.  The output of a split is two files with no
>>>> common contents that are both related to the base file.  The output of a
>>>> copy is two files that have identical contents to the base file.  In the
>>>> first case, applying a merge from a pre-split tree should apply each
>>>> change only once.  But in the second case, when merging from a pre-copy
>>>> tree, the changes would be applied twice: once to each file.
>>> It's easier for a user to delete a 'deleted-region' conflict than to
>>> manually repeat a merge that we didn't do for them.
>> I wonder, though, how many times they would have to do that.
>> Potentially quite a lot, if they performed the split, and they are
>> running a long-lived branch.  If we support file splits, we can handle
>> this gracefully.  If we support file copies, we cannot.  So if I take
>> your use case at face value, we should support file splits and not file
>> copies.
> 
> Well I do note immediately later what we could do as a more advanced
> implementation, to remove the repetition there.

True, but you put it off for later, and I don't think that the
heuristics you're proposing are adequate to replicate the behavior of
file splits.
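
The difference between the two behaviours can be sketched as follows.
This is a toy model under stated assumptions, not bzr code: lines are
identified by hypothetical integer ids, `merge_split` and `merge_copy`
are made-up names, and a change to a line that a copy no longer contains
is simply reported as a deleted-region conflict.

```python
def merge_split(changed_lines, owner):
    """Split semantics: each changed base line has exactly one owning file."""
    applied = {}
    for line_id in changed_lines:
        applied.setdefault(owner[line_id], []).append(line_id)
    return applied, []  # no deleted-region conflicts arise in this model


def merge_copy(changed_lines, copies):
    """Copy semantics: each change targets every copy of the base file."""
    applied, conflicts = {}, []
    for line_id in changed_lines:
        for path, present in copies.items():
            if line_id in present:
                applied.setdefault(path, []).append(line_id)
            else:
                # The line was deleted from this copy: deleted-region conflict.
                conflicts.append((path, line_id))
    return applied, conflicts


# Base foo has lines 1-4; THIS ends up with foo={1,2} and bar={3,4}.
# OTHER (still pre-split) changes lines 1 and 4.
changed = [1, 4]
split_result = merge_split(changed, {1: "foo", 2: "foo", 3: "bar", 4: "bar"})
copy_result = merge_copy(changed, {"foo": {1, 2}, "bar": {3, 4}})
```

Both calls apply the same changes to the same files, but the copy
version also reports one conflict per file, and in this model it would
report them again on every later merge from the pre-split branch.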

>> I am also not convinced that eliding a "deletion conflict" would ever be
>> a correct choice when dealing with file copies.  Deletion conflicts do
>> happen with unsplit files, after all.
> 
> Let me be more precise. When the deleted region is in THIS, and the
> altered region in OTHER applied to a different copy which has this
> region in a non-deleted form, we might consider not showing a conflict
> on this file, even if it conflicts in the other file in some regard,
> because we can infer a move of code occurred. This should probably be
> done in conjunction with greater code move support though, and not as
> part of the copy implementation, because it is a more general problem.

The problem is that if the file was actually copied, rather than split,
you will fail to emit a necessary conflict.

>> What start out as clones can diverge to such a degree that they deserve
>> a new identity.  If you have a/COPYING (content:gplv2), and you produce
>> b/COPYING, and then, many commits later, you change b/COPYING into
>> gplv3, merges against a/COPYING should not apply to b/COPYING.
> 
> Well, that's true at the level of branches as well. What starts off as
> one project can become another.

True.  If only humans weren't so damn fuzzy.

> As for the long lived fork, I disagree.
> There is a profound difference between 'a branch I started a long time
> ago', and 'a branch that is hostile and does not merge the mainline'.

I am talking about a case where the fork is the one that merges the
mainline frequently.  The fork has two copies of COPYING, and they
become distinct.  The mainline has one.  The mainline changes its copy.
The changes affect both copies in the fork.  This is not ideal
behavior.  It may be the best that's achievable with copies, though.

Consider the contents of bzrlib/util.  Would you consider us a hostile
fork of configobj?  Say we merge changes from the mainline.  If Fuzzyman
updates his __init__.py (which is currently blank), it's conceivable
that this would affect other blank copies of __init__.py.

>>> I don't think we need to know what the copy means. Users are very capable
>>> of getting what they want given reasonable primitives.
>> To cite just one problem with this, SVN users whose data we import will
>> have never read our instructions.
> 
> If we operate in a reasonably sane manner, it shouldn't matter. In point
> of fact though, svn defines copy pretty much how I have, though in less
> detail w.r.t. merges because they expect users to detail every single
> merge, every time. That is, they treat merges across branches
> identically to merges across file copies: you calculate and run an svn
> merge command to perform the merge.
> 
> So a SVN user gets a split by never running 'svn merge -r.... filea
> fileb', and a copy by running that same command. 

I don't really follow this.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF/hb20F+nu1YWqI0RAgeiAJ4q+LaqnWs4Du9xxqUdWoh8mJsGKQCeJCmc
M1GX/XGY7BQcaGBZDJ7oGcM=
=2ffs
-----END PGP SIGNATURE-----