Defining semantics for copying and combining files/directories/symlinks.

Mon Mar 19 03:02:27 GMT 2007

On Sun, 2007-03-18 at 22:21 -0400, Aaron Bentley wrote:
> Robert Collins wrote:
> > For copying files I have a single use case in mind:
> > - create two files from a single file. (e.g. a user has a class Foo and
> > is splitting it into two, so they copy foo.c to foo-extracted.c).
> 
> So this is problematic, because it doesn't jibe with your previous comments.

Uhm. Well, I didn't see it as problematic. Sorry :(. I did realise that
the mail was getting awfully long; it's highly likely I got a bit
tangled. Let's try to debug.

> - - This is just file splitting, which is not controversial.  Copies are.
>   If all you meant was file splitting, you could have saved me a lot of
>   concern
> - - On the other hand, file splitting does not allow "us to support copies
> as first-class operations" as you previously described.

I perceive the operation I describe above as being one of copying,
followed by deletion of some of the content from each side. It's not
strictly splitting because the file's content is not partitioned, e.g.
the copyright header is preserved in both files.

> Yet lower down, you say "merging a branch that has altered the original
> file into a branch that has copied it will apply the changes made to the
> original file to both sides of the copy;".  This is "copy" semantics,
> not "split" semantics.  Split semantics would apply some of the changes
> to one file, and some of the changes to the other file, and there would
> be no changes applied to both.  This would avoid conflicts.

Right. It's copy semantics because I set out to define copy semantics :).

> Because your solution isn't the best solution for the sole use case you
> describe, I think you are also trying to address other use cases (e.g.
> hi-fidelity SVN imports) that you haven't articulated here.

Actually no, I really did sit down with just the use cases I presented,
and tried to design a straightforward, no-bells-and-whistles set of
semantics that will meet the criteria of being sane and explainable. In
particular, I think split vs copy is an artificial distinction that few
users will actually care about, and that we can infer the intent well
enough that those who do care will still be happy with what we do.
I guess you'll have specific criticisms further down.

> In particular, if we are going to support file copies, it seems foolish
> to support them in ways that do not encompass SVN.
> 
> I have no data about how file copies are used in SVN.  So I don't know
> how common it is to start a new file using an existing file as a
> template.  If it is not uncommon, then we must account for it, and that
> means recognizing that some copies aren't really copies.
>
> Subversion users don't merge (aside from update) on nearly the same
> scale that Bazaar users do, so they are less likely to be bitten by
> copies that aren't really copies.
> 
> If you are convinced that template copies are not likely to be common, I
> would like to understand why.  (But since you gave copying COPYING as an
> example, I am not hopeful.)  Otherwise, I can go into much greater
> detail about the potential problems I foresee with template copying.

Please do, I'd like to get all the issues up on the table before we
start triaging and making tradeoffs on complexity vs ui etc etc.

> Also, it is rather disappointing to see discussion of file-splitting
> without accompanying discussion of code movement.  I think code movement
> is as common as file splitting, and it is equally frustrating to deal
> with moved code as split files.  Some representations of file-splitting
> would also encompass code movement.  For example, a file split could be
> represented as "new file"+"code movement".

I think they are both good things to support. It's not true that a 'file
split' can be represented as just new file + code movement, unless code
movement will also impact log. If that is in fact what you are
thinking, then code movement logic possibly has a massive impact on the
overall model: per-file history might be completely obsoleted.

I'm averse to biting off too much here though: VCS is a wicked problem,
it's not going away anytime soon, and I'd really like to get a good
answer to the 'where is bzr cp' question. If you think that it's
important to design code movement at the same time as copying, I'll
trust your instinct. Mine is to consider them orthogonal: neither can
fully implement the semantics of the other, because they
address different scopes of object, and in both cases there is a
balance between accepting what the user tells us and inferring what
they don't.

> >                 There are two basic cases for merge with respect to
> >                 copies: Either both branches have already done the copy,
> >                 or only one has.
> 
> What about the case where the branches have each done different copies?

Deriving from what I described above...
Do you mean:
branch A copies foo to bar
branch B copies foo to baz
branch A merges branch B?
Here I would expect the merged tree to have foo, bar and baz, where 'bar' is
the bar in A plus the changes made in B to foo before it was copied;
'baz' is the baz in B plus the changes made in A to foo before it was
copied; and 'foo' contains the changes made in both A and B to foo, and
notes that both 'bar' and 'baz' were copied from it when you examine
'bzr log'.
Or:
branch A copies foo to bar
branch B copies foo to bar
branch A merges branch B?
Here I would expect a path conflict on bar in our current codebase, but
otherwise the same as the first case above. Ideally we could resolve
that path conflict for the user as an automatic combine operation, and
merge the content well, given that it has a common parent (somewhere
along foo's ancestry).
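
The two scenarios above can be sketched as a toy model. This is only an
illustration of the semantics being proposed, not bzr code: the Branch
class, its fork/edit/copy methods and the merge function are all
hypothetical names, and a file's content is reduced to a set of change
labels. It also ignores ordering, so any change the other branch made to
the copy source is propagated, whereas the text above only speaks of
changes made before the copy.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    # path -> set of labels naming the changes applied to that file
    files: dict = field(default_factory=dict)
    # copy destination -> copy source (what a hypothetical 'bzr cp' would record)
    copied_from: dict = field(default_factory=dict)

    def fork(self):
        """Return an independent branch with the same contents."""
        return Branch({p: set(c) for p, c in self.files.items()},
                      dict(self.copied_from))

    def edit(self, path, change):
        self.files[path].add(change)

    def copy(self, src, dst):
        self.files[dst] = set(self.files[src])
        self.copied_from[dst] = src


def merge(this, other):
    """Merge `other` into `this`; return (merged branch, path conflicts)."""
    files = {}
    conflicts = []
    for path in sorted(set(this.files) | set(other.files)):
        files[path] = this.files.get(path, set()) | other.files.get(path, set())
        # Both sides created the same path by copying: a path conflict today.
        if path in this.copied_from and path in other.copied_from:
            conflicts.append(path)
    # Changes the opposite branch made to a copy source also reach the copy.
    for side, opposite in ((this, other), (other, this)):
        for dst, src in side.copied_from.items():
            files[dst] |= opposite.files.get(src, set()) - side.files.get(src, set())
    copied_from = {**other.copied_from, **this.copied_from}
    return Branch(files, copied_from), conflicts


base = Branch({"foo": {"base"}})
a = base.fork(); a.edit("foo", "a1"); a.copy("foo", "bar")
b = base.fork(); b.edit("foo", "b1"); b.copy("foo", "baz")
merged, conflicts = merge(a, b)
```

In this model the first scenario gives foo, bar and baz all carrying the
changes from both branches, with bar and baz both recorded as copies of
foo. Having B copy foo to bar instead reports bar as a path conflict.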

> >> It's not clear to me that we should use the same primitive to represent
> >> both those operations.  The output of a split is two files with no
> >> common contents that are both related to the base file.  The output of a
> >> copy is two files that have identical contents to the base file.  In the
> >> first case, applying a merge from a pre-split tree should apply each
> >> change only once.  But in the second case, a merge from a pre-copy tree
> >> the changes would be applied twice: once to each file.
> > 
> > Its easier for a user to delete a 'deleted-region' conflict than to
> > manually repeat a merge that we didn't do for them.
> 
> I wonder, though, how many times they would have to do that.
> Potentially quite a lot, if they performed the split, and they are
> running a long-lived branch.  If we support file splits, we can handle
> this gracefully.  If we support file copies, we cannot.  So if I take
> your use case at face value, we should support file splits and not file
> copies.

Well, I do note immediately afterwards what we could do in a more
advanced implementation to remove the repetition there.

> > Later on we could look at detecting when a
> > conflict in a split file applied correctly in another branch of the
> > split; if it did and the conflict was a 'region deleted' conflict, we
> > could elide that conflict completely, with no data-loss implications. I
> > think that there is not enough of a win by having 'split vs copy'
> > defined to justify the complexity in explaining it, let alone
> > implementing it.
> 
> I think you are saying, "supporting copies at the expense of file-splits
> is a win", which contradicts your single use case.
>
> Though some nuance is probably in order: by implementing file splits
> rather than file copies, we could apply only the relevant changes to
> each side of the split.  So rather than "eliding conflicts" as you say,
> we could simply not produce the conflicts in the first place.
> 
> I am also not convinced that eliding a "deletion conflict" would ever be
> a correct choice when dealing with file copies.  Deletion conflicts do
> happen with unsplit files, after all.

Let me be more precise. When the region is deleted in THIS, and the
alteration to that region in OTHER applied to a different copy which
still has the region in a non-deleted form, we might consider not
showing a conflict on this file, even if it conflicts in the other file
in some regard, because we can infer that a move of code occurred. This
should probably be done in conjunction with greater code move support
though, and not as part of the copy implementation, because it is a
more general problem.
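
A minimal sketch of that rule, under heavy assumptions: files_to_flag is
a hypothetical helper, regions are just names, and "the change applied
to another copy" is approximated by "another copy still contains the
region". Nothing like this exists in bzr today.

```python
def files_to_flag(region, copies):
    """Return the copies on which a 'region deleted' conflict is still shown.

    `copies` maps each path derived from the same source file to the set
    of region names that still exist in THIS. `region` is the one that
    OTHER altered.
    """
    missing = sorted(p for p, regions in copies.items() if region not in regions)
    kept_somewhere = len(missing) < len(copies)
    # If some copy still holds the region, infer that the code moved there
    # and stay quiet about the copies that deleted it.
    return [] if kept_somewhere else missing
```

For example, if foo.c dropped a region that foo-extracted.c kept, no
deletion conflict is reported; if both copies dropped it, both are
flagged as before.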

> >> Finally, it's not at all clear that anyone really wants COPYING to be
> >> treated as the same everywhere.
> 
> > I've handled this in the above user instructions by giving the user
> > predictable behaviour: If the user wants to change all COPYING files
> > ever, they branch from before the first one was created, change just
> > COPYING, commit, then merge that wherever.
> 
> Not good enough.  The branch with the copy may be a long-lived fork, and
> so your "branch from before the first one was created" scenario can
> effectively happen by accident.
>
> What start out as clones can diverge to such a degree that they deserve
> a new identity.  If you have a/COPYING (content:gplv2), and you produce
> b/COPYING, and then, many commits later, you change b/COPYING into
> gplv3, merges against a/COPYING should not apply to b/COPYING.

Well, that's true at the level of branches as well. What starts off as
one project can become another. As for the long-lived fork, I disagree.
There is a profound difference between 'a branch I started a long time
ago' and 'a branch that is hostile and does not merge the mainline'.
The former will have merged the mainline many times; it will have
received the copy details for a/COPYING->b/COPYING, and had to resolve
any conflicts between the later edits to b/COPYING and its own
changed version. In the latter case, there will be no changes in common
between the time of the fork and the time of the merge, and there will be
a humungous merge conflict to resolve; this is just one part of it.

> >>> Advanced support for copies seems to mostly mean merging, and seems to
> >>> require knowing more about what the copy means.  Are they copying the
> >>> file to split it, or make a new copy of the same thing (like the gpl
> >>> example).
> > 
> > I dont think we need to know what the copy means. Users are very capable
> > of getting what they want given reasonable primitives.
> 
> To cite just one problem with this, SVN users whose data we import will
> have never read our instructions.

If we operate in a reasonably sane manner, it shouldn't matter. In point
of fact though, svn defines copy pretty much how I have, though in less
detail w.r.t. merges because they expect users to detail every single
merge, every time. That is, they treat merges across branches
identically to merges across file copies: you calculate and run an svn
merge command to perform the merge.

So an SVN user gets a split by never running 'svn merge -r.... filea
fileb', and a copy by running that same command.

I can certainly go back and add 'allow high fidelity representation of
svn copies' to the use cases. Before I do that, I'd appreciate your
expansion on the problems you mentioned above.

Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.