Defining semantics for copying and combining files/directories/symlinks.

Sun Mar 18 22:35:15 GMT 2007

This contains a lot of notes from the recent thread, and because I'm
combining emails I got horribly confused about who-wrote-what, so there's
no attribution here when I quote below; sorry. There is little quoting
left though; it's mainly new prose trying to detail the space.

If we support copying, we should support it for all our versioned
objects: tree-references, directories, symlinks and files. We should
support it for all of them so that users do not get surprised when it
only works some of the time, and for robust code: objects can change
kind, so if we only supported copying of files, a file that was copied
and then became a symlink would be likely to create corner cases.

Likewise, if we support combining, we should support it across the board
for tree-references, directories, symlinks and files, and for the same
reasons.

The behaviour of bzr in the presence of copies and combined files should
be predictable. If it's not, the feature will be hard or unpleasant to
use. To that end, I think we should avoid adding magic to the feature.
For instance, one use case I've seen touted in the past was to 'use
copied files to allow updating common content across many files'. If
this falls out naturally, as an emergent property, then that's fine.
However, I don't think designing explicitly for that is a positive thing.
Our use cases should be real, concrete and immediately useful.

So what use cases matter?

For combining there are two use cases I can think of:
- record that one file has been subsumed into another. (e.g. a user has
two classes FooBase and FooImplementation and decides that really the
separation isn't needed, so combines them, combining the files foo.c and
foo-implementation.c into just foo.c at the same time.)
- resolve duplicate additions of files. (e.g. two bzr devs apply a
regular patch from the internet which adds a file).

For copying files I have a single use case in mind:
- create two files from a single file. (e.g. a user has a class Foo and
is splitting it into two, so they copy foo.c to foo-extracted.c).

I think these are both pretty clear, but what's not clear is how various
commands like annotate, commit, merge, log, revert etc. will work. Merge
in particular is interesting because of its ability to interact with
trees that have not done the combination step yet.

Here's how I'd like to be able to explain file combining to a user. Note
that I don't talk about some potentially advanced things we can do to
reduce conflicts; this is intended as end user documentation, so it
talks enough about the metal that they can completely predict what bzr
will do, and understand how to deal with it.

        When you want to combine two previously separate files into one
        (e.g. because you are combining two classes and want 'bzr
        annotate' to not consider the moved lines to be new) you can use
        the command 'bzr combine FILEA FILEB'. This command tells bzr
        that FILEB has been combined into FILEA (if FILEB is still on
        disk, you will get a warning). After running this command bzr
        treats the combined FILEA as a new logical file (call it 'C')
        with the same filename on disk. This is done as a housekeeping
        measure, so that this file is a child of both FILEA and FILEB,
        and will be affected during merge by changes made to either
        FILEA or FILEB in other branches, and conversely changes made to
        the combined FILEA will affect both FILEA and FILEB when merged
        to a different branch. 'bzr combine' can be run on any two
        versioned paths. After doing a combine a number of commands will
        give you information about the fact this combination has
        occurred. Specifically:
                - bzr log FILEA will show you the changes done to the
                combined file back to the point of combination, and from
                there the changelog for both files. bzr log will also
                note clearly where the combining step occurred.
                - bzr annotate FILEA will assign lines in the file to
                commits made to either FILEA or FILEB.
                - bzr reverting to a revision before they were combined
                will split the files. (There were two files before the
                combination took place, so reverting back before the
                combination needs to create two files).
                - the first bzr commit after the combine will record the
                combination as long as either FILEA or FILEB is
                selected to be committed. (Committing a different path,
                FILED, will leave the combination on local disk but not
                committed, as per usual).
                - bzr merge handles combining like any other merge
                operation: merging is symmetrical, so regardless of
                which branch you cd into and run merge from, you should
                expect the same result.
                There are two basic cases for merge with respect to
                combines: Either both branches already have the combined
                file, or only one does. If both branches have the
                combined file, it acts exactly like regular merge. If
                only one branch has the combined file, e.g. when merging
                from a branch that has committed a combine into a branch
                that has not yet recorded that combine, bzr will try to
                reproduce all the changes you made since the last time
                that branch merged you: it will attempt to represent the
                changes made to FILEA and FILEB before the merge, the
                combine step, and any subsequent changes you have made.
                If this cannot be done, and a conflict occurs, bzr will
                create 1 or 2 sets of '.BASE' and '.THIS' files, with a
                common '.OTHER' file: this reflects bzr's use of a
                standard 3-way merge, and that the output file it wants
                to create (the result of combining your FILEA and FILEB
                like the source branch did) actually involves two
                merges: one with the uncombined base for FILEA, your
                FILEA and the combined file, and one with uncombined
                base for FILEB, your FILEB and the combined file. Due to
                how 3-way merge works it's possible for this to usually
                produce no conflicts, but if conflicts occur on both
                sides, bzr will do what it can, and leave you enough
                information to pick up the pieces with an advanced gui
                conflict resolver.
                Symmetrically, merging a branch that has altered one or
                both of FILEA or FILEB into a branch that has combined
                them will apply the changes made to the combined file;
                and if there are conflicts will do its best to resolve
                them, but also leave a set of .BASE, .OTHER and .THIS
                files matching the set of files the conflict occurred
                on.

                'Cherrypicking' with combined files. If you cherrypick a
                commit from a branch that has not combined the files
                into a branch that has, this will apply the change in
                the cherrypick to the combined file. As above, if
                multiple file changes need to be consolidated into the
                combined file, multiple conflicts may occur. If you
                cherrypick a commit from a branch that has combined two
                files into a branch that has not, the changes made to
                the single combined file are made to both the files in
                the branch that has not combined them. Note that this
                will usually create a large conflict in at least one of
                the files, as the combined file usually has unique
                content which won't be present in the other file.

                Finally, as usual, run 'bzr resolved' once you have
                resolved conflicts, just like normal.
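
As a rough illustration of the 'bzr log' behaviour described above, here
is a minimal Python sketch. It is not bzr code and not a proposed data
model: LogicalFile, combine() and log() are hypothetical names invented
for this example, and merge and conflict handling are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalFile:
    """Hypothetical versioned object: a path plus the files it derives from."""
    path: str
    parents: list = field(default_factory=list)
    changes: list = field(default_factory=list)  # commit messages, oldest first


def combine(file_a, file_b):
    """Return the new logical file 'C': file_a's path, both files as parents."""
    return LogicalFile(path=file_a.path, parents=[file_a, file_b])


def log(f):
    """List f's own changes newest first, then the combine point,
    then the log of each parent in turn."""
    entries = list(reversed(f.changes))
    if f.parents:
        entries.append("combined: " + " + ".join(p.path for p in f.parents))
    for parent in f.parents:
        entries.extend(log(parent))
    return entries


foo = LogicalFile("foo.c", changes=["add FooBase"])
impl = LogicalFile("foo-implementation.c", changes=["add FooImplementation"])
combined = combine(foo, impl)
combined.changes.append("tidy the merged class")
history = log(combined)
```

Here history lists the post-combine change first, then a marker for the
combining step, then the changelog of both original files, which is the
ordering the text above asks of 'bzr log FILEA'.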

And here's how I'd like to explain file splitting (copying) to users.

        When you want to split a single file into two (e.g. you are
        splitting a file containing multiple classes into one file per
        class but still want 'bzr log' to show its history, or 'bzr
        annotate' to produce good data about who-wrote-what), you can
        run 'bzr cp FILEA FILEB'.  After running this command bzr treats
        both FILEA and FILEB as new files which come from the original
        FILEA. This is done as a housekeeping measure, so that changes
        made to the original FILEA in another branch can correctly apply
        to the new FILEA and FILEB, without changes to the new FILEA
        incorrectly applying to FILEB. 'bzr cp' can be run on any
        versioned paths. After doing a 'cp' a number of commands will
        give you information about the fact this has occurred.
        Specifically:
                - bzr log FILEA will note in the log when the copy
                occurred, and bzr log FILEB will show when the copy
                occurred, and the path it was copied from (FILEA). bzr
                log on either FILEA or FILEB will show all history from
                before the copy took place as well as any changes made
                to that file since the copy.
                - bzr annotate will assign lines in the file being
                annotated to commits made both before and after the copy
                (as you would expect).
                - bzr reverting to a revision before the copy took place
                will delete the copy, regardless of which file you
                reverted. (Before the copy took place, there was only
                one file, so if you revert back before the copy, there
                can only be one file).
                - the first bzr commit after the copy will record the
                copy if either FILEA or FILEB is selected, and will
                commit both. (Committing a different path, FILED, will
                leave the copy on local disk but not committed, as
                usual).
                - bzr merge handles copies like any other merge
                operation: merging is symmetrical, so regardless of
                which branch you cd into and run merge from, you should
                expect the same result.
                There are two basic cases for merge with respect to
                copies: Either both branches have already done the copy,
                or only one has. If both branches have already performed
                the copy, then merge acts exactly like normal: there are
                2 files in each branch that need to be merged. If only
                one branch has the copy, e.g. when you merge from a
                branch that has performed a copy, merge will perform the
                copy in your branch, of your current version of the file
                that was copied, and then apply the unique changes made
                to each side of the copy in the other branch to the two
                files you now have. The usual conflict markers
                and .BASE, .THIS and .OTHER will be created. It's
                important to note that if you have made a single
                conflicting change you may see two conflicts: one in
                each side of the copy that was made. Due to how 3-way
                merge works it's possible for concurrent copying and
                editing to usually produce no conflicts, but if
                conflicts occur, bzr will do what it can, and leave you
                enough information to pick up the pieces with an
                advanced gui conflict resolver, or wiggle, or so on.
                Symmetrically, merging a branch that has altered the
                original file into a branch that has copied it will
                apply the changes made to the original file to both
                sides of the copy; and if there are conflicts will do
                its best to resolve them, but also leave a set
                of .BASE, .OTHER and .THIS files matching the set of
                files the conflict occurred on.

                'Cherrypicking' with copied files. If you cherrypick a
                commit from a branch that has not copied the file into a
                branch that has, this will apply the change in the
                cherrypick to both files. If you cherrypick a commit
                from a branch that has copied a file into a branch that
                has not, the changes made to both the original and the
                copy are made to the original file in the branch that
                had not performed the copy. Note that this will usually
                create a large conflict in the original file, as the
                original usually has unique content with respect to at
                least one of the copies.

                Finally, as usual, run 'bzr resolved' once you have
                resolved conflicts, just like normal.
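
To make the 'a change to the original applies to both sides of the copy'
rule concrete, here is a small Python sketch. It is only an illustration
under stated assumptions: the tree is a plain dict of path to text,
copies are recorded as (source, destination) pairs in the order they
were made, and copy_targets() and apply_change() are hypothetical
helpers, not bzr functions. Conflict detection is deliberately omitted.

```python
def copy_targets(copies, path):
    """Return every path that descends from `path` through recorded copies.

    Assumes `copies` is a list of (source, dest) pairs in the order the
    copies were made, so chained copies are followed in a single pass.
    """
    targets = [path]
    for source, dest in copies:
        if source in targets and dest not in targets:
            targets.append(dest)
    return targets


def apply_change(tree, copies, path, edit):
    """Apply `edit` (a text -> text function) made to `path` in a branch
    that has not done the copy to every side of the copy in this tree."""
    for target in copy_targets(copies, path):
        tree[target] = edit(tree[target])
    return tree


# This branch has already run the equivalent of 'bzr cp foo.c foo-extracted.c'.
tree = {"foo.c": "int x;\n", "foo-extracted.c": "int x;\n"}
copies = [("foo.c", "foo-extracted.c")]

# A change made to foo.c in a pre-copy branch lands on both files here.
apply_change(tree, copies, "foo.c", lambda text: text.replace("int", "long"))
```

In a real merge each of the two applications could conflict on its own,
which is why a single conflicting change can show up as two conflicts.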

I think these definitions are predictable to users, and while there is a
lot of text to describe it, it's only a single paragraph to provide the
core; the rest is really exposition. Of particular note, these
operations are symmetric: Performing
bzr cp A B
bzr combine A B
will result in a tree that behaves in the same way as if the copy never
took place. This makes undoing the operation easy for users even after
several commits have taken place.
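
One way to picture why a copy followed by a combine of the same two
paths cancels out is the toy Python model below. This is purely an
illustrative assumption, not a data model proposal: each path maps to
the set of original file ids it derives from, and cp() and combine() are
hypothetical functions named after the commands.

```python
def cp(tree, src, dest):
    """Return a new tree in which dest derives from the same origins as src."""
    new = dict(tree)
    new[dest] = set(tree[src])
    return new


def combine(tree, keep, absorbed):
    """Return a new tree in which `absorbed` has been folded into `keep`."""
    new = dict(tree)
    new[keep] = tree[keep] | tree[absorbed]
    del new[absorbed]
    return new


base = {"A": {"a-id"}}
copied = cp(base, "A", "B")
undone = combine(copied, "A", "B")
```

In this model undone compares equal to base, so later merges would see
the tree exactly as if the copy had never been made.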

Note that I'm still not talking about data models etc. 

>> Copies also make sense for some user
>> operations, like splitting a files contents, or take a file like
>> 'COPYING' that does not change often and putting it into other locations
>> or trees.

> It's not clear to me that we should use the same primitive to represent
> both those operations.  The output of a split is two files with no
> common contents that are both related to the base file.  The output of a
> copy is two files that have identical contents to the base file.  In the
> first case, applying a merge from a pre-split tree should apply each
> change only once.  But in the second case, a merge from a pre-copy tree
> the changes would be applied twice: once to each file.

It's easier for a user to delete a 'deleted-region' conflict than to
manually repeat a merge that we didn't do for them. So in the
definitions up above I've erred on that side; users that want merges
from pre-split trees to apply to just one side can manually delete the
deleted-region conflicts. Later on we could look at detecting when a
conflict in a split file applied correctly in another branch of the
split; if it did and the conflict was a 'region deleted' conflict, we
could elide that conflict completely, with no data-loss implications. I
think that there is not enough of a win by having 'split vs copy'
defined to justify the complexity in explaining it, let alone
implementing it.

> Finally, it's not at all clear that anyone really wants COPYING to be
> treated as the same everywhere.  Because if COPYING changes, that would
> mean that everyone in a project had agreed to change the license.  If
> you have two copies of COPYING, you probably have two sub-projects in a
> tree.  So it's quite conceivable that one sub-project might change their
> license, while the other did not.

I've handled this in the above user instructions by giving the user
predictable behaviour: If the user wants to change all COPYING files
ever, they branch from before the first one was created, change just
COPYING, commit, then merge that wherever.

>> Two versioned paths become one: This is mostly covered in my text about
>> parallel imports. While not quite the same thing they are closely
>> related.
>
> Being the inverse of copies, there are many commands (e.g. merge,
> revert) that would need to handle this, just because of the copy support.

Right. Hopefully once we've agreed on the desired behaviour, we can find
a single model that will let this drop out fairly naturally.

>> Advanced support for copies seems to mostly mean merging, and seems to
>> require knowing more about what the copy means.  Are they copying the
>> file to split it, or make a new copy of the same thing (like the gpl
>> example).
>
> It'd be fantastic if bzr could let the developer know about changes
> in lines coming from the original file.  Since we can't predict what
> behavior is really wanted, a conflict could be enforced in such
> cases, even if the patch would apply cleanly, so that the user must
> go over and review what was changed. Probably not easy to implement,
> but being able to track which lines came from the copy (as mentioned
> for annotating, above) would be a first step.

Well, I don't agree that we should make extra conflicts just because. If
the behaviour is predictable, I don't think most users will want
conflicts when they are taking advantage of copy or combining features.
But yes, we could.

>> Or do you mean being able to say it after you've already committed
>> path/to/added?
>
> The latter (really, I mean a number of commits down the road).  If we
> can say "That file in that other tree is really the same as this
> file", we should be able to say "That file that used to be in this
> tree is really the same as this file" too.

I don't think that this fits with the copy and combining use cases.
Specifically, we try not to carry around baggage that's not related to
the current tree; having arbitrary annotations that point anywhere in
history to define equivalence would imply an inability to discard old
baggage. It may be that we can do this once we get to the design point
though - I'm not ruling it out, just asking really - how important is
this, really? A file that is deleted in all live trees is not in anyone's
active set. If someone wants to have changes to that combine into
another file in the current tree, e.g. to do a forward ported bugfix,
then surely they can just merge it (which will conflict as the file was
deleted), reinstate the file and do 'combine' at this point. Which ends
up with the same result, but no special casing.

>> *file copies*
>> 
>> I note your explanation but I think we can say more about what
>> "support file copies" really means, otherwise we can claim we do it
>> already :-)
>> 
>> It would be elegant if copy+delete had the same effect as renaming. I
>> don't mean we should special-case them to be the same, or that this is
>> necessary.  At least, if they do behave differently, it would be good
>> to have a clear reason why.

I think without special casing it we will certainly get the same rough
behaviour as a rename, following the definitions I've given. That said,
we might want to special case 'delete' to undo a copy in the working
tree, to make it become identical behaviour. An interesting case is when
a user does:
bzr commit -m 'base'
bzr cp A B
bzr commit -m 'cp'
bzr delete A
bzr commit -m 'delete'
Clearly here it's not a rename. Or is it? Perhaps we can handle this, but
it's too early I think to start talking about what that will mean, as
it will (to a large degree) be coupled to the implementation. I hesitate
to say that we want to require that to collapse to a rename when
considered across {base,delete} because it seems marginally useful, not
a matter of correctness, and thus not a large win to be balanced against
some unknown implementation cost.
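
For what 'collapsing to a rename across {base,delete}' could mean in
practice, here is a hedged Python sketch. It assumes, purely for
illustration, that a tree is a dict of path to an origin id and that
copies keep the origin id of their source; classify() is a hypothetical
helper, and nothing here reflects how bzr actually stores trees.

```python
def classify(base, final):
    """Describe what happened to each origin id between two trees."""
    result = {}
    for path, origin in base.items():
        new_paths = sorted(p for p, o in final.items() if o == origin)
        if new_paths == [path]:
            result[origin] = "unchanged"
        elif not new_paths:
            result[origin] = "deleted"
        elif len(new_paths) == 1:
            result[origin] = "renamed to " + new_paths[0]
        else:
            result[origin] = "now at " + ", ".join(new_paths)
    return result


base = {"A": "a-id"}            # after: bzr commit -m 'base'
after_cp = {"A": "a-id", "B": "a-id"}   # after: bzr cp A B
after_delete = {"B": "a-id"}    # after: bzr delete A

step_by_step = classify(base, after_cp)
across_both = classify(base, after_delete)
```

Under these assumptions, comparing base directly with the final tree
reports 'renamed to B', while the intermediate commit reports the file
at both paths. Whether bzr should promise that equivalence is exactly
the open question above.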

>> Advanced support for copies seems to mostly mean merging, and seems to
>> require knowing more about what the copy means.  Are they copying the
>> file to split it, or make a new copy of the same thing (like the gpl
>> example).

I don't think we need to know what the copy means. Users are very capable
of getting what they want given reasonable primitives. While we could
annotate copies with a raft of meta-data, I think most users will
actually be happier having just one command to use (cp), which operates
in the 'obvious manner' - which I have attempted to spell out, so that
it is more than just obvious, it is documented.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.