Defining semantics for copying and combining files/directories/symlinks.
Robert Collins
robertc at robertcollins.net
Sun Mar 18 22:35:15 GMT 2007
This contains a lot of notes from the recent thread, and because I'm
combining emails I got horribly confused about who-wrote-what, so there's
no attribution here when I quote below; sorry. There is little quoting
left though; it's mainly new prose trying to detail the space.
If we support copying, we should support it for all our versioned
objects: tree-references, directories, symlinks and files. We should
support it for all of them so that users do not get surprised when it
only works some of the time, and for robust code: objects can change
kind, so if we only supported copying of files, a file that was copied
and then became a symlink would be likely to create corner cases.
Likewise, if we support combining, we should support it across the board
for tree-references, directories, symlinks and files, and for the same
reasons.
The behaviour of bzr in the presence of copies and combined files should
be predictable. If it's not, the feature will be hard or unpleasant to
use. To that end, I think we should avoid adding magic to the feature.
For instance, one use case I've seen touted in the past was to 'use
copied files to allow updating common content across many files'. If
this falls out naturally, as an emergent property, then that's fine.
However, designing explicitly for that is not a positive thing, I think.
Our use cases should be real, concrete and immediately useful.
So what use cases matter?
For combining there are two use cases I can think of:
- record that one file has been subsumed into another. (e.g. a user has
two classes FooBase and FooImplementation and decides that really the
separation isn't needed, so combines them, combining the files foo.c and
foo-implementation.c into just foo.c at the same time.)
- resolve duplicate additions of files. (e.g. two bzr devs apply a
regular patch from the internet which adds a file).
For copying files I have a single use case in mind:
- create two files from a single file. (e.g. a user has a class Foo and
is splitting it into two, so they copy foo.c to foo-extracted.c).
I think these are both pretty clear, but what's not clear is how various
commands like annotate, commit, merge, log, revert etc. will work. Merge
in particular is interesting because of its ability to interact with
trees that have not done the combination step yet.
Here's how I'd like to be able to explain file combining to a user. Note
that I don't talk about some potentially advanced things we can do to
reduce conflicts; this is intended as end user documentation, so it
talks enough about the metal that they can completely predict what bzr
will do, and understand how to deal with it.
When you want to combine two previously separate files into one
(e.g. because you are combining two classes and want 'bzr
annotate' to not consider the moved lines to be new) you can use
the command 'bzr combine FILEA FILEB'. This command tells bzr
that FILEB has been combined into FILEA (if FILEB is still on
disk, you will get a warning). After running this command bzr
treats the combined FILEA as a new logical file (call it 'C')
with the same filename on disk. This is done as a housekeeping
measure, so that this file is a child of both FILEA and FILEB,
and will be affected during merge by changes made to either
FILEA or FILEB in other branches, and conversely changes made to
the combined FILEA will affect both FILEA and FILEB when merged
to a different branch. 'bzr combine' can be run on any two
versioned paths. After doing a combine a number of commands will
give you information about the fact this combination has
occurred. Specifically:
- bzr log FILEA will show you the changes done to the
combined file back to the point of combination, and from
there the changelog for both files. bzr log will also
note clearly where the combining step occurred.
- bzr annotate FILEA will assign lines in the file to
commits made to either FILEA or FILEB.
- bzr reverting to a revision before they were combined
will split the files. (There were two files before the
combination took place, so reverting back before the
combination needs to create two files).
- the first bzr commit after the combine will record the
combination as long as either FILEA or FILEB is
selected to be committed. (Committing a different path,
FILED, will leave the combination on local disk but not
committed, as per usual).
- bzr merge handles combining like any other merge
operation: merging is symmetrical, so regardless of
which branch you cd into and run merge from, you should
expect the same result.
There are two basic cases for merge with respect to
combines: either both branches already have the combined
file, or only one does. If both branches have the
combined file, it acts exactly like regular merge. If
only one branch has the combined file, e.g. when merging
from a branch that has committed a combine into a branch
that has not yet recorded that combine, merge will try to
reproduce all the changes you made since the last time
that branch merged from you: it will attempt to represent
the changes made to FILEA and FILEB before the combine, the
combine step, and any subsequent changes you have made.
If this cannot be done, and a conflict occurs, bzr will
create 1 or 2 sets of '.BASE' and '.THIS' files, with a
common '.OTHER' file: this reflects bzr's use of a
standard 3-way merge, and that the output file it wants
to create (the result of combining your FILEA and FILEB
like the source branch did) actually involves two
merges: one with the uncombined base for FILEA, your
FILEA and the combined file, and one with uncombined
base for FILEB, your FILEB and the combined file. Because
of how 3-way merge works, this usually produces no
conflicts, but if conflicts occur on both
sides, bzr will do what it can, and leave you enough
information to pick up the pieces with an advanced gui
conflict resolver.
Symmetrically, merging a branch that has altered one or
both of FILEA or FILEB into a branch that has combined
them will apply those changes to the combined file;
and if there are conflicts will do its best to resolve
them, but also leave a set of .BASE, .OTHER and .THIS
files matching the set of files the conflict occurred
on.
'Cherrypicking' with combined files. If you cherrypick a
commit from a branch that has not combined the files
into a branch that has, this will apply the change in
the cherrypick to the combined file. As above, if
multiple file changes need to be consolidated into the
combined file, multiple conflicts may occur. If you
cherrypick a commit from a branch that has combined two
files into a branch that has not, the changes made to
the single combined file are made to both the files in
the branch that has not combined them. Note that this
will usually create a large conflict in at least one of
the files, as the combined file usually has unique
content which won't be present in the other file.
Finally, as usual, run 'bzr resolved' once you have
resolved conflicts, just like normal.
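As a rough illustration of the 'bzr log' behaviour described above, here is a toy Python model. It is only a sketch of the user-visible behaviour, not a proposal for a data model, and every name in it (LogicalFile, combine, log) is hypothetical rather than part of bzr.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalFile:
    """A versioned object as the user sees it (hypothetical toy type)."""
    name: str
    parents: tuple = ()  # logical files this one was combined from
    changes: list = field(default_factory=list)  # commit messages, oldest first


def combine(a, b):
    """Model 'bzr combine A B': a new logical file, child of both, named like A."""
    return LogicalFile(name=a.name, parents=(a, b))


def log(f):
    """Newest-first log: f's own changes, the combine note, then each parent's log."""
    lines = [f"{f.name}: {msg}" for msg in reversed(f.changes)]
    if f.parents:
        sources = " and ".join(p.name for p in f.parents)
        lines.append(f"{f.name}: combined from {sources}")
        for parent in f.parents:
            lines.extend(log(parent))
    return lines


file_a = LogicalFile("foo.c", changes=["add FooBase"])
file_b = LogicalFile("foo-implementation.c", changes=["add FooImplementation"])
combined = combine(file_a, file_b)
combined.changes.append("tidy the combined class")
for line in log(combined):
    print(line)
```

The log of the combined file lists its own changes first, then the point of combination, then the history of each of the two files it came from.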
And here's how I'd like to explain file splitting (copying) to users.
When you want to split a single file into two (e.g. you are
splitting a file containing multiple classes into one file per
class but still want 'bzr log' to show its history, or 'bzr
annotate' to produce good data about who-wrote-what), you can
run 'bzr cp FILEA FILEB'. After running this command bzr treats
both FILEA and FILEB as new files which come from the original
FILEA. This is done as a housekeeping measure, so that changes
made to the original FILEA in another branch can correctly apply
to the new FILEA and FILEB, without changes to the new FILEA
incorrectly applying to FILEB. 'bzr cp' can be run on any
versioned paths. After doing a 'cp' a number of commands will
give you information about the fact this has occurred.
Specifically:
- bzr log FILEA will note in the log when the copy
occurred, and bzr log FILEB will show when the copy
occurred, and the path it was copied from (FILEA). bzr
log on either FILEA or FILEB will show all history from
before the copy took place as well as any changes made
to that file since the copy.
- bzr annotate will assign lines in the file being
annotated to commits made both before and after the copy
(as you would expect).
- bzr reverting to a revision before the copy took place
will delete the copy, regardless of which file you
reverted. (Before the copy took place, there was only
one file, so if you revert back before the copy, there
can only be one file).
- the first bzr commit after the copy will record the
copy if either FILEA or FILEB is selected, and will
commit both. (Committing a different path, FILED, will
leave the copy on local disk but not committed, as
usual).
- bzr merge handles copies like any other merge
operation: merging is symmetrical, so regardless of
which branch you cd into and run merge from, you should
expect the same result.
There are two basic cases for merge with respect to
copies: Either both branches have already done the copy,
or only one has. If both branches have already performed
the copy, then merge acts exactly like normal: there are
2 files in each branch that need to be merged. If only
one branch has the copy, e.g. when you merge from a
branch that has performed a copy, merge will perform the
copy in your branch, of your current version of the file
that was copied, and then apply the unique changes made
to each side of the copy in the other branch to the two
files you now have. The usual conflict markers
and .BASE, .THIS and .OTHER files will be created. It's
important to note that if you have made a single
conflicting change you may see two conflicts: one in
each side of the copy that was made. Because of how 3-way
merge works, concurrent copying and editing usually
produces no conflicts, but if
conflicts occur, bzr will do what it can, and leave you
enough information to pick up the pieces with an
advanced gui conflict resolver, or wiggle, or so on.
Symmetrically, merging a branch that has altered the
original file into a branch that has copied it will
apply the changes made to the original file to both
sides of the copy; and if there are conflicts will do
its best to resolve them, but also leave a set
of .BASE, .OTHER and .THIS files matching the set of
files the conflict occurred on.
'Cherrypicking' with copied files. If you cherrypick a
commit from a branch that has not copied the file into a
branch that has, this will apply the change in the
cherrypick to both files. If you cherrypick a commit
from a branch that has copied a file into a branch that
has not, the changes made to both the original and the
copy are made to the original file in the branch that
had not performed the copy. Note that this will usually
create a large conflict in the original file, as the
original usually has unique content with respect to at
least one of the copies.
Finally, as usual, run 'bzr resolved' once you have
resolved conflicts, just like normal.
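The two merge directions described above can be sketched as a toy Python routine that answers "which path receives this change?". This is an illustration under simple assumptions (copy records are a plain mapping from copy to source, with no cycles), and the function names are hypothetical, not bzr APIs.

```python
def fan_out(changed_path, copies):
    """Paths in a branch that HAS copied which receive a change made to
    changed_path in a branch that has not. `copies` maps copy -> source."""
    return [changed_path] + sorted(
        dest for dest, src in copies.items() if src == changed_path
    )


def fold_back(changed_path, copies):
    """Path in a branch that has NOT copied which receives a change made to
    changed_path in a branch that has. Assumes the copy records are acyclic."""
    while changed_path in copies:
        changed_path = copies[changed_path]  # follow the record to its source
    return changed_path


copies = {"foo-extracted.c": "foo.c"}
print(fan_out("foo.c", copies))              # change lands on both sides
print(fold_back("foo-extracted.c", copies))  # change lands on the original
```

A change to the original fans out to both sides of the copy, which is why a single conflicting change can show up as two conflicts; changes to either side fold back onto the one original file.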
I think these definitions are predictable to users, and while there is a
lot of text to describe it, it only takes a single paragraph to provide
the core; the rest is really exposition. Of particular note, these
operations are inverses of each other: performing
bzr cp A B
bzr combine A B
will result in a tree that behaves in the same way as if the copy never
took place. This makes undoing the operation easy for users even after
several commits have taken place.
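One way to picture this cancellation is to ask where an incoming change to A ends up. The Python sketch below is a toy model with hypothetical names (route, copies, combines); it assumes copies are recorded as copy -> source and combines as absorbed path -> surviving path, which is an assumption made for illustration only.

```python
def route(path, copies, combines):
    """Set of paths that receive an incoming change made to `path`.

    copies maps copy -> source; combines maps absorbed path -> surviving path.
    """
    # A copy fans the change out to every copy of the path...
    after_copy = {path} | {dest for dest, src in copies.items() if src == path}
    # ...and a combine folds absorbed paths back into the surviving one.
    return {combines.get(p, p) for p in after_copy}


no_copy = route("A", copies={}, combines={})
copy_only = route("A", copies={"B": "A"}, combines={})
copy_then_combine = route("A", copies={"B": "A"}, combines={"B": "A"})
print(sorted(no_copy), sorted(copy_only), sorted(copy_then_combine))
```

In this model a tree that copied A to B and then combined B back into A routes changes exactly like a tree that never copied at all.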
Note that I'm still not talking about data models etc.
>> Copies also make sense for some user
>> operations, like splitting a file's contents, or taking a file like
>> 'COPYING' that does not change often and putting it into other locations
>> or trees.
> It's not clear to me that we should use the same primitive to represent
> both those operations. The output of a split is two files with no
> common contents that are both related to the base file. The output of a
> copy is two files that have identical contents to the base file. In the
> first case, applying a merge from a pre-split tree should apply each
> change only once. But in the second case, in a merge from a pre-copy tree
> the changes would be applied twice: once to each file.
It's easier for a user to delete a 'deleted-region' conflict than to
manually repeat a merge that we didn't do for them. So in the
definitions up above I've erred on that side; users that want merges
from pre-split trees to apply to just one side can manually delete the
deleted-region conflicts. Later on we could look at detecting when a
conflict in a split file applied correctly in another branch of the
split; if it did and the conflict was a 'region deleted' conflict, we
could elide that conflict completely, with no data-loss implications. I
think that there is not enough of a win by having 'split vs copy'
defined to justify the complexity in explaining it, let alone
implementing it.
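The possible later refinement can be sketched in a few lines of Python. This is a toy illustration, not a design: the Outcome type, its status strings and the function name are all hypothetical, and the list is assumed to hold the results of applying one incoming change to each side of a split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Result of applying one incoming change to one side of a split."""
    path: str
    status: str  # "clean", "region-deleted" or "conflict"


def elide_split_conflicts(outcomes):
    """Drop 'region-deleted' conflicts when the same change applied cleanly
    on another side of the split; keep everything else untouched."""
    applied_somewhere = any(o.status == "clean" for o in outcomes)
    return [
        o for o in outcomes
        if not (applied_somewhere and o.status == "region-deleted")
    ]


outcomes = [
    Outcome("foo.c", "clean"),
    Outcome("foo-extracted.c", "region-deleted"),
]
print(elide_split_conflicts(outcomes))
```

If the change applied nowhere, every conflict is kept, so nothing is silently lost.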
> Finally, it's not at all clear that anyone really wants COPYING to be
> treated as the same everywhere. Because if COPYING changes, that would
> mean that everyone in a project had agreed to change the license. If
> you have two copies of COPYING, you probably have two sub-projects in a
> tree. So it's quite conceivable that one sub-project might change their
> license, while the other did not.
I've handled this in the above user instructions by giving the user
predictable behaviour: If the user wants to change all COPYING files
ever, they branch from before the first one was created, change just
COPYING, commit, then merge that wherever.
>> Two versioned paths become one: This is mostly covered in my text about
>> parallel imports. While not quite the same thing they are closely
>> related.
>
> Being the inverse of copies, there are many commands (e.g. merge,
> revert) that would need to handle this, just because of the copy support.
Right. Hopefully once we've agreed on the desired behaviour, we can find
a single model that will let this drop out fairly naturally.
>> Advanced support for copies seems to mostly mean merging, and seems to
>> require knowing more about what the copy means. Are they copying the
>> file to split it, or make a new copy of the same thing (like the gpl
>> example).
>
> It'd be fantastic if bzr could let the developer know about changes
> in lines coming from the original file. Since we can't predict what
> behavior is really wanted, a conflict could be enforced in such
> cases, even if the patch would apply cleanly, so that the user must
> go over and review what was changed. Probably not easy to implement,
> but being able to track which lines came from the copy (as mentioned
> for annotating, above) would be a first step.
Well, I don't agree that we should make extra conflicts just because. If
the behaviour is predictable, I don't think most users will want
conflicts when they are taking advantage of copy or combining features.
But yes, we could.
>> Or do you mean being able to say it after you've already committed
>> path/to/added?
>
> The latter (really, I mean a number of commits down the road). If we
> can say "That file in that other tree is really the same as this
> file", we should be able to say "That file that used to be in this
> tree is really the same as this file" too.
I don't think that this fits with the copy and combining use cases.
Specifically, we try not to carry around baggage that's not related to
the current tree; having arbitrary annotations that point anywhere in
history to define equivalence would imply an inability to discard old
baggage. It may be that we can do this once we get to the design point
though - I'm not ruling it out, just asking really - how important is
this, really? A file that is deleted in all live trees is not in anyone's
active set. If someone wants to have changes to that combine into
another file in the current tree, e.g. to do a forward ported bugfix,
then surely they can just merge it (which will conflict as the file was
deleted), reinstate the file and do 'combine' at this point. Which ends
up with the same result, but no special casing.
>> *file copies*
>>
>> I note your explanation but I think we can say more about what
>> "support file copies" really means, otherwise we can claim we do it
>> already :-)
>>
>> It would be elegant if copy+delete had the same effect as renaming. I
>> don't mean we should special-case them to be the same, or that this is
>> necessary. At least, if they do behave differently, it would be good
>> to have a clear reason why.
I think without special casing it we will certainly get the same rough
behaviour as a rename, following the definitions I've given. That said,
we might want to special case 'delete' to undo a copy in the working
tree, to make it become identical behaviour. An interesting case is when
a user does:
bzr commit -m 'base'
bzr cp A B
bzr commit -m 'cp'
bzr delete A
bzr commit -m 'delete'
Clearly here it's not a rename. Or is it? Perhaps we can handle this, but
it's too early I think to start talking about what that will mean, as
it will (to a large degree) be coupled to the implementation. I hesitate
to say that we want to require that to collapse to a rename when
considered across {base,delete} because it seems marginally useful, not
a matter of correctness, and thus not a large win to be balanced against
some unknown implementation cost.
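To see why the question arises at all, compare only the tree contents at the two endpoints. The Python sketch below is a toy model in which a tree is just a mapping from path to contents; the helper names are hypothetical and it says nothing about how history would be recorded.

```python
def cp(tree, src, dest):
    """Return a new tree with dest holding a copy of src's contents."""
    new = dict(tree)
    new[dest] = new[src]
    return new


def delete(tree, path):
    """Return a new tree without path."""
    new = dict(tree)
    del new[path]
    return new


def mv(tree, src, dest):
    """Return a new tree with src renamed to dest."""
    new = dict(tree)
    new[dest] = new.pop(src)
    return new


base = {"A": "contents"}
via_copy = delete(cp(base, "A", "B"), "A")
via_rename = mv(base, "A", "B")
print(via_copy == via_rename)
```

Looking only at {base,delete} the two trees are indistinguishable; any difference lies in the intermediate 'cp' commit, which is where implementation choices would come in.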
>> Advanced support for copies seems to mostly mean merging, and seems to
>> require knowing more about what the copy means. Are they copying the
>> file to split it, or make a new copy of the same thing (like the gpl
>> example).
I don't think we need to know what the copy means. Users are very capable
of getting what they want given reasonable primitives. While we could
annotate copies with a raft of meta-data, I think most users will
actually be happier having just one command to use (cp), which operates
in the 'obvious manner' - which I have attempted to spell out, so that
it is more than just obvious, it is documented.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.