Of renames, copying, identity, workflows and code review [LONG]

Wed Jan 23 18:08:18 GMT 2008

A very interesting read. I think you struck on a few interesting pieces. 
One bit is that I would highly recommend doing more commits. Having 1 
commit that has 50 changes touching 20 files is a bit much. I realize it 
happens. But it isn't *recommended* practice.

Bazaar recommends the workflow of creating a feature branch, doing lots 
of commits there, and then the merge back into mainline is the 
all-at-once feature merge.

Committing changes as you go gives better annotations, more obvious 
step-by-step of what went on, and also helps keep the developer from 
forgetting "why did I do X".

This isn't meant to detract from your discussion, though.

We are slowly considering per-file commit messages, which would also 
help give you the "overall change this commit introduces" separate from 
"this hunk in this file was changed because of X,Y and Z".

I think your idea has merit, and having per-hunk tracking could be 
interesting. I know Aaron has thought a lot about per-line, or more 
accurately edge tracking. (Tracking the relation on what line is next to 
what line.)

Inferring line identity is a bit tricky (is that the same line with a 
small change, or should it be considered a different line), but adding 
some UI to interact with the changes might help resolve that. There is 
always a concern about *not* conflicting when a conflict might be 
reasonable. If A modifies something and B moves it, it may do something 
entirely different in point B.

John
=:->

John Yates wrote:
>     "Je n'ai fait celle-ci plus longue que parceque je n'ai pas eu le
>     loisir de la faire plus courte."  (I have only made this [letter]
>     longer because I lacked the leisure [time] to make it shorter.)
>     -- Blaise Pascal
> 
> 
> 
> Introduction
> ------------
> Ever since Robert wrote
> 
>     "During the last git mega-thread here I noted that
>     'git uses heuristics, bzr uses explicit data *and*
>     heuristics' : There is no good reason to discard
>     /facts/ and try to guess them later."
> 
> I have continued to ponder that thought.
> 
> The crucial point that Robert seemed to make is that bzr has opted to
> explore a paradigm in which its persistent data structures may record
> /facts/ describing precisely how a branch's previous state morphed into
> the state being committed.  This contrasts with git's paradigm in which
> a commit is purely a snapshot and understanding what happened between
> two commits is treated as a process of reconstruction.  The difference
> between a bzr recorded fact and a traditional commit comment is that the
> recorded fact is expected to inform later operations (e.g. annotate and
> merge activities).  Crucially Robert said nothing about how facts might
> be collected.
> 
> All modern VCSs at the very least understand how to transform a file or
> namespace state into a next state via semantic-free insert and delete
> edit operations.  In such a setting replace is modeled simply as insert
> followed by delete at the same point.  Imputing deeper significance to
> these edit operators remains the domain of human interpretation.  To
> the extent that such interpretations are recorded at all it is most
> often within commit comments and email threads.
> 
> The limitations of a semantic-free insert and delete framework are
> widely understood.  Across the VCS development community various ad hoc
> extensions are being explored.  Darcs has introduced token replacement.
> Various VCSs are beginning to model file moves and renames.  Some even
> are experimenting with the notion of copies.  Still no single,
> overarching model has emerged.
> 
> To the extent that any operations richer than insert and delete would
> directly impact subsequent merges, no VCS project appears ready to turn
> over final responsibility for extracting and recording such operations
> to a mechanical differencing framework.  It is true that the UI portions
> of certain text differencing schemes are able to suggest text movement.
> Similarly, by noticing invariant checksums, some directory differencing
> schemes can match up a drop with an add suggesting that a rename may
> have occurred, or, by noticing the same name disappearing from one
> directory and reappearing in another suggesting that a move may have
> occurred.  Yet, to date, in spite of the existence of such heuristics,
> no VCS identifies and records such operations unbidden.  Does this state
> of affairs then represent a justification for introduction of imperative
> move and rename sub-commands, commands that both modify a VCS managed
> workspace and capture facts to be recorded in a subsequent commit?  No!
> 
> No VCS attempts to insinuate itself into editors so as to capture the
> inserts and deletes needed to transform a text file from its previous
> state to its new commit state.  It is well accepted that the editing of
> files can occur in so many different manners and via so many different
> tools as to make such an approach untenable.  Expecting users to perform
> all text editing via VCS supplied commands would be equally untenable.
> Universally VCSs infer insert and delete sequences by comparing the
> before and after states of text files.  Strangely the logic of that
> perspective seems to have failed to inform common design approaches to
> namespace editing.  Instead we see a proliferation of VCSs sprouting
> namespace delete and rename sub-commands.  Lacking any ClearCase-like
> system call interception capabilities these VCSs hope nonetheless to
> observe all namespace modifications at the moment of their occurrence.
> Given the prevalence of scripts, GUIs, IDEs, etc, the text file editing
> perspective mentioned earlier in this paragraph suggests that "it just
> ain't gonna happen".
> 
> Am I suggesting that recording renames is impractical?  No.  But more
> importantly I would argue that the form of the question betrays a rather
> limiting conceptualization of the actual design challenge.  Renaming is
> only one of a larger collection of refactoring edits that a developer
> may apply.  An incomplete list of members of that collection would
> include moving chunks of code within a file and between files, splitting
> files, moving files, renaming files, etc.  A good solution ought to
> provide a means of modeling all such edits and others as well.
> 
> 
> A gedanken experiment
> ---------------------
> It can be helpful to imagine having a scheme for recording refactorings
> and then to consider how having access to such facts might influence a
> merge strategy.  Consider file F under the following merge scenario:
> 
>    F1
>    | \
>    |  \
>    F2 F3
>     \  |
>      \ |
>       Fm
> 
> Assume further the each file state consists of four text regions (A, B,
> C, and X):
> 
>    F1: A1     F2: A2     F3: A3     Fm: Am
>        X1 --\     B2         X3         Bm
>        B1    \->  X2         B3         Xm
>        C1         C2         C3         Cm
> 
> The F1->F2 and F1->F3 edges represent some amount of local editing
> within each of the four regions.  Finally F1->F2 includes a swapping of
> the order of X and B.  (In passing I note that while this is clearly a
> swap most developers would describe a single movement: "I moved function
> foo so that it is now follows bar".)  Without a means of recording this
> movement of region X the F1->F2 edit would appear to drop X1 and insert
> X2.  In that case the Fm merge most likely would look something like:
> 
>    Fm: Am
>        X*
>        Bm
>        X2
>        Cm
> 
> where X* represents the conflicts between the deletion of region X along
> F1->F2 and the edits along X1->X3.  Obviously this is sub-optimal.  How
> can the algorithm do better if it has knowledge that span X1 in F1 maps
> to span X2 in F2?  The key is to use knowledge of that mapping to reduce
> the merge problem to a collection of mapping-free sub-problems:
> 
>   1) given the mapping's source span X1 in F1, compare F1:F3 to identify
>      a corresponding span X3 in F3
> 
>   2) carry-out the following pair of simple (mapping-free) merges
>      (the notation Fn-Xn represents file Fn with region Xn elided):
> 
>         F1-X1                       X1
>         |    \                      | \
>         |     \                     |  \
>         F2-X2  F3-X3                X2  X3
>          \     |                     \  |
>           \    |                      \ |
>            Fm-Xm                       Xm
> 
>   3) reintegrate merged region Xm at the mapping's destination:
> 
>         Fm-Xm:  Am
>                 Bm
>                  <-- Xm
>                 Cm
> 
> The key thing to notice in this process is that it did not require
> region X to remain unchanged along either F1->F2 or F1->F3.  Thus the
> recorded /fact/ was not that a block of text was picked up and deposited
> verbatim at a different position.  Rather it is a mapping: span X1 in F1
> corresponds to span X2 in F2.  The concept generalizes easily to
> inter-file mappings, thereby modeling the movement of code between
> files.  Renaming or movement of entire files is nothing more than a
> special case:
> 
>   1) insert the new file with its full contents
>   2) define a mapping between the full span of the old and new files
>   3) delete the old file
> 
> And just to belabor the point, notice that this mechanism easily models
> the case where a file is both edited and renamed or moved prior to the
> recording of a commit.  Such a capability is crucial if one expects to
> hew to a "never commit broken changes" discipline.
> 
> 
> Capturing mappings or commit preparation as an audit process
> ------------------------------------------------------------
> Having dealt above with the deficiencies of today's UI for recording
> file moves and renames I now to propose an alternative.  I maintain that
> the appropriate time to do this is at completion of a development phase
> when the dust has settled; when, irrespective of the tools used,
> modification of the workspace has ceased; when the user has turned his
> attention to recording a commit.  This is the point when any diligent
> developer follows some form of audit process to confirm that what he is
> about to commit is indeed what he expects and that his commit commentary
> is complete.
> 
> So long as the only additional piece of information that a user supplied
> at the time of a commit remained an opaque comment a text editor was all
> the tooling required.  If we now intend that our user identify those
> changes in his workspace that will be recorded as range mapping then we
> need to consider whether some form of tool support is appropriate.  And
> while we are at it we can consider whether introduction of a tool to
> support preparing commits might not afford opportunities to improve in
> other ways the quality of the information preserved within a commit.
> 
> Prompted by a number of code reviews I have performed at work recently I
> already had some thoughts on this subject.  They crystallized with the
> appearance of the "collapsing stuff" thread on the bzr mailing list:
> 
> http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/36050
> 
> Allow me to sketch my recent reviewing experience.  I know that in
> general outline it has many parallels in other developers experiences.
> 
> Because we have cubicles on the same floor I reviewed another engineer's
> work in his cubicle.  He had two directory tree on his machine, one
> being the state of the code before he started implementing a small
> feature, one with his completed work.  We sat in front of his screen and
> he brought up a meld comparison of the two trees.  He then brought up
> the text of his proposed commit message.  This described changes he had
> made:
> 
>  - changed a data member from bool to unsigned, with obvious rename;
>    this implied that all sites mentioning the old data member had to be
>    adjusted to account for the name change, and many needed to change
>    in a straight-forward way to account for the type change
> 
>  - adjusted the signature of a pure virtual method in an interface;
>    this implied simple mechanical changes to each implementation of that
>    virtual method
> 
>  - recognized that as part of the new feature every site where a
>    particular data member had been assigned a second unconditional
>    assignment and a conditional assignment were needed; created a new
>    inline method to abstract this pattern; adjusted all relevant sites
> 
>  - added one new public method to a class and two callers in other parts
>    of the system.
> 
> All in all this represented maybe 50 sites that had been changed in
> nearly 20 files.  With my colleague's proposed commit message as a road
> map we dove in, reviewing changes in an order chosen by meld, namely by
> ascending line number within a file, lexically within a directory and
> top down within a tree.  As we progressed through the change sites my
> colleague would explain / justify each change by referring back to one
> of the points in his proposed commit message.  Such a flow should be
> very familiar to those who review patches submitted by email.  Whether
> one simply reviews the patch directly or applies it to a tree and uses
> some graphical comparison tool the order of visiting change site remains
> chaotic.
> 
> What I found myself longing for was an ability to choose each item in
> the commit message in turn and visit each site changed on its behalf.
> This would permit me to internalize and verify the initial data member
> rename and type change without having it interleaved with all of the
> other changes.  Having dispensed with all changes grouped under that
> justification I would then turn my attention to the next conceptual
> step in this change sequence.
> 
> Today such a review flow is impossible because a commit message is an
> opaque text fragment.  But let us imaging that a commit message actually
> contained a number of sections that doubled as tags.  Further let us
> imaging that as part of recording a commit we could associate arbitrary
> ranges of files with one or more of those tags.  Such a framework could
> support just the sort of review I fantasize.  Further, if text ranges in
> different commits had the same tag and if grouping was based on textual
> identity of tag text then reviewing the collapsed projection of a series
> of commits would continue to group changes appropriately.
> 
> I have no proposal for what a UI for preparing such structured commits
> might look like.  I would expect it to use heuristics to suggest intra-
> and inter-file textual moves, as well as full file moves and renames.
> A command-line oriented version would likely present hunks and a menu of
> choices much as the shelve plugin does.
> 
> What I have done is experiment with extending the unified diff format.
> It seems it could easily convey extra tag information without breaking
> any existing tools.  The format I have played with introduces a set of
> locally numbered tag fragments between a file header and the first hunk.
> After that the first line of any hunk, following the second pair of
> at-signs, includes zero or more space separated tag references:
> 
> 
> --- file1.txt	2008-01-02 17:32:11.000000000 -0500
> +++ file2.txt	2008-01-02 17:32:42.000000000 -0500
> 5 # description of a first change (a tag)
> # a continuation of that description
> 21 # description of a second change
> 2 # description of a third change
> @@ ... @@ 5
>   ...HUNK BODY...
> @@ ... @@ 21 5
>   ...HUNK BODY...
> --- file3.txt	2008-01-02 17:32:11.000000000 -0500
> +++ file4.txt	2008-01-02 17:32:42.000000000 -0500
> @@ ... @@ 2
>   ...HUNK BODY...
> @@ ... @@ 2 5
>   ...HUNK BODY...
> 
> 
> What both the range mappings discussed in the first part of this note
> and these structured commit messages have in common is a mechanism for
> identifying ranges within files.  In the first case a range in a file
> being committed is associated with a basis range.  In the second case a
> range is associated with a distinguished fragment of a commit message.
> The similarity of these mechanics leads me to believe that there ought
> to be a single UI to support preparing both forms of annotation.
> 
> 
> Conclusion
> ----------
> In closing let me suggest that any sort of functionality along these
> lines would reinforce to bzr reviewers the importance of recording rich
> information at commit time and would help to differentiate bzr from git
> and hg.
> 
> /john
> 
>