Of renames, copying, identity, workflows and code review [LONG]

Tue Jan 22 22:33:19 GMT 2008

"Je n'ai fait celle-ci plus longue que parceque je n'ai pas eu le
    loisir de la faire plus courte."  (I have only made this [letter]
    longer because I lacked the leisure [time] to make it shorter.)
    -- Blaise Pascal



Introduction
------------
Ever since Robert wrote

    "During the last git mega-thread here I noted that
    'git uses heuristics, bzr uses explicit data *and*
    heuristics' : There is no good reason to discard
    /facts/ and try to guess them later."

I have continued to ponder that thought.

The crucial point that Robert seemed to make is that bzr has opted to
explore a paradigm in which its persistent data structures may record
/facts/ describing precisely how a branch's previous state morphed into
the state being committed.  This contrasts with git's paradigm in which
a commit is purely a snapshot and understanding what happened between
two commits is treated as a process of reconstruction.  The difference
between a bzr recorded fact and a traditional commit comment is that the
recorded fact is expected to inform later operations (e.g. annotate and
merge activities).  Crucially Robert said nothing about how facts might
be collected.

All modern VCSs at the very least understand how to transform a file or
namespace state into a next state via semantic-free insert and delete
edit operations.  In such a setting replace is modeled simply as insert
followed by delete at the same point.  Imputing deeper significance to
these edit operators remains the domain of human interpretation.  To
the extent that such interpretations are recorded at all it is most
often within commit comments and email threads.

The limitations of a semantic-free insert and delete framework are
widely understood.  Across the VCS development community various ad hoc
extensions are being explored.  Darcs has introduced token replacement.
Various VCSs are beginning to model file moves and renames.  Some even
are experimenting with the notion of copies.  Still no single,
overarching model has emerged.

To the extent that any operations richer than insert and delete would
directly impact subsequent merges, no VCS project appears ready to turn
over final responsibility for extracting and recording such operations
to a mechanical differencing framework.  It is true that the UI portions
of certain text differencing schemes are able to suggest text movement.
Similarly, by noticing invariant checksums, some directory differencing
schemes can match up a drop with an add suggesting that a rename may
have occurred, or, by noticing the same name disappearing from one
directory and reappearing in another suggesting that a move may have
occurred.  Yet, to date, in spite of the existence of such heuristics,
no VCS identifies and records such operations unbidden.  Does this state
of affairs then represent a justification for introduction of imperative
move and rename sub-commands, commands that both modify a VCS managed
workspace and capture facts to be recorded in a subsequent commit?  No!

No VCS attempts to insinuate itself into editors so as to capture the
inserts and deletes needed to transform a text file from its previous
state to its new commit state.  It is well accepted that the editing of
files can occur in so many different manners and via so many different
tools as to make such an approach untenable.  Expecting users to perform
all text editing via VCS supplied commands would be equally untenable.
Universally VCSs infer insert and delete sequences by comparing the
before and after states of text files.  Strangely the logic of that
perspective seems to have failed to inform common design approaches to
namespace editing.  Instead we see a proliferation of VCSs sprouting
namespace delete and rename sub-commands.  Lacking any ClearCase-like
system call interception capabilities these VCSs hope nonetheless to
observe all namespace modifications at the moment of their occurrence.
Given the prevalence of scripts, GUIs, IDEs, etc, the text file editing
perspective mentioned earlier in this paragraph suggests that "it just
ain't gonna happen".

Am I suggesting that recording renames is impractical?  No.  But more
importantly I would argue that the form of the question betrays a rather
limiting conceptualization of the actual design challenge.  Renaming is
only one of a larger collection of refactoring edits that a developer
may apply.  An incomplete list of members of that collection would
include moving chunks of code within a file and between files, splitting
files, moving files, renaming files, etc.  A good solution ought to
provide a means of modeling all such edits and others as well.


A gedanken experiment
---------------------
It can be helpful to imagine having a scheme for recording refactorings
and then to consider how having access to such facts might influence a
merge strategy.  Consider file F under the following merge scenario:

   F1
   | \
   |  \
   F2 F3
    \  |
     \ |
      Fm

Assume further the each file state consists of four text regions (A, B,
C, and X):

   F1: A1     F2: A2     F3: A3     Fm: Am
       X1 --\     B2         X3         Bm
       B1    \->  X2         B3         Xm
       C1         C2         C3         Cm

The F1->F2 and F1->F3 edges represent some amount of local editing
within each of the four regions.  Finally F1->F2 includes a swapping of
the order of X and B.  (In passing I note that while this is clearly a
swap most developers would describe a single movement: "I moved function
foo so that it is now follows bar".)  Without a means of recording this
movement of region X the F1->F2 edit would appear to drop X1 and insert
X2.  In that case the Fm merge most likely would look something like:

   Fm: Am
       X*
       Bm
       X2
       Cm

where X* represents the conflicts between the deletion of region X along
F1->F2 and the edits along X1->X3.  Obviously this is sub-optimal.  How
can the algorithm do better if it has knowledge that span X1 in F1 maps
to span X2 in F2?  The key is to use knowledge of that mapping to reduce
the merge problem to a collection of mapping-free sub-problems:

  1) given the mapping's source span X1 in F1, compare F1:F3 to identify
     a corresponding span X3 in F3

  2) carry-out the following pair of simple (mapping-free) merges
     (the notation Fn-Xn represents file Fn with region Xn elided):

        F1-X1                       X1
        |    \                      | \
        |     \                     |  \
        F2-X2  F3-X3                X2  X3
         \     |                     \  |
          \    |                      \ |
           Fm-Xm                       Xm

  3) reintegrate merged region Xm at the mapping's destination:

        Fm-Xm:  Am
                Bm
                 <-- Xm
                Cm

The key thing to notice in this process is that it did not require
region X to remain unchanged along either F1->F2 or F1->F3.  Thus the
recorded /fact/ was not that a block of text was picked up and deposited
verbatim at a different position.  Rather it is a mapping: span X1 in F1
corresponds to span X2 in F2.  The concept generalizes easily to
inter-file mappings, thereby modeling the movement of code between
files.  Renaming or movement of entire files is nothing more than a
special case:

  1) insert the new file with its full contents
  2) define a mapping between the full span of the old and new files
  3) delete the old file

And just to belabor the point, notice that this mechanism easily models
the case where a file is both edited and renamed or moved prior to the
recording of a commit.  Such a capability is crucial if one expects to
hew to a "never commit broken changes" discipline.


Capturing mappings or commit preparation as an audit process
------------------------------------------------------------
Having dealt above with the deficiencies of today's UI for recording
file moves and renames I now to propose an alternative.  I maintain that
the appropriate time to do this is at completion of a development phase
when the dust has settled; when, irrespective of the tools used,
modification of the workspace has ceased; when the user has turned his
attention to recording a commit.  This is the point when any diligent
developer follows some form of audit process to confirm that what he is
about to commit is indeed what he expects and that his commit commentary
is complete.

So long as the only additional piece of information that a user supplied
at the time of a commit remained an opaque comment a text editor was all
the tooling required.  If we now intend that our user identify those
changes in his workspace that will be recorded as range mapping then we
need to consider whether some form of tool support is appropriate.  And
while we are at it we can consider whether introduction of a tool to
support preparing commits might not afford opportunities to improve in
other ways the quality of the information preserved within a commit.

Prompted by a number of code reviews I have performed at work recently I
already had some thoughts on this subject.  They crystallized with the
appearance of the "collapsing stuff" thread on the bzr mailing list:

http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/36050

Allow me to sketch my recent reviewing experience.  I know that in
general outline it has many parallels in other developers experiences.

Because we have cubicles on the same floor I reviewed another engineer's
work in his cubicle.  He had two directory tree on his machine, one
being the state of the code before he started implementing a small
feature, one with his completed work.  We sat in front of his screen and
he brought up a meld comparison of the two trees.  He then brought up
the text of his proposed commit message.  This described changes he had
made:

 - changed a data member from bool to unsigned, with obvious rename;
   this implied that all sites mentioning the old data member had to be
   adjusted to account for the name change, and many needed to change
   in a straight-forward way to account for the type change

 - adjusted the signature of a pure virtual method in an interface;
   this implied simple mechanical changes to each implementation of that
   virtual method

 - recognized that as part of the new feature every site where a
   particular data member had been assigned a second unconditional
   assignment and a conditional assignment were needed; created a new
   inline method to abstract this pattern; adjusted all relevant sites

 - added one new public method to a class and two callers in other parts
   of the system.

All in all this represented maybe 50 sites that had been changed in
nearly 20 files.  With my colleague's proposed commit message as a road
map we dove in, reviewing changes in an order chosen by meld, namely by
ascending line number within a file, lexically within a directory and
top down within a tree.  As we progressed through the change sites my
colleague would explain / justify each change by referring back to one
of the points in his proposed commit message.  Such a flow should be
very familiar to those who review patches submitted by email.  Whether
one simply reviews the patch directly or applies it to a tree and uses
some graphical comparison tool the order of visiting change site remains
chaotic.

What I found myself longing for was an ability to choose each item in
the commit message in turn and visit each site changed on its behalf.
This would permit me to internalize and verify the initial data member
rename and type change without having it interleaved with all of the
other changes.  Having dispensed with all changes grouped under that
justification I would then turn my attention to the next conceptual
step in this change sequence.

Today such a review flow is impossible because a commit message is an
opaque text fragment.  But let us imaging that a commit message actually
contained a number of sections that doubled as tags.  Further let us
imaging that as part of recording a commit we could associate arbitrary
ranges of files with one or more of those tags.  Such a framework could
support just the sort of review I fantasize.  Further, if text ranges in
different commits had the same tag and if grouping was based on textual
identity of tag text then reviewing the collapsed projection of a series
of commits would continue to group changes appropriately.

I have no proposal for what a UI for preparing such structured commits
might look like.  I would expect it to use heuristics to suggest intra-
and inter-file textual moves, as well as full file moves and renames.
A command-line oriented version would likely present hunks and a menu of
choices much as the shelve plugin does.

What I have done is experiment with extending the unified diff format.
It seems it could easily convey extra tag information without breaking
any existing tools.  The format I have played with introduces a set of
locally numbered tag fragments between a file header and the first hunk.
After that the first line of any hunk, following the second pair of
at-signs, includes zero or more space separated tag references:

--- file1.txt	2008-01-02 17:32:11.000000000 -0500
+++ file2.txt	2008-01-02 17:32:42.000000000 -0500
5 # description of a first change (a tag)
# a continuation of that description
21 # description of a second change
2 # description of a third change
@@ ... @@ 5
  ...HUNK BODY...
@@ ... @@ 21 5
  ...HUNK BODY...
--- file3.txt	2008-01-02 17:32:11.000000000 -0500
+++ file4.txt	2008-01-02 17:32:42.000000000 -0500
@@ ... @@ 2
  ...HUNK BODY...
@@ ... @@ 2 5
  ...HUNK BODY...


What both the range mappings discussed in the first part of this note
and these structured commit messages have in common is a mechanism for
identifying ranges within files.  In the first case a range in a file
being committed is associated with a basis range.  In the second case a
range is associated with a distinguished fragment of a commit message.
The similarity of these mechanics leads me to believe that there ought
to be a single UI to support preparing both forms of annotation.


Conclusion
----------
In closing let me suggest that any sort of functionality along these
lines would reinforce to bzr reviewers the importance of recording rich
information at commit time and would help to differentiate bzr from git
and hg.

/john