Of renames, copying, identity, workflows and code review [LONG]
John Arbash Meinel
john at arbash-meinel.com
Wed Jan 23 18:08:18 GMT 2008
A very interesting read. I think you struck on a few interesting pieces.
One bit is that I would highly recommend doing more commits. Having 1
commit that has 50 changes touching 20 files is a bit much. I realize it
happens. But it isn't *recommended* practice.
Bazaar recommends the workflow of creating a feature branch, doing lots
of commits there, and then the merge back into mainline is the
all-at-once feature merge.
Committing changes as you go gives better annotations, more obvious
step-by-step of what went on, and also helps keep the developer from
forgetting "why did I do X".
This isn't meant to detract from your discussion, though.
We are slowly considering per-file commit messages, which would also
help give you the "overall change this commit introduces" separate from
"this hunk in this file was changed because of X,Y and Z".
I think your idea has merit, and having per-hunk tracking could be
interesting. I know Aaron has thought a lot about per-line, or more
accurately edge tracking. (Tracking the relation on what line is next to
what line.)
Inferring line identity is a bit tricky (is that the same line with a
small change, or should it be considered a different line), but adding
some UI to interact with the changes might help resolve that. There is
always a concern about *not* conflicting when a conflict might be
reasonable. If A modifies something and B moves it, it may do something
entirely different in point B.
John
=:->
John Yates wrote:
> "Je n'ai fait celle-ci plus longue que parceque je n'ai pas eu le
> loisir de la faire plus courte." (I have only made this [letter]
> longer because I lacked the leisure [time] to make it shorter.)
> -- Blaise Pascal
>
>
>
> Introduction
> ------------
> Ever since Robert wrote
>
> "During the last git mega-thread here I noted that
> 'git uses heuristics, bzr uses explicit data *and*
> heuristics' : There is no good reason to discard
> /facts/ and try to guess them later."
>
> I have continued to ponder that thought.
>
> The crucial point that Robert seemed to make is that bzr has opted to
> explore a paradigm in which its persistent data structures may record
> /facts/ describing precisely how a branch's previous state morphed into
> the state being committed. This contrasts with git's paradigm in which
> a commit is purely a snapshot and understanding what happened between
> two commits is treated as a process of reconstruction. The difference
> between a bzr recorded fact and a traditional commit comment is that the
> recorded fact is expected to inform later operations (e.g. annotate and
> merge activities). Crucially Robert said nothing about how facts might
> be collected.
>
> All modern VCSs at the very least understand how to transform a file or
> namespace state into a next state via semantic-free insert and delete
> edit operations. In such a setting replace is modeled simply as insert
> followed by delete at the same point. Imputing deeper significance to
> these edit operators remains the domain of human interpretation. To
> the extent that such interpretations are recorded at all it is most
> often within commit comments and email threads.
>
> The limitations of a semantic-free insert and delete framework are
> widely understood. Across the VCS development community various ad hoc
> extensions are being explored. Darcs has introduced token replacement.
> Various VCSs are beginning to model file moves and renames. Some even
> are experimenting with the notion of copies. Still no single,
> overarching model has emerged.
>
> To the extent that any operations richer than insert and delete would
> directly impact subsequent merges, no VCS project appears ready to turn
> over final responsibility for extracting and recording such operations
> to a mechanical differencing framework. It is true that the UI portions
> of certain text differencing schemes are able to suggest text movement.
> Similarly, by noticing invariant checksums, some directory differencing
> schemes can match up a drop with an add suggesting that a rename may
> have occurred, or, by noticing the same name disappearing from one
> directory and reappearing in another suggesting that a move may have
> occurred. Yet, to date, in spite of the existence of such heuristics,
> no VCS identifies and records such operations unbidden. Does this state
> of affairs then represent a justification for introduction of imperative
> move and rename sub-commands, commands that both modify a VCS managed
> workspace and capture facts to be recorded in a subsequent commit? No!
>
> No VCS attempts to insinuate itself into editors so as to capture the
> inserts and deletes needed to transform a text file from its previous
> state to its new commit state. It is well accepted that the editing of
> files can occur in so many different manners and via so many different
> tools as to make such an approach untenable. Expecting users to perform
> all text editing via VCS supplied commands would be equally untenable.
> Universally VCSs infer insert and delete sequences by comparing the
> before and after states of text files. Strangely the logic of that
> perspective seems to have failed to inform common design approaches to
> namespace editing. Instead we see a proliferation of VCSs sprouting
> namespace delete and rename sub-commands. Lacking any ClearCase-like
> system call interception capabilities these VCSs hope nonetheless to
> observe all namespace modifications at the moment of their occurrence.
> Given the prevalence of scripts, GUIs, IDEs, etc, the text file editing
> perspective mentioned earlier in this paragraph suggests that "it just
> ain't gonna happen".
>
> Am I suggesting that recording renames is impractical? No. But more
> importantly I would argue that the form of the question betrays a rather
> limiting conceptualization of the actual design challenge. Renaming is
> only one of a larger collection of refactoring edits that a developer
> may apply. An incomplete list of members of that collection would
> include moving chunks of code within a file and between files, splitting
> files, moving files, renaming files, etc. A good solution ought to
> provide a means of modeling all such edits and others as well.
>
>
> A gedanken experiment
> ---------------------
> It can be helpful to imagine having a scheme for recording refactorings
> and then to consider how having access to such facts might influence a
> merge strategy. Consider file F under the following merge scenario:
>
> F1
> | \
> | \
> F2 F3
> \ |
> \ |
> Fm
>
> Assume further the each file state consists of four text regions (A, B,
> C, and X):
>
> F1: A1 F2: A2 F3: A3 Fm: Am
> X1 --\ B2 X3 Bm
> B1 \-> X2 B3 Xm
> C1 C2 C3 Cm
>
> The F1->F2 and F1->F3 edges represent some amount of local editing
> within each of the four regions. Finally F1->F2 includes a swapping of
> the order of X and B. (In passing I note that while this is clearly a
> swap most developers would describe a single movement: "I moved function
> foo so that it is now follows bar".) Without a means of recording this
> movement of region X the F1->F2 edit would appear to drop X1 and insert
> X2. In that case the Fm merge most likely would look something like:
>
> Fm: Am
> X*
> Bm
> X2
> Cm
>
> where X* represents the conflicts between the deletion of region X along
> F1->F2 and the edits along X1->X3. Obviously this is sub-optimal. How
> can the algorithm do better if it has knowledge that span X1 in F1 maps
> to span X2 in F2? The key is to use knowledge of that mapping to reduce
> the merge problem to a collection of mapping-free sub-problems:
>
> 1) given the mapping's source span X1 in F1, compare F1:F3 to identify
> a corresponding span X3 in F3
>
> 2) carry-out the following pair of simple (mapping-free) merges
> (the notation Fn-Xn represents file Fn with region Xn elided):
>
> F1-X1 X1
> | \ | \
> | \ | \
> F2-X2 F3-X3 X2 X3
> \ | \ |
> \ | \ |
> Fm-Xm Xm
>
> 3) reintegrate merged region Xm at the mapping's destination:
>
> Fm-Xm: Am
> Bm
> <-- Xm
> Cm
>
> The key thing to notice in this process is that it did not require
> region X to remain unchanged along either F1->F2 or F1->F3. Thus the
> recorded /fact/ was not that a block of text was picked up and deposited
> verbatim at a different position. Rather it is a mapping: span X1 in F1
> corresponds to span X2 in F2. The concept generalizes easily to
> inter-file mappings, thereby modeling the movement of code between
> files. Renaming or movement of entire files is nothing more than a
> special case:
>
> 1) insert the new file with its full contents
> 2) define a mapping between the full span of the old and new files
> 3) delete the old file
>
> And just to belabor the point, notice that this mechanism easily models
> the case where a file is both edited and renamed or moved prior to the
> recording of a commit. Such a capability is crucial if one expects to
> hew to a "never commit broken changes" discipline.
>
>
> Capturing mappings or commit preparation as an audit process
> ------------------------------------------------------------
> Having dealt above with the deficiencies of today's UI for recording
> file moves and renames I now to propose an alternative. I maintain that
> the appropriate time to do this is at completion of a development phase
> when the dust has settled; when, irrespective of the tools used,
> modification of the workspace has ceased; when the user has turned his
> attention to recording a commit. This is the point when any diligent
> developer follows some form of audit process to confirm that what he is
> about to commit is indeed what he expects and that his commit commentary
> is complete.
>
> So long as the only additional piece of information that a user supplied
> at the time of a commit remained an opaque comment a text editor was all
> the tooling required. If we now intend that our user identify those
> changes in his workspace that will be recorded as range mapping then we
> need to consider whether some form of tool support is appropriate. And
> while we are at it we can consider whether introduction of a tool to
> support preparing commits might not afford opportunities to improve in
> other ways the quality of the information preserved within a commit.
>
> Prompted by a number of code reviews I have performed at work recently I
> already had some thoughts on this subject. They crystallized with the
> appearance of the "collapsing stuff" thread on the bzr mailing list:
>
> http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/36050
>
> Allow me to sketch my recent reviewing experience. I know that in
> general outline it has many parallels in other developers experiences.
>
> Because we have cubicles on the same floor I reviewed another engineer's
> work in his cubicle. He had two directory tree on his machine, one
> being the state of the code before he started implementing a small
> feature, one with his completed work. We sat in front of his screen and
> he brought up a meld comparison of the two trees. He then brought up
> the text of his proposed commit message. This described changes he had
> made:
>
> - changed a data member from bool to unsigned, with obvious rename;
> this implied that all sites mentioning the old data member had to be
> adjusted to account for the name change, and many needed to change
> in a straight-forward way to account for the type change
>
> - adjusted the signature of a pure virtual method in an interface;
> this implied simple mechanical changes to each implementation of that
> virtual method
>
> - recognized that as part of the new feature every site where a
> particular data member had been assigned a second unconditional
> assignment and a conditional assignment were needed; created a new
> inline method to abstract this pattern; adjusted all relevant sites
>
> - added one new public method to a class and two callers in other parts
> of the system.
>
> All in all this represented maybe 50 sites that had been changed in
> nearly 20 files. With my colleague's proposed commit message as a road
> map we dove in, reviewing changes in an order chosen by meld, namely by
> ascending line number within a file, lexically within a directory and
> top down within a tree. As we progressed through the change sites my
> colleague would explain / justify each change by referring back to one
> of the points in his proposed commit message. Such a flow should be
> very familiar to those who review patches submitted by email. Whether
> one simply reviews the patch directly or applies it to a tree and uses
> some graphical comparison tool the order of visiting change site remains
> chaotic.
>
> What I found myself longing for was an ability to choose each item in
> the commit message in turn and visit each site changed on its behalf.
> This would permit me to internalize and verify the initial data member
> rename and type change without having it interleaved with all of the
> other changes. Having dispensed with all changes grouped under that
> justification I would then turn my attention to the next conceptual
> step in this change sequence.
>
> Today such a review flow is impossible because a commit message is an
> opaque text fragment. But let us imaging that a commit message actually
> contained a number of sections that doubled as tags. Further let us
> imaging that as part of recording a commit we could associate arbitrary
> ranges of files with one or more of those tags. Such a framework could
> support just the sort of review I fantasize. Further, if text ranges in
> different commits had the same tag and if grouping was based on textual
> identity of tag text then reviewing the collapsed projection of a series
> of commits would continue to group changes appropriately.
>
> I have no proposal for what a UI for preparing such structured commits
> might look like. I would expect it to use heuristics to suggest intra-
> and inter-file textual moves, as well as full file moves and renames.
> A command-line oriented version would likely present hunks and a menu of
> choices much as the shelve plugin does.
>
> What I have done is experiment with extending the unified diff format.
> It seems it could easily convey extra tag information without breaking
> any existing tools. The format I have played with introduces a set of
> locally numbered tag fragments between a file header and the first hunk.
> After that the first line of any hunk, following the second pair of
> at-signs, includes zero or more space separated tag references:
>
>
> --- file1.txt 2008-01-02 17:32:11.000000000 -0500
> +++ file2.txt 2008-01-02 17:32:42.000000000 -0500
> 5 # description of a first change (a tag)
> # a continuation of that description
> 21 # description of a second change
> 2 # description of a third change
> @@ ... @@ 5
> ...HUNK BODY...
> @@ ... @@ 21 5
> ...HUNK BODY...
> --- file3.txt 2008-01-02 17:32:11.000000000 -0500
> +++ file4.txt 2008-01-02 17:32:42.000000000 -0500
> @@ ... @@ 2
> ...HUNK BODY...
> @@ ... @@ 2 5
> ...HUNK BODY...
>
>
> What both the range mappings discussed in the first part of this note
> and these structured commit messages have in common is a mechanism for
> identifying ranges within files. In the first case a range in a file
> being committed is associated with a basis range. In the second case a
> range is associated with a distinguished fragment of a commit message.
> The similarity of these mechanics leads me to believe that there ought
> to be a single UI to support preparing both forms of annotation.
>
>
> Conclusion
> ----------
> In closing let me suggest that any sort of functionality along these
> lines would reinforce to bzr reviewers the importance of recording rich
> information at commit time and would help to differentiate bzr from git
> and hg.
>
> /john
>
>
More information about the bazaar
mailing list