Future bzr features: split file tracking and type-specific merging

Fri Apr 28 07:39:41 BST 2006

On Thu, Apr 27, 2006 at 22:43:45 -0500, John Arbash Meinel wrote:
> Andrew Lambe wrote:
> > From second hand discussions of git, at its lowest level, git is mainly
> > versioning "chunks of content" based on its hash rather than whole
> > files.
> > This creates a lot of bloat which is why git is storage inefficient,
> > and it can be difficult to actually use all that tracking.
> > This kind of brute force code tracking may be good or necessary for
> > Linus's kernel management (which is why he created it), but bzr's goals
> > of flexibility, efficiency, and ease-of-use are not really compatible
> > with this approach.
> > 
> > All I really want for tracking file splits is an easier way to audit
> > the origin of a portion of code.
> > For bzr it could be as simple as:
> >     bzr add --split-from original.file new.file.with.code.from.original.file
> > This could store a property indicating that for the revision about to
> > be committed some or all of "new.file.with.code.from.original.file" was
> > once part of "original.file".
> > This property could then be used by plugins for various purposes and
> > could be noted on file annotations.
> 
> So how does git track which chunk is which? Very interesting, though.

GIT is basically the monotone idea. Each chunk is a (list of) parent
chunk id(s) plus data, and corresponds to **1 file revision**. It is
identified by its sha1 hash. There is a special file (a manifest in
monotone-speak; I am not sure git sticks to that name) that contains a map
between paths and chunk ids. The chunk ID of the manifest then represents a
revision.
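
To make the scheme described above concrete, here is a minimal Python
sketch. It models only what this message describes (a chunk is parent ids
plus data, named by its sha1, with a manifest mapping paths to chunk ids);
it is not git's or monotone's actual on-disk format, and the names
make_chunk, Store, put and commit are made up for illustration.

```python
import hashlib


def make_chunk(parents, data):
    """Serialize parent ids plus data and name the result by its SHA-1."""
    # Assumed serialization: one "parent <id>" line per parent, a blank
    # line, then the raw data. The real tools use their own formats.
    header = "".join(f"parent {p}\n" for p in parents).encode()
    body = header + b"\n" + data
    return hashlib.sha1(body).hexdigest(), body


class Store:
    """Hypothetical in-memory store keyed by chunk id."""

    def __init__(self):
        self.chunks = {}

    def put(self, parents, data):
        """Store one file revision and return its chunk id."""
        cid, body = make_chunk(parents, data)
        self.chunks[cid] = body
        return cid

    def commit(self, parent_revisions, path_to_chunk):
        """Store a manifest (path -> chunk id); its chunk id names the revision."""
        lines = [f"{cid} {path}" for path, cid in sorted(path_to_chunk.items())]
        return self.put(parent_revisions, "\n".join(lines).encode())


store = Store()
v1 = store.put([], b"hello\n")
rev1 = store.commit([], {"original.file": v1})
v2 = store.put([v1], b"hello\nworld\n")
rev2 = store.commit([rev1], {"original.file": v2})
```

Because an id is derived only from parents and content, storing the same
thing twice yields the same id, and any change to a file changes the
manifest and therefore the revision id.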

Thus GIT can do file copies similarly to subversion and IIRC has the
same problem merging them. In fact I am not sure GIT even attempts to
do rename-sensitive merging yet.

> I'm not sure what we could do for bzr, from the simple standpoint that
> each file has a unique identifier, and that is kind of a key feature.
> It's how we handle rename-sensitive merging, and a few other nice things.
> 
> The only thing I could think of goes along with ID aliases. Where you
> might say "after this revision file_id A == file_id B".
> It has been asked in the past to allow file_id aliases (so if I add a
> particular file, and you add the same file in a different branch, we can
> merge and just identify them as the same file).
> 
> But it is kind of the inverse of the proposed aliases. Instead of 1 file
> having 2 ids, it is 2 files having 1 id [sort of].
> 
> Just thinking about it, I would say we probably *could* store some sort
> of backpointer, which just says that this file was spawned from that file.
> 
> So possible, but the devil would be in the details.

A backpointer can surely be stored as a property once versioned inventory
entry properties get in, for use by other tools. As long as we leave the
original file with its original ID and give the 'fork' a new ID, merge
is ok.
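
A rough Python sketch of that backpointer idea follows. Everything in it
is hypothetical: versioned inventory entry properties do not exist yet, so
Entry, split_from and the "split-from" property name are assumptions for
illustration, not bzrlib API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical inventory entry carrying free-form properties."""
    file_id: str
    path: str
    properties: dict = field(default_factory=dict)


def split_from(original, new_path):
    """Create the 'fork': a new file id plus a backpointer to the original."""
    # The original entry is left untouched, so merging it is unaffected.
    new_id = f"{new_path}-{uuid.uuid4().hex}"
    return Entry(new_id, new_path, {"split-from": original.file_id})


original = Entry("original.file-0001", "original.file")
fork = split_from(original, "new.file.with.code.from.original.file")
```

The property is purely advisory here: merge keeps working on file ids as
before, and only tools that care (annotate, plugins) need to read it.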

There was a thread recently talking about 'line ids'. The basic idea is
that a line can be identified by file id + revision where it was added +
line number in that revision. Perhaps if we recorded a 'code moved'
metadatum (from revision xy, lines a through b), the merge algorithm
could even make use of that -- but there is a long way to go (it needs
some good heuristics to guess the moves and such).
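
As a sketch of what such records might look like, assuming 1-based line
numbers and an inclusive range (LineId and CodeMoved are hypothetical
names, and nothing here attempts the hard part, guessing the moves):

```python
from typing import NamedTuple


class LineId(NamedTuple):
    """A line named by file id + revision where it was added + line number."""
    file_id: str
    revision: str
    line_no: int


class CodeMoved(NamedTuple):
    """Hypothetical 'code moved' metadatum: lines a through b of a revision."""
    source_file_id: str
    source_revision: str
    first_line: int
    last_line: int

    def line_ids(self):
        """Expand the recorded range into the individual line ids it covers."""
        return [
            LineId(self.source_file_id, self.source_revision, n)
            for n in range(self.first_line, self.last_line + 1)
        ]


move = CodeMoved("original.file-0001", "rev-xy", 10, 12)
moved_lines = move.line_ids()
```

A merge algorithm could in principle look up these ids to see that a line
in the new file is the same line as one in the old file, rather than an
unrelated addition.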

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>