path tokens

Thu Mar 15 02:26:40 GMT 2007

I'd like us to consider overhauling the concept of fileids. I think
fileids serve an important purpose, in letting us unambiguously talk
about a versioned path in multiple trees when it has been
renamed/deleted/etc. But as currently defined they restrict us nearly as
much as they simplify our code for merge etc.

So here are a few ideas that I have about the shape of a new tool, which
I'll call path tokens, to avoid confusion with file ids [a better name
is welcome]. I don't intend to talk about implementation yet - partly
because I don't have one in mind, but mainly because I want us to agree
on the *goals* first: there's no point talking about an implementation
until we agree on what we want to achieve. I propose a multi-step plan
for tackling this problem:
 - identify the problems/use cases to solve.
 - design acceptable semantics for the new functionality that we've
decided we want.
 - design an implementation that can supplant/extend file_ids to deliver
the agreed semantics.
 - go forth and implement.

path tokens should:
 * For currently supported cases, have no more corner cases than
file-ids.
 * allow us to support parallel imports better than file-ids.
 * allow us to support copies as first-class operations.
 * allow us to support 'two versioned paths become one versioned path'.
 * allow us to compare two trees with no reference to historical data.

path tokens should not:
 * increase storage size proportional to history or tree size. Note that
this isn't the same as saying 'they should have fixed size'.

Now, for some justification for the above. 

Corner cases: file_ids make a lot of code extremely simple. Asking
whether two files are the same, taking renames into account, is very
simple in a file id world: just compare the ids; if they are the same,
then yes, if not, then no. Applying a delta to a versioned path is easy:
look up the fileid the delta was made against in the target tree, and
apply. Keeping this simple way of talking about versioned things is
extremely important in my opinion.
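
To make that concrete, here is a minimal sketch of the two operations
just described. It is only an illustration: it assumes a tree can be
modelled as a plain dict mapping file id to path, and every name in it
(path_to_id, same_file, delta_target, the sample ids) is hypothetical
rather than anything in bzrlib.

```python
# Hypothetical model: a tree is a dict of {file_id: path}.
# None of these names are real bzrlib APIs.


def path_to_id(tree, path):
    """Return the file id versioned at `path` in `tree`, or None."""
    for file_id, versioned_path in tree.items():
        if versioned_path == path:
            return file_id
    return None


def same_file(tree_a, path_a, tree_b, path_b):
    """Are the two paths the same versioned file, renames included?"""
    id_a = path_to_id(tree_a, path_a)
    return id_a is not None and id_a == path_to_id(tree_b, path_b)


def delta_target(target_tree, file_id):
    """Path in `target_tree` that a delta against `file_id` applies to."""
    return target_tree.get(file_id)


# README was renamed between the two trees; its id did not change.
old_tree = {"readme-1": "README", "main-1": "src/main.c"}
new_tree = {"readme-1": "README.txt", "main-1": "src/main.c"}
```

With these sample trees, same_file(old_tree, "README", new_tree,
"README.txt") is true despite the rename, and a delta made against
"readme-1" lands on "README.txt" in new_tree.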

Parallel imports: There are many cases where parallel imports occur.
These imports make it difficult to really work in a decentralised
manner. Conversions from CVS, SVN, GNU Arch, imports from tarballs,
application of regular patches (which create files), all exhibit the
parallel import problem, which is that it's desirable for two different
imports to be able to be merged and talked about as though they are the
same project, when it is hard for bzr to actually know that they are. We
currently jump through a lot of hoops to achieve [nearly] identical output
here. If we had the ability to take two separate trees which happen to
have paths that users consider the same, and commit a record somewhere
that identifies which paths in these trees should be treated as the
same, it would be possible to merge, and replay, correctly between those
trees. This has been talked about in the past under the term 'file id
aliases'. This would allow a dramatic simplification of the user
experience when converting from systems, like tarballs and CVS, where a
repeatable conversion is essentially impossible.
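
As a rough picture of what such a record could look like (not a
proposed implementation, which is deliberately out of scope here), the
sketch below keeps alias groups in a small union-find structure. The
class name AliasRecord, its methods, and the sample ids are all
assumptions made up for illustration.

```python
class AliasRecord:
    """Hypothetical record of file ids that users consider the same."""

    def __init__(self):
        # Maps each known file id to its parent in the alias forest.
        self._parent = {}

    def _find(self, file_id):
        """Return the representative id of the group `file_id` is in."""
        self._parent.setdefault(file_id, file_id)
        while self._parent[file_id] != file_id:
            # Path halving: point at the grandparent as we walk up.
            self._parent[file_id] = self._parent[self._parent[file_id]]
            file_id = self._parent[file_id]
        return file_id

    def alias(self, id_a, id_b):
        """Record that `id_a` and `id_b` name the same versioned path."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, id_a, id_b):
        """True if the two ids have been recorded as aliases."""
        return self._find(id_a) == self._find(id_b)


# Two independent imports of one project gave Makefile different ids.
record = AliasRecord()
record.alias("makefile-import-a", "makefile-import-b")
```

After the alias call, record.same("makefile-import-a",
"makefile-import-b") is true, so a merge could treat the two paths as
one; ids that were never aliased still compare as different.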

Copies: This is an oft requested feature. I think it comes up at least
monthly on IRC, and it's a real issue when representing what other VCS
systems like SVN actually do to perform 'renames'. This isn't to say that
we want to represent SVN renames as copy and delete (I think that is
fugly), but we currently cannot accurately convert svn repositories that
do copies and *do not* delete. Copies also make sense for some user
operations, like splitting a file's contents, or taking a file like
'COPYING' that does not change often and putting it into other locations
or trees. Telling people to use symlinks, or to remember to manage
separate files, is IMO a reflection on our limits, not a reflection on
what we *should* allow.

Two versioned paths become one: This is mostly covered in my text about
parallel imports. While not quite the same thing they are closely
related. Specifically, there are use cases such as 'combining two source
files' which are independent from the parallel import case, and also
worth supporting, if we can clearly document sane behaviour.

No reference to historical data: Accessing lots of historical data is
expensive - it means performance degrades as history accumulates.
Additionally, in order to support history horizons, which is a proposal
that we allow people to set a strict limit on what historical data is
available to bzr, we need to be able to identify 'these are the same'
across trees without necessarily having access to a common ancestor.

storage size: A naive implementation approach to supporting both file
copying and file combining without history searches may well result in
rapidly increasing storage requirements, so while we are not yet
discussing implementation, this is a constraint on the implementation.

Now, if we agree on the above *in principle*, I'll put this up
somewhere, and move on to the next step (defining the semantics of the
new features).

To be clear: I don't intend this email to provoke immediate discussion on
'file copies must do X' or 'must NOT do X', rather on 'Yes, we should
support file copies IF we can design good semantics' vs 'No, even if we
can design good semantics for file copies, we must not support them'.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.