Path Tokens

Fri Jul 24 18:08:28 BST 2009

Hi all,

Chatting with Jelmer today at DebConf the subject of parallel imports
came up, as it is something we both want, for similar reasons. We
discussed the only proposed solution that I know of, path tokens, and
tried to infer how it would work from the little that we know. I'd
like to present a sketch of the system that we discussed for feedback.

== Path Tokens ==

Path tokens are an extension of file ids, that act the same in most
cases. Each file in a tree has a unique identifier separate from its
path. If an id exists in two trees then we infer that the file is
the same, regardless of the paths. The difference is that a path
token does not stay with a file for its lifetime.

There are allowed to be joins and forks in the history of a single
file. In order to acheive this then whenever a join occurs the resulting
file is assigned a new id, different from the two ids that joined.
Similarly, for a fork the two resulting ids are different to the
id that forked.

This means that in order to know whether two ids in two trees are the
same you have to take their history in to account. If the ids are the
same then the file is the same. If the ids are different, and neither
id is present in the other tree, then you need to perform a lookup
of their histories to see if they overlap.

== History lookup ==

Becuase the ids always change whenever a fork or join occurs, and we
assume that ids are globally unique, the lookup can be independent of
history (you don't need to consider only changes made in the revisions
that exist between the two trees). This not only speeds the process up,
but also means that you can still compare two tress without reference to
the revision history.

The id changes can be stored in a key-value map, keyed by id. A join
would be

   new-id: (old-id-1, old-id-2)

a fork would be:

   new-id-1: old-id
   new-id-2: old-id

(we may or may not want to store the inverted map as well).

Then given two trees, if there is an id that exists in one tree but
not the other, you look it up in the map and retrieve any related
ids, and check them in the tree. You recursively repeat this until you
have found an id that is in the other tree, or you have no more ids to
check.

This only impacts cases where an id is present in one tree but not
the other, and will be cheap for the case where a file was removed.
Otherwise it is O(number of forks/joins in the file's history), which
is reasonable.

This is a rather simple store, and wouldn't be difficult to implement
in a way that would perform reasonably well.

== Handling in merge ==

The lookups can be done and file changes applied appropriately
(discussed later), but there is one more thing to consider.

There are cases (parallel imports generally, including applying
patches from bugs), where the same file is created in two trees
at the same time. It would be possible to take path conflicts
like this at merge time and (if the contents are sufficiently
similar) automatically perform a file-join and merge the contents,
leaving a content conflict if necessary. This would solve the
issues that I am having, and would also remove a bit of confusion
that some users encounter when they hit such a case.

== Handling in diff/status ==

diff can compare the trees and also display information about copies
and joins, though how to do this is not clear, do you want to see
diff of both copies against the ancestor? If so, how do we make
"bzr diff | patch" work?

== Handling in annotate ==

Annotate could use the information to also show information from
the files before the split/join, though it may need to become
a bit smarter to identify which parts of which files ended up
where.

== Merging over copies and joins ==

Relatedly, merging over copies and joins is the major open question.
I believe it's not possible to do this in a way that always gives
what you want. Therefore I think we would need to discuss what
the default should be, and then come up with a "remerge" or similar
that allows you to change the default if it isn't what you want.

In addition, we could consider storing intent with the changes, so
"bzr copy" and "bzr split-file" would indicate different actions
to perform on merge, though this still wouldn't be enough to always
do the right thing.

These cases only arise for join when you have two files from the same
tree that join, the parallel import case does not naturally create
this situation.

Therefore we could put that part in, without the UI to do copies/joins
in-tree, and then have the initial code leave some sort of conflict
if it encounters such a case, allowing for incremental improvement.

== Storage implications ==

I said above that storing the mappings themselves wouldn't be too
difficult, but it does have implications for the existing storage.

If you have two trees that join all the files on every merge then you
can end up with many changes of ids over time, and this will
lead to inefficiencies in the storage formats that we currently
have. This may mean we need to change some of that in order
to accommodate this.

Comments welcome.

Thanks,

James