Ideas for Path Tokens

Wed Aug 20 19:49:07 BST 2008

Hi!

I'm interested in file copy, and was reading up on the involved specs:
https://blueprints.launchpad.net/bzr/+spec/filecopies
http://bazaar-vcs.org/BzrFileCopies
https://blueprints.launchpad.net/bzr/+spec/path-tokens
http://bazaar-vcs.org/DraftSpecs/PathTokens

I have some ideas how such "path tokens" could be implemented, and would 
like to offer them for discussion. I'll first outline the basic idea of 
the implementation, then show how different operations would deal with 
them, then I'll have some remarks and last I'll compare this to the set 
of requirements given in the spec draft.

IMPLEMENTATION:

I would propose to use rooted directed acyclic graphs (DAGs) of opaque 
file IDs as file tokens. The root of the DAG would be the current ID of 
the file. For each ID there is a list of zero or more ancestor IDs, from 
which the current file descended.

In case that a current file ID is present in both branches involved in 
an operation, I think I would wish the operations to be performed in the 
way they are right now, i.e. ignoring any ancestors. This should cover 
most of the operations, and help keep them as simple as they are now. 
Special care would only have to be taken when a file ID is present in 
only one of the trees. Then the DAG would have to be consulted.

OPERATIONS:

A. copy

Command:
$ bzr copy a b

DAGs:
a: newIdA <- [oldId]
b: newIdB <- [oldId]

Both files get new IDs, with a single ancestor each. When later merged 
with a different branch, changes applied to oldId should be included in 
both copies (and thus need more elaborate handling), while changes 
applied to a new file only affect one copy and can be done easier.

B. join

Command:
$ bzr join a b

DAGs:
a: newIdA <- [oldIdA]
b: newIdB <- [oldIdB, oldIdA]

Both files get new IDs, again to easily handle changes that only affect 
a single file. The new file b, which is the union of the two old files a 
and b, has both their IDs in its history.
Of course you can think of flags to the hypothetic join command that 
would drop a in the process, or write the joined file to a completely 
new name, but this here is the most general case I think.

C. log

Commands:
$ touch a b c
$ bzr commit -m one
$ bzr copy c d
$ bzr commit -m two
$ echo foo > a
$ bzr commit -m three
$ bzr join b c
$ bzr commit -m four
$ bzr log a
   -> list one and three
$ bzr log b
   -> list one
$ bzr log c
   -> list one and four
$ bzr log d
   -> list one and two

DAGs:
a: a at one
b: b at four <- b at one
c: c at four <- [c at two <- [c at one], b at one]
d: d at two <- [c at one]

IDs touched by each commit:
one:   a at one, b at one, c at one
two:   d at two
three: a at one
four:  c at four

For a log, all ancestors of the files in question should be collected, 
and all log lines touching any of these ancestors should be reported. 
When a modification changes a file ID but leaves both its contents and 
path unchanged, then I believe that modification should not be reported 
as changing that file by default. Therefore I have not listed these ID 
changes in the listing above.

D. merge with common ancestor

Command:
$ bzr merge ../foo

For the set of all current file IDs of both trees, one of these 
conditions holds:

1) File ID is current in both trees => merge as currently implemented, 
without any further considerations for file DAGs.

2) File ID is current in one of the trees, but not in the other. Let's 
say file ID aaa is present in branch foo but not in branch bar.

2.1) File ID aaa is present in the set of ALL (i.e. current and 
historical) file IDs in bar. Then aaa is a common ancestor, and all 
changes applied to aaa in foo should be applied to all descendants of 
aaa in bar.

2.2) File ID aaa is not present in bar at all, but one or more ancestors 
of aaa are present in bar. Then all those ancestors are common, and all 
operations (content and copy/join) performed to their descendants in 
both branches should be merged. As a result of the merger, all merged 
files will be assigned new IDs, referencing the involved ancestors. 
Details should be discussed in a follow-up mail.

2.3) No ancestor of aaa is present in bar. Then aaa is a new file, and 
can be added with this ID, including the whole DAG referenced by it.

E. merge tips without common ancestor (e.g. parallel import)

Command:
$ bzr merge --join ../foo

In this case, files with the same path will be joined one by one as by 
the join command, while files only present in foo will be added to the 
current branch and keep their ID, except probably for cases where this 
file ID would be duplicate, and a switch of file ID is necessary to 
ensure uniqueness.

REMARKS:

A given file ID in a given branch is either "current", "historical" or 
"invalid". The states of "current" and "invalid" are already present 
with the current file IDs. The "historical" state could probably in most 
cases be handled like one of the other two, which one depending on the 
actual application. This might help with implementation.

A given path in a given revision of a given branch will keep its file ID 
and the associated DAG for all eternity. I had some worries that 
modifications to some remote branch could introduce new IDs that my 
local branch wouldn't know about, but those IDs would only hit me in the 
revision where I merge from the remote branch, so that is the time to 
introduce the new IDs in my branch as well, and the IDs before the merge 
stay unchanged.

I guess that some care should be taken that performing the same 
operation on the same set of files will result in the generation of the 
same "current" file IDs. This could probably be achieved by taking the 
old file IDs and the revision IDs of the last modifications to these 
files and put all of this through some digest algorithm. As an 
alternative you could digest the contents of the two files before the 
merger.

REQUIREMENTS:

According to the spec, path tokens should:

* For currently supported cases, have no more corner cases than file-ids.

Done, as currently supported cases can still be handled with file IDs alone.

* allow us to support parallel imports better than file-ids.

Done, using the "merge --join" operation discussed in point E above. See 
also my mail "Use Case for Path Tokens" I'm about to send together with 
this one here.

* allow us to support copies as first-class operations.

Done, using multiple file IDs sharing an ancestor

* allow us to support 'two versioned paths become one versioned path'.

Done, using a file ID with multiple ancestors

* allow us to compare two trees with no reference historical data.

Plain diff (without bzr) does that. I guess you would want to compare 
branches, not trees. See again "Use Case for Path Tokens" about ways to 
establish a common ancestor for such cases.

Path tokens should not:

* increase storage size proportional to history or tree size. Note that 
this isn't the same as saying 'they should have fixed size'.

Done, as only copy/join operations (and merges with such operations in 
both branches involved) introduce new file IDs, and if every file ID 
references its list of ancestors, the whole DAG with all files can be 
represented fairly efficiently.

I don't have enough time right now to start implementing this kind of 
thing, but I would consider it a useful approach worth discussion. I'd 
be especially interested in use cases you think would be problematic 
with this approach.

Greetings,
  Martin von Gagern

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 260 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080820/62c4be85/attachment-0001.pgp