Ideas for Path Tokens
Martin von Gagern
Martin.vGagern at gmx.net
Wed Aug 20 19:49:07 BST 2008
Hi!
I'm interested in file copy, and was reading up on the involved specs:
https://blueprints.launchpad.net/bzr/+spec/filecopies
http://bazaar-vcs.org/BzrFileCopies
https://blueprints.launchpad.net/bzr/+spec/path-tokens
http://bazaar-vcs.org/DraftSpecs/PathTokens
I have some ideas how such "path tokens" could be implemented, and would
like to offer them for discussion. I'll first outline the basic idea of
the implementation, then show how different operations would deal with
them, then I'll have some remarks and last I'll compare this to the set
of requirements given in the spec draft.
IMPLEMENTATION:
I would propose to use rooted directed acyclic graphs (DAGs) of opaque
file IDs as file tokens. The root of the DAG would be the current ID of
the file. For each ID there is a list of zero or more ancestor IDs, from
which the current file descended.
In case that a current file ID is present in both branches involved in
an operation, I think I would wish the operations to be performed in the
way they are right now, i.e. ignoring any ancestors. This should cover
most of the operations, and help keep them as simple as they are now.
Special care would only have to be taken when a file ID is present in
only one of the trees. Then the DAG would have to be consulted.
OPERATIONS:
A. copy
Command:
$ bzr copy a b
DAGs:
a: newIdA <- [oldId]
b: newIdB <- [oldId]
Both files get new IDs, with a single ancestor each. When later merged
with a different branch, changes applied to oldId should be included in
both copies (and thus need more elaborate handling), while changes
applied to a new file only affect one copy and can be done easier.
B. join
Command:
$ bzr join a b
DAGs:
a: newIdA <- [oldIdA]
b: newIdB <- [oldIdB, oldIdA]
Both files get new IDs, again to easily handle changes that only affect
a single file. The new file b, which is the union of the two old files a
and b, has both their IDs in its history.
Of course you can think of flags to the hypothetic join command that
would drop a in the process, or write the joined file to a completely
new name, but this here is the most general case I think.
C. log
Commands:
$ touch a b c
$ bzr commit -m one
$ bzr copy c d
$ bzr commit -m two
$ echo foo > a
$ bzr commit -m three
$ bzr join b c
$ bzr commit -m four
$ bzr log a
-> list one and three
$ bzr log b
-> list one
$ bzr log c
-> list one and four
$ bzr log d
-> list one and two
DAGs:
a: a at one
b: b at four <- b at one
c: c at four <- [c at two <- [c at one], b at one]
d: d at two <- [c at one]
IDs touched by each commit:
one: a at one, b at one, c at one
two: d at two
three: a at one
four: c at four
For a log, all ancestors of the files in question should be collected,
and all log lines touching any of these ancestors should be reported.
When a modification changes a file ID but leaves both its contents and
path unchanged, then I believe that modification should not be reported
as changing that file by default. Therefore I have not listed these ID
changes in the listing above.
D. merge with common ancestor
Command:
$ bzr merge ../foo
For the set of all current file IDs of both trees, one of these
conditions holds:
1) File ID is current in both trees => merge as currently implemented,
without any further considerations for file DAGs.
2) File ID is current in one of the trees, but not in the other. Let's
say file ID aaa is present in branch foo but not in branch bar.
2.1) File ID aaa is present in the set of ALL (i.e. current and
historical) file IDs in bar. Then aaa is a common ancestor, and all
changes applied to aaa in foo should be applied to all descendants of
aaa in bar.
2.2) File ID aaa is not present in bar at all, but one or more ancestors
of aaa are present in bar. Then all those ancestors are common, and all
operations (content and copy/join) performed to their descendants in
both branches should be merged. As a result of the merger, all merged
files will be assigned new IDs, referencing the involved ancestors.
Details should be discussed in a follow-up mail.
2.3) No ancestor of aaa is present in bar. Then aaa is a new file, and
can be added with this ID, including the whole DAG referenced by it.
E. merge tips without common ancestor (e.g. parallel import)
Command:
$ bzr merge --join ../foo
In this case, files with the same path will be joined one by one as by
the join command, while files only present in foo will be added to the
current branch and keep their ID, except probably for cases where this
file ID would be duplicate, and a switch of file ID is necessary to
ensure uniqueness.
REMARKS:
A given file ID in a given branch is either "current", "historical" or
"invalid". The states of "current" and "invalid" are already present
with the current file IDs. The "historical" state could probably in most
cases be handled like one of the other two, which one depending on the
actual application. This might help with implementation.
A given path in a given revision of a given branch will keep its file ID
and the associated DAG for all eternity. I had some worries that
modifications to some remote branch could introduce new IDs that my
local branch wouldn't know about, but those IDs would only hit me in the
revision where I merge from the remote branch, so that is the time to
introduce the new IDs in my branch as well, and the IDs before the merge
stay unchanged.
I guess that some care should be taken that performing the same
operation on the same set of files will result in the generation of the
same "current" file IDs. This could probably be achieved by taking the
old file IDs and the revision IDs of the last modifications to these
files and put all of this through some digest algorithm. As an
alternative you could digest the contents of the two files before the
merger.
REQUIREMENTS:
According to the spec, path tokens should:
* For currently supported cases, have no more corner cases than file-ids.
Done, as currently supported cases can still be handled with file IDs alone.
* allow us to support parallel imports better than file-ids.
Done, using the "merge --join" operation discussed in point E above. See
also my mail "Use Case for Path Tokens" I'm about to send together with
this one here.
* allow us to support copies as first-class operations.
Done, using multiple file IDs sharing an ancestor
* allow us to support 'two versioned paths become one versioned path'.
Done, using a file ID with multiple ancestors
* allow us to compare two trees with no reference historical data.
Plain diff (without bzr) does that. I guess you would want to compare
branches, not trees. See again "Use Case for Path Tokens" about ways to
establish a common ancestor for such cases.
Path tokens should not:
* increase storage size proportional to history or tree size. Note that
this isn't the same as saying 'they should have fixed size'.
Done, as only copy/join operations (and merges with such operations in
both branches involved) introduce new file IDs, and if every file ID
references its list of ancestors, the whole DAG with all files can be
represented fairly efficiently.
I don't have enough time right now to start implementing this kind of
thing, but I would consider it a useful approach worth discussion. I'd
be especially interested in use cases you think would be problematic
with this approach.
Greetings,
Martin von Gagern
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 260 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080820/62c4be85/attachment-0001.pgp
More information about the bazaar
mailing list