Ideas for Path Tokens
Robert Collins
robertc at robertcollins.net
Fri Aug 22 02:45:32 BST 2008
On Wed, 2008-08-20 at 20:49 +0200, Martin von Gagern wrote:
> Hi!
>
> I'm interested in file copy, and was reading up on the involved specs:
Have you read the thread "Defining semantics for copying and combing
files/directories/symlinks.", which is more up to date than the wiki
pages I think.
> IMPLEMENTATION:
>
> I would propose to use rooted directed acyclic graphs (DAGs) of opaque
> file IDs as file tokens. The root of the DAG would be the current ID of
> the file. For each ID there is a list of zero or more ancestor IDs, from
> which the current file descended.
I don't think you can assume a single root. Certainly you should be able
to generate a DAG for path tokens. I don't think a DAG representation
will scale well though.
> In case that a current file ID is present in both branches involved in
> an operation, I think I would wish the operations to be performed in the
> way they are right now, i.e. ignoring any ancestors. This should cover
> most of the operations, and help keep them as simple as they are now.
> Special care would only have to be taken when a file ID is present in
> only one of the trees. Then the DAG would have to be consulted.
Sure.
> OPERATIONS:
>
> A. copy
> B. join
I've been calling this command 'combine' because 'join' is taken.
> Command:
> $ bzr join a b
>
> DAGs:
> a: newIdA <- [oldIdA]
> b: newIdB <- [oldIdB, oldIdA]
>
> Both files get new IDs, again to easily handle changes that only affect
> a single file. The new file b, which is the union of the two old files a
> and b, has both their IDs in its history.
> Of course you can think of flags to the hypothetic join command that
> would drop a in the process, or write the joined file to a completely
> new name, but this here is the most general case I think.
A should not have a file id change occur, instead a is simply removed.
> C. log
> For a log, all ancestors of the files in question should be collected,
> and all log lines touching any of these ancestors should be reported.
> When a modification changes a file ID but leaves both its contents and
> path unchanged, then I believe that modification should not be reported
> as changing that file by default. Therefore I have not listed these ID
> changes in the listing above.
I think we should show when a file is copied from, it will help people
debugging or understanding their code.
> D. merge with common ancestor
>
> Command:
> $ bzr merge ../foo
>
> For the set of all current file IDs of both trees, one of these
> conditions holds:
Note that merge operates on delta(BASE->OTHER) and delta(BASE->THIS)
rather than on the full set of ids, we try to define operations
proportional to work done, rather than to tree or history size.
> 1) File ID is current in both trees => merge as currently implemented,
> without any further considerations for file DAGs.
sire
> 2) File ID is current in one of the trees, but not in the other. Let's
> say file ID aaa is present in branch foo but not in branch bar.
>
> 2.1) File ID aaa is present in the set of ALL (i.e. current and
> historical) file IDs in bar. Then aaa is a common ancestor, and all
> changes applied to aaa in foo should be applied to all descendants of
> aaa in bar.
>
> 2.2) File ID aaa is not present in bar at all, but one or more ancestors
> of aaa are present in bar. Then all those ancestors are common, and all
> operations (content and copy/join) performed to their descendants in
> both branches should be merged. As a result of the merger, all merged
> files will be assigned new IDs, referencing the involved ancestors.
> Details should be discussed in a follow-up mail.
>
> 2.3) No ancestor of aaa is present in bar. Then aaa is a new file, and
> can be added with this ID, including the whole DAG referenced by it.
I'm having trouble following this description :(.
> E. merge tips without common ancestor (e.g. parallel import)
>
> Command:
> $ bzr merge --join ../foo
>
> In this case, files with the same path will be joined one by one as by
> the join command, while files only present in foo will be added to the
> current branch and keep their ID, except probably for cases where this
> file ID would be duplicate, and a switch of file ID is necessary to
> ensure uniqueness.
If the file id would be duplicate, its the same file by definition.
Either file id *is* the unique id, or its not.
> REMARKS:
>
> A given file ID in a given branch is either "current", "historical" or
> "invalid". The states of "current" and "invalid" are already present
> with the current file IDs. The "historical" state could probably in most
> cases be handled like one of the other two, which one depending on the
> actual application. This might help with implementation.
I don't know about these states, perhaps you could define them - we
don't have them today.
> I guess that some care should be taken that performing the same
> operation on the same set of files will result in the generation of the
> same "current" file IDs. This could probably be achieved by taking the
> old file IDs and the revision IDs of the last modifications to these
> files and put all of this through some digest algorithm. As an
> alternative you could digest the contents of the two files before the
> merger.
I don't see any reason for this paragraph.
> * allow us to compare two trees with no reference historical data.
>
> Plain diff (without bzr) does that. I guess you would want to compare
> branches, not trees. See again "Use Case for Path Tokens" about ways to
> establish a common ancestor for such cases.
plain diff doesn't handle renames; implementing a dag as you suggest
will require a common ancestor and thus needs historical data - it fails
on this requirement.
> Path tokens should not:
>
> * increase storage size proportional to history or tree size. Note that
> this isn't the same as saying 'they should have fixed size'.
>
> Done, as only copy/join operations (and merges with such operations in
> both branches involved) introduce new file IDs, and if every file ID
> references its list of ancestors, the whole DAG with all files can be
> represented fairly efficiently.
By requiring historical lookups, which is an explicit constraint to
avoid.
> I don't have enough time right now to start implementing this kind of
> thing, but I would consider it a useful approach worth discussion. I'd
> be especially interested in use cases you think would be problematic
> with this approach.
Thanks for your thoughts on this. They are quite similar to mine, and I
think thats a good thing:).
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080822/abd45be0/attachment.pgp
More information about the bazaar
mailing list