Ideas for Path Tokens

Fri Aug 22 02:45:32 BST 2008

On Wed, 2008-08-20 at 20:49 +0200, Martin von Gagern wrote:
> Hi!
> 
> I'm interested in file copy, and was reading up on the involved specs:

Have you read the thread "Defining semantics for copying and combing
files/directories/symlinks.", which is more up to date than the wiki
pages I think.

> IMPLEMENTATION:
> 
> I would propose to use rooted directed acyclic graphs (DAGs) of opaque 
> file IDs as file tokens. The root of the DAG would be the current ID of 
> the file. For each ID there is a list of zero or more ancestor IDs, from 
> which the current file descended.

I don't think you can assume a single root. Certainly you should be able
to generate a DAG for path tokens. I don't think a DAG representation
will scale well though.

> In case that a current file ID is present in both branches involved in 
> an operation, I think I would wish the operations to be performed in the 
> way they are right now, i.e. ignoring any ancestors. This should cover 
> most of the operations, and help keep them as simple as they are now. 
> Special care would only have to be taken when a file ID is present in 
> only one of the trees. Then the DAG would have to be consulted.

Sure.

> OPERATIONS:
> 
> A. copy

> B. join

I've been calling this command 'combine' because 'join' is taken.

> Command:
> $ bzr join a b
> 
> DAGs:
> a: newIdA <- [oldIdA]
> b: newIdB <- [oldIdB, oldIdA]
> 
> Both files get new IDs, again to easily handle changes that only affect 
> a single file. The new file b, which is the union of the two old files a 
> and b, has both their IDs in its history.
> Of course you can think of flags to the hypothetic join command that 
> would drop a in the process, or write the joined file to a completely 
> new name, but this here is the most general case I think.

A should not have a file id change occur, instead a is simply removed.

> C. log

> For a log, all ancestors of the files in question should be collected, 
> and all log lines touching any of these ancestors should be reported. 
> When a modification changes a file ID but leaves both its contents and 
> path unchanged, then I believe that modification should not be reported 
> as changing that file by default. Therefore I have not listed these ID 
> changes in the listing above.

I think we should show when a file is copied from, it will help people
debugging or understanding their code.

> D. merge with common ancestor
> 
> Command:
> $ bzr merge ../foo
> 
> For the set of all current file IDs of both trees, one of these 
> conditions holds:

Note that merge operates on delta(BASE->OTHER) and delta(BASE->THIS)
rather than on the full set of ids, we try to define operations
proportional to work done, rather than to tree or history size.

> 1) File ID is current in both trees => merge as currently implemented, 
> without any further considerations for file DAGs.

sire

> 2) File ID is current in one of the trees, but not in the other. Let's 
> say file ID aaa is present in branch foo but not in branch bar.
> 
> 2.1) File ID aaa is present in the set of ALL (i.e. current and 
> historical) file IDs in bar. Then aaa is a common ancestor, and all 
> changes applied to aaa in foo should be applied to all descendants of 
> aaa in bar.
> 
> 2.2) File ID aaa is not present in bar at all, but one or more ancestors 
> of aaa are present in bar. Then all those ancestors are common, and all 
> operations (content and copy/join) performed to their descendants in 
> both branches should be merged. As a result of the merger, all merged 
> files will be assigned new IDs, referencing the involved ancestors. 
> Details should be discussed in a follow-up mail.
> 
> 2.3) No ancestor of aaa is present in bar. Then aaa is a new file, and 
> can be added with this ID, including the whole DAG referenced by it.

I'm having trouble following this description :(.

> E. merge tips without common ancestor (e.g. parallel import)
> 
> Command:
> $ bzr merge --join ../foo
> 
> In this case, files with the same path will be joined one by one as by 
> the join command, while files only present in foo will be added to the 
> current branch and keep their ID, except probably for cases where this 
> file ID would be duplicate, and a switch of file ID is necessary to 
> ensure uniqueness.

If the file id would be duplicate, its the same file by definition.
Either file id *is* the unique id, or its not.

> REMARKS:
> 
> A given file ID in a given branch is either "current", "historical" or 
> "invalid". The states of "current" and "invalid" are already present 
> with the current file IDs. The "historical" state could probably in most 
> cases be handled like one of the other two, which one depending on the 
> actual application. This might help with implementation.

I don't know about these states, perhaps you could define them - we
don't have them today.

> I guess that some care should be taken that performing the same 
> operation on the same set of files will result in the generation of the 
> same "current" file IDs. This could probably be achieved by taking the 
> old file IDs and the revision IDs of the last modifications to these 
> files and put all of this through some digest algorithm. As an 
> alternative you could digest the contents of the two files before the 
> merger.

I don't see any reason for this paragraph.

> * allow us to compare two trees with no reference historical data.
> 
> Plain diff (without bzr) does that. I guess you would want to compare 
> branches, not trees. See again "Use Case for Path Tokens" about ways to 
> establish a common ancestor for such cases.

plain diff doesn't handle renames; implementing a dag as you suggest
will require a common ancestor and thus needs historical data - it fails
on this requirement.

> Path tokens should not:
> 
> * increase storage size proportional to history or tree size. Note that 
> this isn't the same as saying 'they should have fixed size'.
> 
> Done, as only copy/join operations (and merges with such operations in 
> both branches involved) introduce new file IDs, and if every file ID 
> references its list of ancestors, the whole DAG with all files can be 
> represented fairly efficiently.

By requiring historical lookups, which is an explicit constraint to
avoid.

> I don't have enough time right now to start implementing this kind of 
> thing, but I would consider it a useful approach worth discussion. I'd 
> be especially interested in use cases you think would be problematic 
> with this approach.

Thanks for your thoughts on this. They are quite similar to mine, and I
think thats a good thing:).

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080822/abd45be0/attachment.pgp