Proposal: use predictable file-ids

Fri Aug 12 13:13:54 BST 2005

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John A Meinel wrote:
> Aaron Bentley wrote:
> 
>>Hi all,
>>
>>I'd like to suggest that we make file-ids predictable.  There are two
>>advantages I can see:

> I suppose. I think for (1) an easier and more beneficial change would be
> to make text-ids predictable.

Sure.  In fact, I was assuming that we'd already agreed to that.

>>My suggestion:
>>REVISION-ID/pathname/at/commit/time
>>
>>This is a UUID because revision ids are unique and pathnames are unique
>>for a given revision.  If we can assign predictable IDs to revisions
>>from other SCMs (e.g. based on foreign rev-ids and/or hashes of tree
>>state), then this makes it fairly easy to ensure that repeated
>>operations produce the same result.
>>
>>There are two problems with this:
>>
>>1. path contains forbidden characters
>>2. revision-ID is not known until commit time
>>
>>The first can be worked around with suitable escaping.
>>
>>The second is not easy to work around.  If we use the parent ID instead
>>of the commit ID, we no longer have a UUID, because more than one branch
>>can have that parent and create that file.  Yet we are required to have
>>an ID for all versioned files.
> 
> 
> The problem with this statement is that earlier you stated:
> "it is easier to make imports from other SCMs produce identical branches
> every time"
> 
> Which is in a lot of ways indistinguishable from creating a new file and
> checking it in.

If you have two imports of the same tree, but the file-ids differ, you
cannot merge them.  Doing an import is definitely like doing a regular
commit, but if IDs are based on revision names, and you force revision
names to be the same for all imports of that tree, file-ids and text-ids
will match up automatically.
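
To make that concrete, here is a minimal sketch of deriving a file-id
from a revision-id and the path at commit time.  The function name, the
choice of percent-escaping for forbidden characters, and the sample
revision-id are assumptions for illustration only; nothing here is
existing bzr code.

```python
from urllib.parse import quote


def make_file_id(revision_id: str, path: str) -> str:
    """Build a predictable file-id of the form REVISION-ID/escaped-path.

    Hypothetical helper: percent-escaping is just one possible way to
    deal with forbidden characters in the path.
    """
    # safe="" also escapes '/', so the path becomes a single component.
    return f"{revision_id}/{quote(path, safe='')}"


# Two independent imports that agree on the revision-id and the path
# arrive at the same file-id without coordinating.
first_import = make_file_id("rev-1", "src/foo bar.c")
second_import = make_file_id("rev-1", "src/foo bar.c")
print(first_import == second_import)
```

Because the id is a pure function of its inputs, matching ids depend
only on the importers agreeing on revision names.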

> If you really wanted, you could always go the darcs/svn route, and just
> get rid of ids entirely. As long as you have enough history to compare
> between two trees you can match up files without an id.

In that case, the way you determine that branch1/foo and branch2/bar are
the same is by determining that they had the same path in a prior
revision.  So you can see what inspired my choice of predictable ID.
OTOH, those requirements make it hard to merge branches with no common
history.  It makes us dependent on history, which we don't always have
enough of.  Also, a file that is deleted in one revision and restored in
the next will be treated as a completely different file.
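
A rough sketch of that id-less matching, under the assumption that each
revision records its renames and additions (the data layout and function
names are invented for this example):

```python
def path_in_ancestor(path, steps):
    """Trace `path` back to its name in the common ancestor.

    `steps` lists the revisions between the tip and the ancestor, newest
    first.  Each step is a (renames, added) pair: `renames` maps a new
    path to its old path, `added` is the set of paths created there.
    Returns None if the file did not exist in the ancestor.
    """
    for renames, added in steps:
        if path in added:
            return None  # created after the ancestor: no shared identity
        path = renames.get(path, path)
    return path


def same_file(path1, steps1, path2, steps2):
    """Two paths are the same file if they trace back to one ancestor path."""
    origin1 = path_in_ancestor(path1, steps1)
    origin2 = path_in_ancestor(path2, steps2)
    return origin1 is not None and origin1 == origin2


# branch1 renamed old -> foo; branch2 renamed old -> bar one revision earlier.
branch1 = [({"foo": "old"}, set())]
branch2 = [({}, set()), ({"bar": "old"}, set())]
print(same_file("foo", branch1, "bar", branch2))

# A file deleted and then restored shows up as an addition, so the trace
# stops and the restored file looks like a brand-new one.
restored = [({}, {"foo"}), ({}, set())]
print(path_in_ancestor("foo", restored))
```

The loop visits every revision back to the ancestor, and it has nothing
to work with when the two branches share no history.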

To me, an algorithm based on predicting filenames in previous trees
seems very brittle, and its cost scales linearly with the number of
revisions between the current one and the last common ancestor.  It also
makes it much harder to do 'file suturing', as they call it in the
Monotone world.  Suturing means causing Monotone to treat two files as
the same, even though they have different histories.  The plan for bzr,
by contrast, is simply to keep a list of id aliases.
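
As an illustration of how an alias list could work, here is a small
sketch.  The dict-based representation and the function names are
assumptions made for this example, not a description of bzr's actual
design, and the alias map is assumed to contain no cycles.

```python
def canonical_id(file_id, aliases):
    """Follow alias links until reaching an id that has no alias.

    `aliases` maps an aliased file-id to the id it should be treated as.
    Assumes the mapping is acyclic.
    """
    while file_id in aliases:
        file_id = aliases[file_id]
    return file_id


def same_file(id_a, id_b, aliases):
    """Two ids name the same file if they resolve to one canonical id."""
    return canonical_id(id_a, aliases) == canonical_id(id_b, aliases)


# Declare that the file created as rev-9/bar is really rev-1/foo.
aliases = {"rev-9/bar": "rev-1/foo"}
print(same_file("rev-9/bar", "rev-1/foo", aliases))
```

Joining two files this way needs no shared history, only an extra entry
in the alias map.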

> For text-ids, we can just use the escaped path + the revision of last
> change, which means that the only unique id we need is the revision-id.
> Everything else stems from that.

Yes, that's what I had in mind.  I was just extending it a bit more.
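
For text-ids, a sketch along the same lines might look like the
following.  The history format, the '/' separator, and the function name
are all assumptions for the example.

```python
from urllib.parse import quote


def make_text_id(path, history):
    """Return escaped-path + '/' + the revision that last changed `path`.

    `history` is a list of (revision_id, changed_paths) pairs, oldest
    first.  Raises KeyError if no revision ever touched the path.
    """
    for revision_id, changed_paths in reversed(history):
        if path in changed_paths:
            return f"{quote(path, safe='')}/{revision_id}"
    raise KeyError(path)


history = [
    ("rev-1", {"README", "src/a.c"}),
    ("rev-2", {"src/a.c"}),
]
print(make_text_id("README", history))
print(make_text_id("src/a.c", history))
```

The only input that has to be globally unique is the revision-id;
everything else follows from the tree contents.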

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFC/JD60F+nu1YWqI0RAowWAJ9SjQY00bpwO9hWWT/i5UtrV/lh/QCfYcid
qLyLKUKFxREs3vdvBrFC0Lc=
=kPOQ
-----END PGP SIGNATURE-----