Thoughts on file ids

Thu May 5 15:07:33 UTC 2011

On Tue, 2011-05-03 at 17:49 -0400, Aaron Bentley wrote: 
> On 11-05-03 10:35 AM, Jelmer Vernooij wrote:
> > File ids have a couple of known issues; in particular (my pet peeve)
> > they make parallel imports very problematic. They're also hard on
> > foreign branch implementations which don't natively have the concept of
> > file ids. And they're a blocker for file copies.
> > 
> > File ids currently seem to serve three fairly distinct purposes:
> > 
> > 1) as ways to get at files in the tree API
> > 
> > 2) as part of the keys to look up texts
> > 
> > 3) as a way to determine equivalent files between different trees in
> > merges
> That's about right, but 3) is about any operation involving 2 or more
> trees, not just merges.
Would it be correct to say all transform and delta operations? Are there
more operations that involve file ids across multiple trees?

I wonder if it would make sense to have a process before transform
operations to find renames/copies - was that what you had in mind? Such
a process in its simplest form could just return the existing file ids.

More advanced versions of it could use other information, such as
automatically inferring that two files are the same or a copy, using
revision pseudonyms (parallel imports).
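
A rough sketch of such a process, in plain Python rather than bzrlib
API. Everything here is a made-up assumption for illustration: a tree is
modelled as a dict mapping path -> (file_id, text), and the names
match_by_file_id and match_by_content are hypothetical. The first
function is the "simplest form" that only trusts existing file ids; the
second falls back to identical content for files whose ids differ, as a
parallel import would produce.

```python
import hashlib


def match_by_file_id(source, target):
    """Simplest matcher: pair up paths that carry the same existing file id."""
    target_by_id = {file_id: path for path, (file_id, _) in target.items()}
    return {path: target_by_id[file_id]
            for path, (file_id, _) in source.items()
            if file_id in target_by_id}


def match_by_content(source, target):
    """Hypothetical smarter matcher: fall back to identical content."""
    matches = match_by_file_id(source, target)
    matched_targets = set(matches.values())

    # Index the still-unmatched target files by the SHA-1 of their text.
    target_by_sha = {}
    for path, (_, text) in sorted(target.items()):
        if path not in matched_targets:
            target_by_sha.setdefault(hashlib.sha1(text).hexdigest(), path)

    # Pair each unmatched source file with a target that has the same text.
    for path, (_, text) in sorted(source.items()):
        if path in matches:
            continue
        candidate = target_by_sha.get(hashlib.sha1(text).hexdigest())
        if candidate is not None:
            matches[path] = candidate
    return matches


# Toy trees: README was renamed (same file id); a.py came from a parallel
# import, so it has the same content but an unrelated file id.
old = {"README": (b"readme-1", b"hello\n"),
       "src/a.py": (b"a-1", b"print(1)\n")}
new = {"README.txt": (b"readme-1", b"hello\n"),
       "lib/a.py": (b"a-imported-2", b"print(1)\n")}

print(match_by_file_id(old, new))
print(match_by_content(old, new))
```

With these toy trees the id-only matcher pairs just README with
README.txt, while the content-aware one also pairs src/a.py with
lib/a.py. Neither function touches storage; both only consume two trees
and return a path mapping that a transform or delta could then use.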

The nice thing about this is that it doesn't have to involve the storage
layer, though it might still be useful to store some extra information.

This would help tremendously with parallel imports. 

The main problem I see with something like this is that it doesn't help
with the problem of texts with the same content having a different file
id/revision and being stored multiple times in the repository.

> > As far as I can tell there is nothing in the API that should really
> > require the file identifiers in these three cases to be identical, and
> > these three use cases have distinct requirements:
> > 
> > (1) and (2) strictly require file ids to be unique and refer to one
> > specific file/path.
> Absent (3), we'd probably just use the path for (1).  Using the path for
> (2) would mean that renaming files without changing their contents would
> take more space than it does with file-ids.
For (2), it doesn't necessarily have to be the path if we're not using a
file id - it could be a checksum, or perhaps even the file id of another
file in a parallel import. Whatever it is, it should be a repository
implementation detail not exposed at the higher level API / UI level.

I think what we use for (1) and (2) probably shouldn't be the same
thing, as long as there's a way to obtain the text contents from a tree
and a "file thing".
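
To make that split concrete, here is a minimal sketch, again in plain
Python and not bzrlib API. The class names TextStore and Tree and all of
their methods are hypothetical. The tree hands out files by path (one
possible "file thing" for (1)), while the text key for (2) is a SHA-1
checksum that never leaves the repository side.

```python
import hashlib


class TextStore:
    """Hypothetical repository text store keyed by content checksum."""

    def __init__(self):
        self._texts = {}

    def add(self, text):
        # The key is an implementation detail; callers only pass it back in.
        key = hashlib.sha1(text).hexdigest()
        self._texts[key] = text
        return key

    def get(self, key):
        return self._texts[key]

    def __len__(self):
        return len(self._texts)


class Tree:
    """Hypothetical tree: files are addressed by path, text keys stay hidden."""

    def __init__(self, store):
        self._store = store
        self._keys = {}  # path -> text key

    def put_file(self, path, text):
        self._keys[path] = self._store.add(text)

    def rename(self, old_path, new_path):
        # Only the path mapping changes; no new text is stored.
        self._keys[new_path] = self._keys.pop(old_path)

    def get_file_text(self, path):
        return self._store.get(self._keys[path])


store = TextStore()
tree_a = Tree(store)
tree_b = Tree(store)
tree_a.put_file("README", b"hello\n")
tree_b.put_file("docs/README", b"hello\n")  # e.g. a parallel import
tree_a.rename("README", "README.txt")
print(len(store), tree_a.get_file_text("README.txt"))
```

In this toy model a rename costs no extra text storage, and identical
content arriving through two different trees is stored once. It ignores
per-file history entirely, so it says nothing about how revision +
"file thing" lookups would work.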

> > (3) in theory doesn't, although the current
> > implementation probably does.
> I've long imagined that any identifier could be substituted in the merge
> code without too much trouble.
Does the current code support file copies - in other words, non-unique
file ids?

> > (2) requires file ids to be persistent across multiple runs of bzr. (3)
> > doesn't - (1) probably doesn't, depending on whether we allow external
> > users to access files by file id.
> > 
> > (2) seems like it could be a repository implementation detail, and not
> > something that needs to be exposed at the API level.
> The model should provide a way to refer to specific revisions of files,
> and revision + (1) is a pretty intuitive way to do that.
Agreed.

Cheers,

Jelmer