Thoughts on file ids

Mon May 9 15:09:02 UTC 2011

On Mon, 2011-05-09 at 09:42 -0400, Aaron Bentley wrote:
> On 11-05-08 07:53 AM, Jelmer Vernooij wrote:
> > On Fri, 2011-05-06 at 11:06 -0400, Aaron Bentley wrote:
> >> On 11-05-05 11:07 AM, Jelmer Vernooij wrote:
> >>> I wonder if it would make sense to have a process before transform
> >>> operations to find renames/copies - was that what you had in mind? Such
> >>> a process in its simplest form could just return the existing file ids.
> >> No, that wasn't something I had in mind.  Finding renames is one thing,
> >> but merge-across-copies, and the inverse, merge-across-joins, is evil
> >> and would require lots of work.
> >>
> >> I have thought about implementing merge-by-path, though.
> > What I mean is allowing a process before delta/transform operations that
> > assigns short-lived (i.e. only relevant to that action) file ids to each
> > file in the relevant trees.
> I think that might make sense, but it's also worth seeing if that could
> be merged with TreeTransform trans_ids, because they have a similar
> lifetime and purpose.
That was what I had in mind. My guess was that this was the reason
trans_ids were different from file ids in the first place, but I don't
actually know why they differ. I should also note that I have only
limited experience with bzrlib.transform and bzrlib.merge internals.

> > That sort of thing would allow the implementation of things like
> > merge-by-path, or other more advanced mechanisms (Git's algorithm of "if
> > X percent of two files matches, it's probably the same file"), without
> > affecting the storage layer.
> Sure, but you could also achieve this kind of thing by rewriting the
> file-ids in one of the trees, e.g. using a PreviewTree.
That would need some fancy hooks in "bzr merge" too though, and would
require a similar process to map the file ids. It's certainly an
option, but it seems like just mapping the ids would be simpler and
cleaner. It also doesn't eliminate any reliance on file ids during merge
operations.
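
As a rough illustration of what "just mapping the ids" could look like,
here is a minimal sketch in plain Python. It does not use bzrlib: the
trees are ordinary dicts of path -> text, and the function name, the
"tmp-N" id format and the threshold parameter are all made up for the
example. Files at the same path share an id (merge-by-path); if a
threshold is given, leftover files with sufficiently similar contents are
paired up as well, roughly in the spirit of Git's similarity heuristic.

```python
import difflib
import itertools


def map_ids(this_tree, other_tree, threshold=None):
    """Return (this_ids, other_ids): dicts mapping path -> transient id.

    Trees are plain dicts of path -> text (a hypothetical stand-in for
    real tree objects). The ids are only meaningful for one operation.
    """
    counter = itertools.count()
    this_ids = {path: "tmp-%d" % next(counter) for path in sorted(this_tree)}
    other_ids = {}

    # Files in this_tree that are not claimed by an identical path.
    free = [path for path in sorted(this_tree) if path not in other_tree]

    for path in sorted(other_tree):
        if path in this_ids:
            # Merge-by-path: same path, same id.
            other_ids[path] = this_ids[path]
            continue
        best, best_ratio = None, 0.0
        if threshold is not None:
            for candidate in free:
                ratio = difflib.SequenceMatcher(
                    None, this_tree[candidate], other_tree[path]).ratio()
                if ratio >= threshold and ratio > best_ratio:
                    best, best_ratio = candidate, ratio
        if best is not None:
            # Similar enough: treat it as the same file under a new path.
            free.remove(best)
            other_ids[path] = this_ids[best]
        else:
            other_ids[path] = "tmp-%d" % next(counter)
    return this_ids, other_ids
```

The point of the sketch is only that the matching policy lives in one
small function that runs before the delta/merge, and nothing about it
needs to be stored.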

> >> The tuples we use for versionedfiles are already repository
> >> implementation details, aren't they?
> > They are now, but that's a relatively recent change.
> Before now, they were part of the model?
Perhaps not really part of the model, but certainly part of the API.

> >> Mind you, there's also the per-file graph, which I don't think you've
> >> really discussed here.
> > I think the per-file graph should just be considered a sort of sparse
> > version of the revision graph.
> I'm not sure the per-file graph would survive the elimination of
> file-ids.  File-ids represent the idea that we know at commit time which
> files in a tree are comparable to which other files in another tree.  I
> think that if we can't encode that comparability at commit time, we
> can't have per-file *anything* encoded in a repository.  And
> establishing that comparability later could be very expensive.
The per-file graph is useful for finding out the relation between a file
and older incarnations of it. I don't see why that requires us to store
those relations up front rather than discovering them later in some
way. 

It might still be a good thing to store those relations explicitly as we
are doing now, but I don't see why being able to browse those relations
requires them to be stored up front.
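
To illustrate discovering the relations later, here is a minimal sketch
that derives a sparse per-file graph from the revision graph on demand.
It is plain Python (3.9+ for graphlib), not bzrlib code: the "parents"
and "texts" dicts are hypothetical in-memory stand-ins for a repository,
files are matched purely by path (so renames are ignored), and a revision
counts as having changed the file only if its text differs from the text
in every parent.

```python
from graphlib import TopologicalSorter


def per_file_graph(parents, texts, path):
    """Derive a sparse per-file graph for `path` from the revision graph.

    parents: dict of revid -> tuple of parent revids (roots map to ())
    texts:   dict of revid -> dict of path -> file text
    Returns a dict of revid -> sorted list of per-file parent revids,
    containing only the revisions that changed the file.
    """
    last_changed = {}  # revid -> set of revids that last changed the file
    graph = {}
    # static_order() yields parents before their children.
    for revid in TopologicalSorter(parents).static_order():
        text = texts[revid].get(path)
        if text is None:
            # File is absent in this revision.
            last_changed[revid] = set()
            continue
        same = [p for p in parents[revid] if texts[p].get(path) == text]
        if same:
            # Unchanged relative to at least one parent: inherit from it.
            last_changed[revid] = set().union(
                *(last_changed[p] for p in same))
        else:
            # Changed here: this revision becomes a node in the file graph.
            file_parents = set().union(
                *(last_changed[p] for p in parents[revid]))
            graph[revid] = sorted(file_parents)
            last_changed[revid] = {revid}
    return graph
```

This walks the whole ancestry and compares texts, which is exactly the
cost that storing the relations at commit time avoids; the sketch only
shows that the information can be recovered, not that doing so is cheap.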

Cheers,

Jelmer