Thoughts on file ids

Mon May 16 15:39:58 UTC 2011

On Mon, 2011-05-09 at 11:34 -0400, Aaron Bentley wrote: 
> On 11-05-09 11:09 AM, Jelmer Vernooij wrote:
> > That was what I had in mind. My guess was that that was the reason the
> > trans_ids were different from file ids in the first place, but I have no
> > idea why they are actually different.
> 
> There are several reasons:
> 
> TreeTransforms need to refer to unversioned files.
> 
> TreeTransform operations can happen in any order, which means that in an
> intermediate stage, there may be duplicate file-ids.  However, if one of
> the duplicates is not deleted before attempting to apply the transform,
> this is considered a conflict.
Ah, that makes sense - thanks for the explanation.
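
To check my understanding, here is a minimal sketch of the two points
above. It is a toy model, not bzrlib's TreeTransform: the class name, the
method names and the "duplicate file id" label are all made up for
illustration. The only ideas taken from Aaron's explanation are that a
transform id can exist without a file id (an unversioned file), and that
duplicate file ids are tolerated mid-transform but count as a conflict if
they are still present when the transform is about to be applied.

```python
class ToyTransform:
    """Toy stand-in for a tree transform (hypothetical, not bzrlib API)."""

    def __init__(self):
        self._next = 0
        self._file_ids = {}    # trans_id -> file_id, or None if unversioned
        self._deleted = set()  # trans_ids scheduled for deletion

    def new_entry(self, file_id=None):
        """Allocate a trans_id; file_id=None models an unversioned file."""
        trans_id = "new-%d" % self._next
        self._next += 1
        self._file_ids[trans_id] = file_id
        return trans_id

    def delete(self, trans_id):
        """Mark an entry as deleted in the final tree."""
        self._deleted.add(trans_id)

    def find_conflicts(self):
        """Report file ids carried by more than one surviving entry."""
        seen = {}
        conflicts = []
        for trans_id, file_id in self._file_ids.items():
            if trans_id in self._deleted or file_id is None:
                continue
            if file_id in seen:
                conflicts.append(("duplicate file id", seen[file_id], trans_id))
            else:
                seen[file_id] = trans_id
        return conflicts


tt = ToyTransform()
old = tt.new_entry(file_id="readme-id")
scratch = tt.new_entry()                 # unversioned, still addressable
new = tt.new_entry(file_id="readme-id")  # duplicate, allowed for now
pending = tt.find_conflicts()            # one conflict: (old, new)
tt.delete(old)
resolved = tt.find_conflicts()           # empty once the duplicate is gone
```

Operations can be issued in any order here; only the state at the time of
the check matters.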

> >> Sure, but you could also achieve this kind of thing by rewriting the
> >> file-ids in one of the trees, e.g. using a PreviewTree.
> > That would need some fancy hooks in "bzr merge" too though
> I don't think so.  It's just a matter of preprocessing a tree before
> handing it into the main merge code.
As far as I can tell that still requires either a fancy hook in
cmd_merge or a custom command - how else would you do it?

> > It's certainly an
> > option but it seems like just mapping the ids would be simpler and
> > cleaner.
> Depends what you mean by "simple".  This approach is something you can
> implement without changing any of the core merge code, and without
> introducing any new concepts.
One of the points of having a hook that maps from file ids to transform
ids is to more explicitly decouple the file ids used in the Tree API
from the ids used in the transform operations. In other words, decoupling
uses (1) and (3) as described in my initial email.

> >> I'm not sure the per-file graph would survive the elimination of
> >> file-ids.  File-ids represent the idea that we know at commit time which
> >> files in a tree are comparable to which other files in another tree.  I
> >> think that if we can't encode that comparability at commit time, we
> >> can't have per-file *anything* encoded in a repository.  And
> >> establishing that comparability later could be very expensive.
> > The per-file graph is useful for finding out the relation between a file
> > and older incarnations of it. I don't see why that requires us to store
> > those relations up front rather than discovering them later in some
> > way. 
> > 
> > It might still be a good thing to store those relations explicitly as we
> > are doing now, but I don't see why being able to browse those relations
> > requires them to be stored up front.
> Without storing them up front, I think generating the graph would be too
> expensive.
It's what git does, and it works reasonably well there. git doesn't even
have the file revisions in an inventory to help it; it has to actually
load the entire tree and compare trees to find similarities.
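
As a rough illustration of discovering these relations after the fact,
the sketch below pairs paths that disappeared from one tree with
similar-looking paths that appeared in another. It is not git's actual
algorithm: the trees are plain dicts of path to text, the function name
and the 0.6 threshold are arbitrary choices of mine, and Python's difflib
stands in for whatever similarity measure a real implementation would use.

```python
from difflib import SequenceMatcher


def guess_renames(old_tree, new_tree, threshold=0.6):
    """Guess renames between two trees given as {path: text} dicts.

    Hypothetical helper: compares every removed path against every added
    path and keeps the best-scoring pairs above the threshold.
    """
    removed = [p for p in old_tree if p not in new_tree]
    added = [p for p in new_tree if p not in old_tree]

    # Score every (removed, added) pair by content similarity.
    candidates = []
    for old_path in removed:
        for new_path in added:
            matcher = SequenceMatcher(None, old_tree[old_path], new_tree[new_path])
            score = matcher.ratio()
            if score >= threshold:
                candidates.append((score, old_path, new_path))

    # Greedily accept the most similar pairs first, using each path once.
    renames = {}
    used = set()
    for score, old_path, new_path in sorted(candidates, reverse=True):
        if old_path not in renames and new_path not in used:
            renames[old_path] = new_path
            used.add(new_path)
    return renames


old = {"README": "hello world\nthis is a readme\n", "setup.py": "import os\n"}
new = {
    "README.txt": "hello world\nthis is the readme\n",
    "setup.py": "import os\n",
    "NEWS": "0.1\n0.2\n0.3\n",
}
renames = guess_renames(old, new)  # {"README": "README.txt"}
```

The nested loop is the expensive part: the work grows with the number of
removed paths times the number of added paths, and it has to be repeated
for each pair of revisions being compared.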

There are of course corner cases where guessing performs less than
optimally, e.g. if you have files that rarely get modified but a history
with a lot of revisions (and I'm not arguing we shouldn't be storing the
file graph), but storing it up front is certainly not a strict requirement.

Cheers,

jelmer 