Telling if two trees are different

Fri Oct 24 10:34:54 BST 2008

Robert Collins writes:

 > A few points here - in git the content blobs are the encoded blobs, not
 > the content-only.
 > 
 > That is there is a header + content for anything in the database.

True.  The header contains the object type and its size.

 > So for a file X, it can be in the database many times, with different
 > content pointers - the object pointer will say 'not equal' - even though
 > the file content is identical.

I don't think so.  Two objects have the same SHA1 "name" if and only
if they have the same type and content.  That's the fundamental
invariant of git, and the source of its speed.  The type is constant
for files ("blob") and the size will be the same for identical
content.

 > Secondly, the cost of translating a bzr inventory into any other
 > representation is _always_ much greater than that of doing
 > revtree1.iter_changes(revtree1).next()

True, but I'm suggesting keeping an auxiliary index, not calculating
it on the fly.

 > By contrast, that python code fragment will use the best facilities of
 > the underlying trees (its a multimethod). The trees in my repository
 > branch are similar in nature to gits directory storage, in that there
 > are multiple documents to represent a single tree's
 > list-of-files-and-dirs, compared to hg's manifests and [mainline] bzr's
 > inventories. 

That should be fast, indeed.

 > I haven't yet written the iter_changes optimiser for these trees, but it
 > will have similar work to do as git does to compare two trees (though I
 > hope it will have a better scaling factory due to balancing its own tree
 > rather than following directory boundaries.)

In practice this is not going to matter.  Humans like balanced trees,
too.