[ANNOUNCE] Example Cogito Addon - cogito-bundle
Linus Torvalds
torvalds at osdl.org
Fri Oct 20 21:57:27 BST 2006
On Fri, 20 Oct 2006, Aaron Bentley wrote:
>
> Agreed. We start by comparing BASE and OTHER, so all those comparisons
> are in-memory operations that don't hit disk. Only for files where BASE
> and OTHER differ do we even examine the THIS version.
Git just slurps in all three trees. I actually think that the current
merge-recursive.c does it the stupid way (ie it expands all trees
recursively, regardless of whether it's needed or not), but I should
really check with Dscho, since I had nothing to do with that code.
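The per-file three-way rule underneath both approaches can be sketched roughly like this (a hypothetical Python sketch of the general technique, not git's actual code; `merge_file`, `merge_trees` and the flat dict "trees" are inventions for illustration):

```python
def merge_file(base, this, other):
    """Resolve one path's three versions (None means the file is absent)."""
    if other == base:
        return this              # OTHER didn't touch it: THIS wins
    if this == base:
        return other             # THIS didn't touch it: OTHER wins
    if this == other:
        return this              # both sides made the identical change
    return ("CONFLICT", base, this, other)

def merge_trees(base, this, other):
    """Compare BASE and OTHER first; merge_file only runs where they differ."""
    merged = dict(this)
    for path in set(base) | set(other):
        if base.get(path) == other.get(path):
            continue             # unchanged on the OTHER side: keep THIS as-is
        result = merge_file(base.get(path), this.get(path), other.get(path))
        if result is None:
            merged.pop(path, None)   # deletion carried over into the result
        else:
            merged[path] = result
    return merged
```

For example, merging BASE `{"a": "1", "b": "2"}` with THIS `{"a": "1", "b": "2x"}` and OTHER `{"a": "1y", "b": "2"}` resolves to `{"a": "1y", "b": "2x"}`, and only "a" ever reaches the three-way comparison.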
I wrote a tree-level merger that avoided doing the recursive tree reading
when the tree-SHA1's matched entirely, and re-doing the latest merge using
that took all of 0.037s, because it didn't recursively expand any of the
uninteresting trees.
But the default recursive merge was ported from the python script that
did it a full tree at a time, so it's comparatively "slow". But it's fast
enough (witness the under-1s time ;) that I think the motivation to be
smarter about reading the trees just wasn't there, so my
"git-merge-tree" thing is languishing as a proof-of-concept.
So right now, git merging itself doesn't even take advantage of the "you
can compare two whole directories in one go". We do that all over the
place in other situations, though (it's a big reason for why doing a
"diff" between different revisions is so fast - you can cut the problem
space up and ignore the known-identical parts much faster).
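A rough sketch of why that cutting works (hypothetical Python, purely for illustration: real git compares the SHA-1s already stored in the tree objects, so identical subtrees are never even read, whereas this toy version recomputes hashes from nested dicts):

```python
import hashlib

def tree_hash(tree):
    """Content hash of a nested dict tree: {'file': b'...', 'dir': {...}}."""
    h = hashlib.sha1()
    for name in sorted(tree):
        entry = tree[name]
        child = (tree_hash(entry) if isinstance(entry, dict)
                 else hashlib.sha1(entry).hexdigest())
        h.update(f"{name}:{child}\n".encode())
    return h.hexdigest()

def changed_paths(a, b, prefix=""):
    """Diff two trees, pruning any subtree whose hashes already match."""
    if tree_hash(a) == tree_hash(b):
        return []                        # identical subtrees: skip entirely
    out = []
    for name in sorted(set(a) | set(b)):
        ea, eb = a.get(name), b.get(name)
        if isinstance(ea, dict) and isinstance(eb, dict):
            out.extend(changed_paths(ea, eb, prefix + name + "/"))
        elif ea != eb:
            out.append(prefix + name)
    return out
```

The recursion only descends into subtrees whose hashes differ, which is what keeps a whole-tree diff proportional to the size of the change rather than the size of the tree.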
That tree-based data structure turned out to be wonderful. Originally (as
in "first weeks of actual git work" in April 2005) git had a flat "file
manifest" kind of thing, and that really sucked. So the data structures
are important, and I think we got those right fairly early on.
> We can do a do-nothing kernel merge in < 20 seconds, and that's
> comparing every single file in the tree. In Python. I was aiming for
> less than 10 seconds, but didn't quite hit it.
Well, so I know I can do that particular actual merge in 0.037 seconds
(that's not counting the history traversal to actually find the common
parent, which is another 0.01s or more ;), so we should be able to
comfortably do the simple merges in less than a tenth of a second. But beyond
some point, apparently nobody really cares.
Of course, this kind of thing depends a lot on developer behaviour. We had
some performance bugs that we didn't notice simply because the kernel
didn't show any of those patterns, but people using it for other things
had slower merges. Sometimes you don't see the problem, just because you
end up looking at the wrong pattern for performance.
> > So recursive basically generates the matrix of similarity for the
> > new/deleted files, and tries to match them up, and there you have your
> > renames - without ever looking at the history of how you ended up where
> > you are.
>
> So in the simple case, you compare unmatched THIS, OTHER and BASE files
> to find the renames?
Right. Some cases are easy: if one of the branches only added files (which
is relatively common), that obviously cannot be a rename. So you don't
even have to compare all possible combinations - you know you don't have
renames from one branch to the other ;)
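That similarity-matrix pairing can be sketched like so (a hypothetical Python sketch; `detect_renames` and the greedy pairing are inventions for illustration - git's actual diffcore-rename works on hashed content chunks, not difflib):

```python
import difflib

def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted paths with added paths by content similarity.

    deleted/added: dicts mapping path -> text content.
    Returns (old_path, new_path, score) triples scoring above threshold.
    """
    # Build the similarity matrix for all deleted x added combinations.
    scores = []
    for old, old_text in deleted.items():
        for new, new_text in added.items():
            s = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if s >= threshold:
                scores.append((s, old, new))
    # Greedily take the best-scoring pairs, each path used at most once.
    scores.sort(reverse=True)
    renames, used_old, used_new = [], set(), set()
    for s, old, new in scores:
        if old not in used_old and new not in used_new:
            renames.append((old, new, s))
            used_old.add(old)
            used_new.add(new)
    return renames
```

A file deleted on one side and a near-identical file added on the other get matched up as a rename, without consulting history at all - exactly the "matrix of similarity" idea above.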
But I'm not even the authoritative person to explain all the details of the
current recursive merge, and I might have missed something. Dscho?
Fredrik? Anything you want to add?
Linus
More information about the bazaar mailing list