[ANNOUNCE] Example Cogito Addon - cogito-bundle
Linus Torvalds
torvalds at osdl.org
Fri Oct 20 20:46:39 BST 2006
On Fri, 20 Oct 2006, Aaron Bentley wrote:
>
> Linus Torvalds wrote:
> > Git goes one step further: it _really_ doesn't matter about how you got to
> > a certain state. Absolutely _none_ of what the commits in between the
> > final stages and the common ancestor matter in the least. The only thing
> > that matters is what the states at the end-point are.
>
> That's interesting, because I've always thought one of the strengths of
> file-ids was that you only had to worry about end-points, not how you
> got there.
>
> How do you handle renames without looking at the history?
You first handle all the non-renames that just merge on their own. That
takes care of 99.99% of the stuff (and I'm not exaggerating: in the
kernel, you have ~21000 files, and most merges don't have a single rename
to worry about - and even when you do have them, they tend to be in the
"you can count them on one hand" kind of situation).
Then you just look at all the pathnames you _couldn't_ resolve, and that
usually cuts the problem down to something where you can literally afford to
spend a lot of CPU power per file, because now you only have a small number
of candidates left.
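To make that concrete, the first pass is really nothing more than a three-way
comparison per path. Here's a rough Python sketch of the idea - the function
name and the dict-based tree model are made up for illustration, this is not
git's actual code:

def trivial_merge(base, ours, theirs):
    """base/ours/theirs: dicts mapping path -> content hash (None if absent)."""
    merged = {}
    unresolved = []          # the few paths that need real per-file work
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:           # both sides agree (same change, or both deleted)
            if o is not None:
                merged[path] = o
        elif o == b:         # we didn't touch it: take their version (or deletion)
            if t is not None:
                merged[path] = t
        elif t == b:         # they didn't touch it: take our version (or deletion)
            if o is not None:
                merged[path] = o
        else:                # changed differently on both sides
            unresolved.append(path)
    return merged, unresolved

Everything that falls into the first three branches never needs a content
merge at all; only the last bucket is interesting.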
If you were to spend one hundredth of a second per file, regardless of
whether the file even changed, a stupid per-file merge of ~21000 files would
take 210 seconds, which is just unacceptable. So you really don't want to do
that. You want to merge whole
subdirectories in one go (and with git, you can: since the SHA1 of a
directory defines _all_ of the contents under it, if the two branches you
merge have an identical subdirectory, you don't need to do anything at
_all_ about that one. See?).
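To see why identical subdirectories are free, here's a toy sketch of the
pruning - the Tree class is an assumed stand-in for git's tree objects, not
the real data structure, and it only compares the two branches (the common
ancestor prunes the same way):

import hashlib

class Tree:
    """Toy directory object: its id covers everything underneath it."""
    def __init__(self, entries):
        self.entries = entries          # name -> Tree (subdir) or bytes (file body)
        payload = "".join(f"{name}:{self.oid_of(node)}\n"
                          for name, node in sorted(entries.items()))
        self.oid = hashlib.sha1(payload.encode()).hexdigest()

    @staticmethod
    def oid_of(node):
        if node is None:
            return None
        return node.oid if isinstance(node, Tree) else hashlib.sha1(node).hexdigest()

def paths_to_examine(ours, theirs, prefix=""):
    """Return only the paths the merge actually has to look at."""
    if ours.oid == theirs.oid:
        return []                       # identical subtrees: skip the whole thing
    todo = []
    for name in sorted(set(ours.entries) | set(theirs.entries)):
        o, t = ours.entries.get(name), theirs.entries.get(name)
        if isinstance(o, Tree) and isinstance(t, Tree):
            todo += paths_to_examine(o, t, prefix + name + "/")
        elif Tree.oid_of(o) != Tree.oid_of(t):
            todo.append(prefix + name)  # differs (or added/deleted): needs attention
    return todo

The recursion returns immediately whenever the two ids match, so a merge that
only touched one subdirectory never even looks at the rest of the ~21000
files.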
So instead of trying to be really fast on individual files and doing them
one at a time, git makes individual files basically totally free (you
literally often don't need to look at them AT ALL). And then for the few
files you can't resolve, you can afford to spend more time.
So say that you spend one second per file-pair because you do complex
heuristics etc - you'll still have a merge that is a _lot_ faster than
your 210-second one.
So the recursive merge basically generates the matrix of similarity for the
new/deleted files, and tries to match them up, and there you have your
renames - without ever looking at the history of how you ended up where
you are.
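A hypothetical sketch of that matching step, using difflib's ratio as a
stand-in for git's own similarity score (the threshold and the greedy pairing
below are assumptions for illustration, not git's exact heuristics):

from difflib import SequenceMatcher

def detect_renames(deleted, added, threshold=0.5):
    """deleted/added: dicts mapping path -> file text. Returns {old_path: new_path}."""
    # Score every (deleted, added) pair: a small matrix, affordable precisely
    # because the trivial pass already got rid of almost every file.
    scores = []
    for old, old_text in deleted.items():
        for new, new_text in added.items():
            sim = SequenceMatcher(None, old_text, new_text).ratio()
            if sim >= threshold:
                scores.append((sim, old, new))
    renames, used_old, used_new = {}, set(), set()
    for sim, old, new in sorted(scores, reverse=True):   # best matches win first
        if old not in used_old and new not in used_new:
            renames[old] = new
            used_old.add(old)
            used_new.add(new)
    return renames

So detect_renames({"a.c": old_text}, {"b.c": new_text}) pairs a.c with b.c
only if their contents are similar enough, and the matrix stays tiny because
it only ever contains the leftover paths.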
Btw, that "210 second" merge is not at all unlikely. Some of the SCM's
seem to scale much worse than that to big archives, and I've heard people
talk about merges that took 20 minutes or more. In contrast, git doing a
merge in ~2-3 seconds for the kernel is _normal_.
[ In fact, I just re-tested doing my last kernel merge: it took 0.970
seconds, and that was _including_ the diffstat of the result - though
obviously not including the time to fetch the other branch over the
network.
I don't know if people appreciate how good it is to do a merge of two
21000-file branches in less than a second. It didn't have any renames,
and it only had a single well-defined common parent, but not only is
that the common case, being that fast for the simple case is what
_allows_ you to do well on the complex cases too, because it's what gets
rid of all the files you should _not_ worry about ]
Performance does matter.
Linus