Can "bzr annotate" be sped up?

Fri Jun 17 11:04:00 UTC 2011

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...
> Btw, a large portion of the time for xdisp.c (0m55s out of 1m20s) is
> spent in a phase that displays "extracting 0/2007" on the screen.
> This part is instantaneous for bidi.c, so the "extracting" thing just
> flashes on the screen and disappears right away.
> 
> Can "annotate" be sped up, please?
> 
...

> So: can "annotate" be made faster?
> 
> TIA
> 

Short answer, yes but not trivially.

I've spent a lot of time to make annotate as fast as can be with our
current data structures. As you noted, we're already significantly
faster than git, because we put time in optimizing it.

Basically we extract every revision of the text over its history, and
then do a comparison over time. The extraction is actually pretty fast,
but for your file with 2000+ revisions, that is ~2000 'diff' commands.
(probably more, because of how ancestry works.) I'm suprised the counter
stays at 0, though.

The code is meant to be pretty good about extracting as fast as it can,
and building up the annotations as it goes.

The ways to speed it up are generally in bugs tagged 'annotate' on
Launchpad. The primary one you would care about is:
 http://pad.lv/153787

There are 2 things that we could try to do going forward:

1) Annotate in 'reverse' mode. Forward mode is a bit more obvious. You
start at a rev, mark all lines as 'modified in rev 1'. Then grab the
second rev, and lines that match are left alone, lines which are new are
marked 'rev2'. Then on to rev3, etc.

In reverse mode, you start at rev3 and mark all lines as 'unknown'. Then
you compare with rev2. All lines that match are left as 'unknown', new
lines in rev3 are marked 'rev3'. That isn't particularly hard, the
trickier part is dealing with merge-parents. (Lines are new in the
current text vs the first parent, but that is because they came from the
second parent.)

Even harder are things like lines-introduced-in-multiple-parents
simultaneously. (You and I as different authors both commit a very
similar patch, and then later on we merge our changes together,
collapsing some of the changes, etc.

This can be faster by stopping once all lines have been annotated.

 a) In my testing, most files have at least 1 line from the very
    beginning of history. Usually it is one of the Copyright lines at
    the top of the file.
    So to get a full annotation generally means walking all of history
    anyway.

 b) It would help the most for "bzr log -r:annotate:file:12345" where
    you only care about a specific line. Since you can stop once you've
    settled on the revision introducing that line.

    Though again, multiple revisions *can* introduce the same text, so
    there are still bits of resolution that sometimes matter.

Overall, *I* haven't pursued this one because I didn't feel like the win
would be big. qannotate could use it to incrementally update the lines,
which could be pretty. But 'bzr annotate' itself always does all lines.

2) Caching. The current code is written with an Annotator that was meant
to allow in-memory caching (at least). The goal was to make "bzr
qannotate" the same speed to get the first annotation, but much faster
to get any other annotation. (for example cache every 100th revision, so
any given annotation only has to apply 100 texts, rather than 2000).

The next step in this was to serialize the annotation caches to disk. I
started on this for a while. I got a bit side-tracked on trying to make
it easier to write indexed storage. (Basically, take what we have in
Repository, and make it easier to write a collection of pack-files +
indexes for any sort of data.)

Other bits that caused feature bloat (and lack of development), were
wanting to support alternative annotation mechanisms. Such as "bzr
annotate --ignore-whitespace", or "bzr annotate --lefthand-only".

Also, trying to figure out if annotation caches could be
shared/propagated by push & pull. 'bzr annotate' used to be quite a bit
faster, because we stored the data in annotated form. (it was bad for
storage, good for annotate.)

3) A small win could be gained by changing the current logic slightly.
For every text, we build up a hash index, and then use that to make the
comparison logic fast. We could cache that hash index between
comparisons. It requires reworking the diff logic to expose that index.
Every text is generally diffed at least 2 times (once against its
parents, once against its children). However, the index building time is
not a major fraction of total time spent.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk37NKAACgkQJdeBCYSNAANgjQCfTabGJnRQCh712gt7m5Enrs4e
YT0An2xG24DZb9T/3Ayx7zHjR43xutY5
=a49l
-----END PGP SIGNATURE-----