[RFI] Annotations cache & progressive display of annotations

Mon Feb 15 14:37:09 GMT 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> The MySQL guys recently raised the fact that annotate is slower in 2a
> format than pack-0.92. Talking to poolie last Friday, this looks like
> something I could help improve in coming weeks.
> 
> Before I look into it though, I'd like to start from the hard-earned
> knowledge and great ideas of others. I'm sure John, Robert, Vincent and
> others know 10X (well 100X) more than I do in this area. Any deep
> insights, tips or branches I should start from?
> 
> I'd also like to hear from end users who use annotate frequently. How
> often are you using the GUI vs command-line to view annotations? Any
> thoughts on "lazy-loading" of the data into the GUI so we can begin
> showing some annotations immediately?
> 
> Ian C.

"lazy-loading..."

My primary feeling is that it isn't worthwhile for 2 reasons:

1) You have to work backwards, and it isn't always obvious how to make
that work. Forward annotation is more 'obvious'. Not to say you couldn't
work out the logic, just that it will be more complex.
2) In my testing, almost every file has a Copyright header whose source
goes all the way back to the first revision of the file. So you have to
grab the whole ancestry anyway. (Otherwise you could stop as soon as all
lines were annotated.)

Mostly because of (2), I think a cache is really the only way to make it
reasonable. *If* we supported just annotating a region of a file, then
(1) could be a reasonable win. (People tend to care about recently
introduced text.)
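
A minimal sketch of the backwards walk and its early stop, to make point
(2) concrete. It assumes a purely linear history and uses difflib from
the standard library as the line matcher; it is not the bzrlib
annotation code, and annotate_backwards is a hypothetical name.

```python
from difflib import SequenceMatcher


def annotate_backwards(history):
    """Annotate the newest text by walking a linear history backwards.

    history is a list of (revision_id, lines), oldest first (assumed
    layout). Returns (annotations, revisions_visited).
    """
    tip_lines = history[-1][1]
    annotations = [None] * len(tip_lines)
    # Maps a line index in the version being examined to its index in the tip.
    pending = {i: i for i in range(len(tip_lines))}
    visited = 0
    for idx in range(len(history) - 1, -1, -1):
        rev_id, lines = history[idx]
        visited += 1
        if idx == 0:
            # No parent left: everything still pending came from the first revision.
            for tip_i in pending.values():
                annotations[tip_i] = rev_id
            break
        parent_lines = history[idx - 1][1]
        matcher = SequenceMatcher(None, parent_lines, lines, autojunk=False)
        carried = {}
        for a, b, size in matcher.get_matching_blocks():
            for off in range(size):
                if b + off in pending:
                    carried[a + off] = pending.pop(b + off)
        # Lines with no match in the parent were introduced by this revision.
        for tip_i in pending.values():
            annotations[tip_i] = rev_id
        pending = carried
        if not pending:
            break  # every tip line is annotated, so older revisions are not needed
    return annotations, visited


with_header = [
    ("r1", ["# Copyright", "a"]),
    ("r2", ["# Copyright", "a", "b"]),
    ("r3", ["# Copyright", "a2", "b", "c"]),
]
print(annotate_backwards(with_header))
```

With the header line present, the walk has to visit all three revisions.
If the tip shares no lines with its parent, the loop stops after one.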

Annotator is my start at providing a place to attach cache information.
It already caches the intermediate steps, etc. It is also where the
logic resides to determine what revisions are needed to annotate a file,
etc. (This is what qannotate should be using to make it cheaper to walk
back through the history of the file, rather than using annotate_iter().)

The idea was that it could interact with a disk cache, to create an
annotation cache every X revisions. With a little bit of logic, you
could also determine what the most efficient revisions to cache would
be. If you have:
  A
  |\
  B C
  |/
  D

Caching B isn't worth as much as caching A, since you would need to get
A to annotate D anyway. (Basically, cache dominators, or all
revisions at a certain generation, etc.)
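
One way to pick such revisions can be sketched as follows. This is a toy
under stated assumptions, not bzrlib code: the graph is a plain dict
from revision to parent list, bottleneck_revisions is a hypothetical
name, and "worth caching" is approximated as "lies on every path from
the tip back to a root", found by counting paths.

```python
def bottleneck_revisions(parents, tip):
    """Return ancestors of tip that every tip-to-root path passes through.

    parents maps revision -> list of parent revisions (assumed to be a DAG).
    Uses recursion, so it is only suitable for small example graphs.
    """
    # Collect the ancestry of tip and build the reverse (child) map.
    seen = set()
    children = {}
    stack = [tip]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        children.setdefault(rev, [])
        for parent in parents.get(rev, []):
            children.setdefault(parent, []).append(rev)
            stack.append(parent)

    to_root = {}

    def paths_to_root(rev):
        # Number of distinct paths from rev down to any root.
        if rev not in to_root:
            ps = parents.get(rev, [])
            to_root[rev] = 1 if not ps else sum(paths_to_root(p) for p in ps)
        return to_root[rev]

    from_tip = {}

    def paths_from_tip(rev):
        # Number of distinct paths from tip to rev.
        if rev not in from_tip:
            if rev == tip:
                from_tip[rev] = 1
            else:
                from_tip[rev] = sum(paths_from_tip(c) for c in children[rev])
        return from_tip[rev]

    total = paths_to_root(tip)
    return {
        rev
        for rev in seen
        if rev != tip and paths_from_tip(rev) * paths_to_root(rev) == total
    }


# The graph drawn above: D merges B and C, which both descend from A.
graph = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
print(bottleneck_revisions(graph, "D"))
```

For the graph above this selects A and skips B and C, since each of
those sits on only one of the two paths from D.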

The code should already be factored out pretty well between what the
annotation is, and what the lines are.

If I was doing it, I would create essentially another 'text' store based
on a bunch of pack files, and store the annotation information as just
regular texts in a Groupcompress block. I would key it based on
(file_id, revision_id, flags). The idea of flags is that you can do
stuff like 'ignore_whitespace'. For an initial implementation, we could
just leave the flags empty, but I'd like to reserve a spot for it.
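
A small sketch of what such a key could look like. The function name and
the comma-joined flag encoding are assumptions for illustration only,
and 'mainline_only' is a made-up flag name; only the three-part key and
'ignore_whitespace' come from the description above.

```python
def annotation_key(file_id, revision_id, flags=()):
    """Build a (file_id, revision_id, flags) cache key.

    Flags are de-duplicated and sorted so the same set of options always
    yields the same key. An empty flags slot is the initial default.
    """
    return (file_id, revision_id, ",".join(sorted(set(flags))))


# A plain dict stands in for the on-disk store in this sketch.
cache = {}
cache[annotation_key("file-id-1", "rev-id-9")] = ["rev-id-3", "rev-id-9"]

print(annotation_key("file-id-1", "rev-id-9"))
print(annotation_key("file-id-1", "rev-id-9", ["mainline_only", "ignore_whitespace"]))
```

Reserving the third slot from the start means entries written with empty
flags never collide with entries written later under other options.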

Possible flags:
  1) mainline only (tell me who *landed* the patch, not who created it),
     also allows us to only extract and diff mainline texts
  2) ignore whitespace
  3) collision resolution (both lines introduced X, which one 'wins').
     However, the current annotation code is capable of tracking
     multiple sources for a given line, and if we save that to the
     annotation cache, then we don't need a flag for it. (collisions are
     currently resolved as the last step, rather than early.)
  4) pickaxe/track moved lines. Arguably we'd like to do something like
     this inside of a file regardless. But potentially we could do the
     'git' thing, and say "in revision X, this file was modified, what
     other files were modified and could this content have come from
     there?" A bit expensive to compute, and needs a change in our
     annotation handling to have a way to *present* that these lines
     came from that file, rather than just 'this revision'.
  5) Similar to (4), also track copied code. Much more expensive,
     because you can copy code from a file that did not otherwise
     change.
  6) Change the annotator matcher to allow moved content to be tracked,
     needs a way to think about how it is presented. Example:
     A    A
     B => C
     C    B

     Right now one of B and C will be tracked as associated, and the
     other will look like newly introduced content. This is because
     PatienceDiff is essentially a 'linear' matcher.
     However, if you go to the extreme with finding matches, then it
     seems like moving 'B' ends up as a 'no-op' for annotation, which
     doesn't seem right either.
     One option is 'edge' matching. Which would say that changing "AB"
     to "AC" is the action, and thus gets the annotation. However,
     nobody thinks in terms of "when was this edge introduced", so I
     think it would be confusing.
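
The 'linear' matcher behaviour in (6) can be reproduced with a small
sketch. It uses difflib.SequenceMatcher from the standard library as a
stand-in for PatienceDiff (a different algorithm, but also one that only
produces in-order matches); matched_new_lines is a hypothetical helper.

```python
from difflib import SequenceMatcher


def matched_new_lines(old, new):
    """Return the indices of lines in new that the matcher pairs with old."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    matched = set()
    for _a, b, size in matcher.get_matching_blocks():
        matched.update(range(b, b + size))
    return matched


old = ["A", "B", "C"]
new = ["A", "C", "B"]
matched = matched_new_lines(old, new)
for i, line in enumerate(new):
    status = "carried over" if i in matched else "looks newly introduced"
    print(line, status)
```

Because matching blocks must appear in the same order in both texts, at
most one of B and C can be paired, so the other is reported as new even
though its content did not change.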

The problem with using a pack-based disk cache of annotation information
is that you probably end up re-implementing a fair amount of the
autopack, etc. logic. Ask Robert what he did with 'bzr-search', which
has similar
constraints. I thought he was working on factoring out some helper code.
We might be able to land some of that in bzrlib, and then have
bzr-search and annotate share it. (Also, he was working on creating a
single '.pack' which includes the indices at the end of the file, which
also works well with this situation.)

One thing that hung me up in the past... If creating these annotations
takes a while, it would be nice to be able to share them. So that 'bzr
branch' and/or 'bzr pull' could share the annotations. We *could* put
the information back into the regular pack files (adding an index for
it), and transmit it with fetch. (say when fetching revision 'X' fetch
all annotations with that revision present.)  However, it complicated
things a bit too much, and seemed better to just get an annotation cache
working, rather than worrying about fetch properties as well.

I probably have some more thoughts rolling around, having worked on the
problem 3 or 4 times now. My wife is getting anxious for me to actually
stop working on my day off :).

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkt5XBUACgkQJdeBCYSNAAPr6wCcDfYdRUHZtL5UE4uzB2FlaTsT
wGAAoIctlmJrc1UdV1RcLcDZC2QSMGit
=lA9f
-----END PGP SIGNATURE-----