Handling repeated text annotations

Wed Feb 13 23:07:30 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I found a really odd performance characteristic with annotations. It has to do
with how we resolve when two sides disagree on the last-modified information for
a line.

As the first example, start with bzrlib/repository.py it has 716 versions, and
248 of them have 2 parents. The *bulk* of the time is spent resolving the
annotation differences between the left and right lines.

So the weird part is this:
unchanged: 518332
new: 	      122
empty:      17803
left:        8495
right:      13234
tip:        17286

"unchanged" is the number of lines that left and right completely agree on. This
should certainly be the largest number since the bulk of the file doesn't change
with each revision.

"new" are lines that seem to exist in the middle of unmatched blocks, but are
otherwise identical between left and right. I thought it might be lines that are
considered new by both sides, but that doesn't really fit.

"empty" is a matching block that follows a previous matching block without any
lines inbetween. I assume this is "ABC" versus "AC". The A & C match, and we can
just pass the "B" lines across without worrying about any matches on the
unannotated text.

"left" are lines in blocks that didn't match between right and left that cannot
be assigned to right (left + empty should be the total number of lines that are
attributed to the left parent)

"right" are lines that come from the right parent

The important/odd one is "tip". These are lines that match their raw texts, but
have different annotations.

The concern I have is that there are far too many of them. It is sort of an
indication of when both sides of the branch have the same changes, but different
annotations. (people applying the same patch?) I think the problem is that it
propagates. So each time it gets merged we get a new tip. For example:

A
|\
B C
|\|\
D E F
| |/
G H
|/
I

If B and C introduce the same text, then it will get updated to be marked 'E'
when they are merged. So far not bad. Except when we get to 'H' it will get
marked with 'H' because 'E != F'. Then when we get to 'I' we will again mark it
with 'I' because 'B != H'.

I seem to remember we had a problem like this with our general merge algorithm,
and the only way we had to solve it was using a "heads()" check. So that you
would see that 'E' is descended from 'C', so 'H' should automatically resolve to
keep 'E', as should 'I'.

So why does this annotation churn matter? Because it ends up causing us to spend
a lot more time comparing lines that otherwise would be considered identical.

On a bigger file with 2.6k revisions and about 1.2k merges, I get:

unchanged: 8363598
new: 	      2667
empty:       77801
left:       243009
right:      294698
tip:       1385106

'time bzr annotate --show-ids' takes 1m4s.

If I change the resolution algorithm to always pick the left parent (rather than
using tip), the time drops to 23s.
So by themselves, these lines are causing our annotate algorithm to slow down by 3x.

I'll try using heads() though I'm not sure it is exposed easily down at this
level :(. I'm a little concerned that all the heads calls are going to kill
performance. But I guess we only have to do it for lines that seem to claim 2
parents.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHs3gyJdeBCYSNAAMRAkEkAJ9oexOGkuBTNe1CBGg7I6+PWsyDgACgzGIh
Fp4pbUJdUWq8ODoIB+45Gu8=
=JGm9
-----END PGP SIGNATURE-----