Speeding up reannotate

Mon Feb 11 22:00:29 GMT 2008

I'm currently working on making annotate faster, and so far I've had decent 
success. I'm trying to work on the current bottleneck, and I was hoping someone 
might have an idea of a trick I could use.

By reducing the index pressure (as Robert noted), I cut the annotate time in 
half (from about 4s => 1.6s). I'm currently trying to chip away at that 
remaining time. I cut out some of it by working off of the read_records_iter() 
stream, rather than buffering all the texts, and then iterating through them.

Right now, it seems the most expensive single operation is when a node has 2 
parents, and we have to re-annotate against the right-hand parent.

Surprisingly, doing the diff is not the expensive time. Under --lsprof, 
_reannotate is taking 3.3s, and we only spend .28s in 'get_matching_blocks'. 
(under regular time.clock() it seems to be .6s in reannotate, versus .28s in 
get_matching_blocks)

The expensive time seems to be 555,000 calls to list.append().

The problem seems to be that reannotate has to spend a lot of time dealing with 
lines that have not changed, and I'm trying to figure out if we can cut that 
down significantly.

I'll try to give an overview of what it is doing, to see if it sheds any light 
on a different solution.

We start with an unannotated text, and the annotated texts for both parents. We 
first compare the text with the left hand parent and update the annotation for 
any lines that are considered "new". The comparison can be extracted from a knit 
delta. This is nice in that it calls .append() based on the number of *changed* 
lines.

Then we compare the new lines with the right hand parent (this is never cached). 
For lines that are marked as "new" we just return the left annotation. The 
problem is that for all lines that *match* we then do a line-by-line comparison 
to the left annotation.
If the annotations agree, then we just return one side.
If left claims the line is 'new', then we return the right annotation.
If both claim the line is old, but disagree, then we mark the line as new.

The problem is that *most* of the time lines are unchanged, so the "agree" 
portion happens about 25x more often than the other two. But we are still 
checking the lines one by one.

The issue is that we have to check all of the lines that are considered 
*unchanged* wrt the right-hand parent. This scales with the number of lines in 
the file, and not the number of lines changed.

Writing this down did bring up 1 possibility. After we've annotated with the 
left-hand parent, what if we do the diff versus the annotated right-hand parent.
Then for blocks which match, we can just skip over them. For blocks that *don't* 
match, then we do another diff on the unannotated lines for that region. Matches 
and misses then get labeled according to the previous resolution rules.

I'll give it a shot, but if someone has another idea, let me know.

John
=:->