Speeding up reannotate

Tue Feb 12 14:08:26 GMT 2008

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> John Arbash Meinel wrote:
> 
>> ...
> 
>>> Writing this down did bring up 1 possibility. After we've annotated
>>> with the left-hand parent, what if we do the diff versus the annotated
>>> right-hand parent.
> 
>> I think I've implemented this, and it did indeed make it faster, but it
>> seems to be generating different results.
> 
> It's surprising that it would we faster, because all that's needed to
> compare the two versions is a simple linear scan.  And that has O(n)
> complexity, compared to Patience, which has O(n²) complexity.
> 
>> I'm trying to figure out why, at the moment the best I can come up with
>> is that lines which used to be considered "common" and thus would not be
>> points to line-up, now have a chance to be considered unique, because
>> their annotation information is different.
> 
> By doing sequence matching, you're ignoring the eqivalence info we
> already have.  I'm still surprised you found a difference, though.
> 
> Aaron

If you have a 1000 line file, and the left side changes 10 lines, and the right 
side changes 20.

The existing code builds up a list of annotated lines based on the difference to 
left parent, I call this the "left-annotated new lines". I haven't changed this 
part.
It then annotated versus the right hand parent. Starting with a diff (1000x1000 
lines), and then creating 1000 tuples pulling some from the right parent and 
marking some as new.

Then it would do a linear scan across all 1000 lines, to compare the annotations 
and resolve differences.

My first update changed the middle step to not build up right-annotation new 
lines, instead it could do the diff against the plain text right parent, and 
then for regions marked "unchanged" it would resolve left versus right, for 
regions marked "new" it would just return the left-side annotations.

This helped a little bit, but the bulk of the lines are "unchanged" and require 
resolving. (I think the big help here is not having to create all the tuples and 
then strip them so that it can just compare the annotation info.)

The new version does a diff between the left-annotated new lines and the 
annotated right parent. This is still a 1000x1000 diff. But now when it is done, 
you end up with 900 "matching" lines, and only 30 "unmatched" ones. The 10 that 
left introduced and the 20 that left should claim as "new" that have specific 
annotations in the right parent.

So then you do 2 more diffs on the order of 10x10 and 20x20, and a linear scan 
through those 30 lines.

So the reason it is faster is because both of them have to do a 1000x1000 node 
comparison, but the new code only has to do a linear scan on the areas that are 
different between the left-annotated new text and the right parent.

Basically, I'm trading O(file) linear scan with O(changes) diffs and O(changes) 
linear scans.

It is arguable that this is a "better" annotation because it has more 
information about each line. I know that Aaron was working on the concept of 
line-identity, possibly by using an annotated line, and this is doing exactly that.

If someone can come up with another way to avoid doing a O(file) diff followed 
by an O(file) linear scan, I'm happy to use it. This is just the best I could 
come up with.

John
=:->