Speeding up reannotate
John Arbash Meinel
john at arbash-meinel.com
Tue Feb 12 14:08:26 GMT 2008
Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> John Arbash Meinel wrote:
>
>> ...
>
>>> Writing this down did bring up 1 possibility. After we've annotated
>>> with the left-hand parent, what if we do the diff versus the annotated
>>> right-hand parent.
>
>> I think I've implemented this, and it did indeed make it faster, but it
>> seems to be generating different results.
>
> It's surprising that it would we faster, because all that's needed to
> compare the two versions is a simple linear scan. And that has O(n)
> complexity, compared to Patience, which has O(n²) complexity.
>
>> I'm trying to figure out why, at the moment the best I can come up with
>> is that lines which used to be considered "common" and thus would not be
>> points to line-up, now have a chance to be considered unique, because
>> their annotation information is different.
>
> By doing sequence matching, you're ignoring the eqivalence info we
> already have. I'm still surprised you found a difference, though.
>
> Aaron
If you have a 1000 line file, and the left side changes 10 lines, and the right
side changes 20.
The existing code builds up a list of annotated lines based on the difference to
left parent, I call this the "left-annotated new lines". I haven't changed this
part.
It then annotated versus the right hand parent. Starting with a diff (1000x1000
lines), and then creating 1000 tuples pulling some from the right parent and
marking some as new.
Then it would do a linear scan across all 1000 lines, to compare the annotations
and resolve differences.
My first update changed the middle step to not build up right-annotation new
lines, instead it could do the diff against the plain text right parent, and
then for regions marked "unchanged" it would resolve left versus right, for
regions marked "new" it would just return the left-side annotations.
This helped a little bit, but the bulk of the lines are "unchanged" and require
resolving. (I think the big help here is not having to create all the tuples and
then strip them so that it can just compare the annotation info.)
The new version does a diff between the left-annotated new lines and the
annotated right parent. This is still a 1000x1000 diff. But now when it is done,
you end up with 900 "matching" lines, and only 30 "unmatched" ones. The 10 that
left introduced and the 20 that left should claim as "new" that have specific
annotations in the right parent.
So then you do 2 more diffs on the order of 10x10 and 20x20, and a linear scan
through those 30 lines.
So the reason it is faster is because both of them have to do a 1000x1000 node
comparison, but the new code only has to do a linear scan on the areas that are
different between the left-annotated new text and the right parent.
Basically, I'm trading O(file) linear scan with O(changes) diffs and O(changes)
linear scans.
It is arguable that this is a "better" annotation because it has more
information about each line. I know that Aaron was working on the concept of
line-identity, possibly by using an annotated line, and this is doing exactly that.
If someone can come up with another way to avoid doing a O(file) diff followed
by an O(file) linear scan, I'm happy to use it. This is just the best I could
come up with.
John
=:->
More information about the bazaar
mailing list