[MERGE] Faster diff on historical data

Thu Aug 9 18:50:39 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lukáš Lalinský wrote:
> On St, 2007-08-08 at 13:47 -0400, Aaron Bentley wrote:
>>> People usually use such diffs only on nearby
>>> revisions, so the knit extraction involves a lot of duplicated work,
>>> e.g. instead of taking the file at revision 10 and applying one delta,
>>> it takes revision 1 and unpacks/applies 10 deltas.
>> It's worse than that, because those intermediate deltas could be used to
>> speed up the computation of the desired delta.  There's discussion of
>> this in my diff analysis: doc/developers/diff.txt
>>
>> That document slightly predates your C sequence matcher, so I'm not
>> clear whether the performance enhancement from reusing those comparisons
>> will be an advantage when C sequence matching is merged.
> 
> I've tried profiling this with and without the C sequence matcher, and
> the times are quite interesting (this is for the second blender diff):
> 
> Python - get_grouped_opcodes - 2150 ticks
> C      - get_grouped_opcodes - 105 ticks
> 
> I'm really not sure if combining the line deltas from knits using Python
> would be any faster. But I already have some code for it, so I'll try.
> 
>>> The attached patch lets it to extract both files in one go. It isn't a
>>> big win, but it makes diff -r before:x..x on bzr.dev in my repository
>>> about 0.1 seconds faster (~10% of the total diff time).
>> Doing this for a 0.1 second improvement doesn't seem worthwhile.  Can
>> you get numbers for a larger project, where the impact is likely to be
>> bigger?
> 

If I'm reading your numbers correctly, it seems that:

a) cpatience has a bigger effect that fastdiff
b) the effects do stack
c) fastdiff has a bigger effect when you have more history (as I would
expect).

I think fastdiff would have less of an effect after my
"faster_knit_extract" and "pyrex_knit_extract" code gets merged. Because
it speeds up the time to extract data from a knit. But I only optimized
the single-text case at the moment. It modifies get_lines() but not
get_line_list(). At one point I had written it to affect both, but I
could make the single-file case even better by special casing it, and at
the moment, it is very rare that we grab more than 1 text at a time. So
it made an overall smaller patch to just do the special case.

Actually, with my patch, but without changing get_line_list() I could
argue that you might see a net loss from 'fastdiff'. It would be pretty
situation dependent, though.

The big reason for the difference is that my code is special casing when
you want just the text from the knit, and don't want any annotations.
But the current code uses _get_component_maps() which has the annotation
information as part of its api. I had written a _get_text_map() which
didn't require that, but it had some redundancy, and was still less
efficient than what you could do by just special casing a single file text.

The specific difference is that the code for grabbing multiple texts
caches all the intermediate steps (as lists, so it isn't *terribly*
inefficient, just inefficient). Because of this caching, it has to
create a bunch of lists, rather than just modifying one list in place.

Oh, and the reason I wrote pyrex code for extracting the knits and not
for applying the line deltas, is that for the bzr codebase, we can apply
3.3k deltas in 330ms (<1ms each). But we spend 2.1s to extract all the
fulltexts and linedeltas from disk. (The pyrex code drops that to 0.6s).
Interestingly enough it seems very much based on how many lines we have
to extract. Because we spend approx the same amount of time extracting
650 fulltexts as we spend extracting 3.3k line deltas.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGu1PvJdeBCYSNAAMRAqtQAKCawM0aMRQuAf+SmhKRrX4s66om6wCeKAxT
AvhLMfgqEliWsRSPQXRYFx8=
=mA+a
-----END PGP SIGNATURE-----