[patch] fix bug 65714 bzr 0.11 takes an inordinate amount of time to process a commit

John Arbash Meinel john at arbash-meinel.com
Tue Oct 31 17:40:11 GMT 2006


Cheuksan Edward Wang wrote:
> 
> Replace python's difflib by patiencediff because the worst case
> performance is cubic for difflib (get_matching_blocks) and people
> committing large data
> files are often hurt by this. The worst case performance of patience is
> quadratic. Fix bug 65714.
> 
> If there is an easy way to run benchmarks, can someone tell me how to
> run them? I want to see the change in commit time and size of the knits.
> 
> Neither difflib nor patience give the minimal diffs. If this is not an
> improvement in time or space, I can try several other algorithms.
> 
> Cheuksan Edward Wang
> 
> 

If we are going to stop using KnitSequenceMatcher it probably needs to
be deprecated (say, issue a deprecation warning during __init__). That
at least makes it clear that we aren't really supporting it right now,
so anyone who still wants it is alerted that it is no longer actively
used.

I was originally thinking to just do:

KnitSequenceMatcher = patiencediff.PatienceSequenceMatcher

But either way is fine.
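For the warning approach, a minimal sketch using the plain stdlib warnings
module (bzrlib's own deprecation helpers may differ; the message text and
constructor signature here are illustrative only):

```python
import warnings

class KnitSequenceMatcher:
    """Sketch of a deprecation shim; kept only for compatibility."""

    def __init__(self, isjunk=None, a='', b=''):
        # Alert callers at construction time that this class is
        # no longer the supported matcher.
        warnings.warn(
            "KnitSequenceMatcher is deprecated; use "
            "patiencediff.PatienceSequenceMatcher instead.",
            DeprecationWarning,
            stacklevel=2)
        self.isjunk = isjunk
        self.a = a
        self.b = b
```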

Did you work out what happens with patience diff in the case of no
matching lines? Earlier you seemed to think it didn't have an upper bound.

Also, I'm thinking we could change recurse_matches() to default to a
shorter recursion depth. If we have to recurse more than 3 or 4 deep, I
don't think we are really getting useful information, so I'd rather we
just stop earlier.

IIRC, it only recurses when it finds new unique lines that it didn't
find in the outer search. For example:

v- text 1
   v- Text 2
A  A
B  B
1  2
C  C
D  D
B  B
3  4
C  C
E  E

This would first find A, D, E since they are unique in the whole set,
then it would recurse and find BC in the first group, and BC in the
second group. If you have something like:

A  A
B  C
D  E
F  G
H  H

Then I think the first pass would find A H, and then recurse and not
find any unique matches. I don't know if it is worth updating the
algorithm to track non-unique lines at the same time, so you would know
if it is worth recursing at all.
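The behaviour described above can be sketched as a toy matcher: pair up
lines that are unique in both ranges, then recurse into the gaps between
anchors, with a hypothetical max_depth cap of the kind suggested earlier.
This is a simplification, not the bzrlib implementation; the real
recurse_matches() also runs a longest-increasing-subsequence step to keep
anchors in order, and handles junk lines.

```python
from collections import Counter

def unique_matches(a, b, alo, blo, ahi, bhi):
    """Pair lines occurring exactly once in each range (the anchors
    patience diff matches first). Sketch: skips the LIS ordering step."""
    count_a = Counter(a[alo:ahi])
    count_b = Counter(b[blo:bhi])
    pos_b = {line: j for j, line in enumerate(b[blo:bhi], blo)}
    return [(i, pos_b[line])
            for i, line in enumerate(a[alo:ahi], alo)
            if count_a[line] == 1 and count_b.get(line) == 1]

def toy_recurse_matches(a, b, alo=0, blo=0, ahi=None, bhi=None,
                        depth=0, max_depth=3):
    """Match unique anchors, then recurse into the gaps between them;
    max_depth is the hypothetical recursion cap discussed above."""
    if ahi is None:
        ahi, bhi = len(a), len(b)
    if depth >= max_depth or alo >= ahi or blo >= bhi:
        return []
    anchors = unique_matches(a, b, alo, blo, ahi, bhi)
    result = []
    last_a, last_b = alo, blo
    for i, j in anchors:
        # refine the unmatched gap before this anchor
        result.extend(toy_recurse_matches(a, b, last_a, last_b, i, j,
                                          depth + 1, max_depth))
        result.append((i, j))
        last_a, last_b = i + 1, j + 1
    # refine the tail gap after the last anchor
    result.extend(toy_recurse_matches(a, b, last_a, last_b, ahi, bhi,
                                      depth + 1, max_depth))
    return result

text1 = ['A', 'B', '1', 'C', 'D', 'B', '3', 'C', 'E']
text2 = ['A', 'B', '2', 'C', 'D', 'B', '4', 'C', 'E']
# first pass anchors A, D, E; the recursion then matches the
# repeated B and C lines inside each gap
print(toy_recurse_matches(text1, text2))
```

On the first example this yields the matches for A B C D B C E; on the
second example (A B D F H vs. A C E G H) the recursion into the gap finds
no unique lines and stops, matching only A and H, as described above.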

Anyway, I've been wanting to switch to PatienceSequenceMatcher for a
while, so I'd be happy to merge something like this.

v- Imports are in sorted order, and 'patiencediff' comes before
'versionedfile'. But I will second Aaron's comment that we normally do
this as:

from bzrlib import patiencediff

And if there are other similar imports we do:

from bzrlib import (
    errors,
    patiencediff,
    )

Keeping it in sorted order, one per line, helps prevent future conflicts.

Also, using 'from bzrlib import foo' and then accessing 'foo' gives us:

1) An easy way to wrap the code in a lazy_import()
2) It avoids a common class of race-condition bugs where you access
'bzrlib.bar' even though you have never imported it directly. There are
a few latent bugs in code that currently works only because some other
module imports 'bar' before this module does; as we switch to lazy
imports, we will start tripping over these.

So +1 from me if you fix the import stuff. And I would prefer it if you
do something to KnitSequenceMatcher. The rest is just discussion about
possible improvements in PatienceSequenceMatcher.

John
=:->


