SequenceMatching without indents?

John Arbash Meinel john at arbash-meinel.com
Fri Jul 14 17:32:29 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> +        else:
> +            return DotsProgressBar(to_file=to_file, **kwargs)
>      else:
> -        return DotsProgressBar(to_file=to_file, **kwargs)
> -
> 
> I hate bogus matches like this.  I'd rather see
> 
> +        else:
> +            return DotsProgressBar(to_file=to_file, **kwargs)
> +     else:
> + <more stuff>...
> -     else:
> -        return DotsProgressBar(to_file=to_file, **kwargs)
> 
> What if we stripped leading indents before generating the sequence, then
>  checked whether matches were true before emitting them as matches?

Well, this sounds like a general 'ignore whitespace' in the first pass,
and then add whitespace back in a second pass.

It seems like we would want to run whatever diff algorithm without
whitespace, and then a second 'refine matches' which would only look at
regions that matched, and re-match them with whitespace.

Other than the performance implications of doing diff twice, it sounds
good to me. In the worst case, only one line is changed in the file (say
at the end), so you end up doing 2*O(algorithm) which is 2*O(N^2) for
difflib, and *might* be 2*O(NlogN) for patience, but I've never really
broken down the algorithm.

I've looked a little bit into doing a C version of patience diff, and
one thing Matt from hg mentioned was doing a simple hash match which
lets you compare integers for the line as a first check (rather than
doing a complete string comparison).
It seems like with a custom data structure, you could compute both hash
forms, and both string forms, to help decrease the number of passes over
the data you need to do.
pass 1 builds up the hashes and string forms for all lines in the file,
and probably some dicts to make lookups fast.

Maybe then the first comparison could check if the line matches both
with and without whitespace, which might save on some of the later lookups.

Hard to say how much of an effect it would have, but it certainly could
make the diffs nicer to look at.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEt8cdJdeBCYSNAAMRApZTAJ4xGMg1hmzuajKzVLYA8WMgTFjRuACguvqe
BYgo8OWp1zat25QIRnvvf+8=
=J4Jq
-----END PGP SIGNATURE-----




More information about the bazaar mailing list