SequenceMatching without indents?

Aaron Bentley aaron.bentley at utoronto.ca
Fri Jul 14 17:58:36 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> It seems like we would want to run whatever diff algorithm without
> whitespace, and then a second 'refine matches' which would only look at
> regions that matched, and re-match them with whitespace.

I don't think so; looking for matches all over again is more expensive
than just checking the claimed matches to see if they're real.

> Other than the performance implications of doing diff twice, it sounds
> good to me. In the worst case, only one line is changed in the file (say
> at the end), so you end up doing 2*O(algorithm) which is 2*O(N^2) for
> difflib, and *might* be 2*O(NlogN) for patience, but I've never really
> broken down the algorithm.

Just double-checking the matches should be an O(M) operation, where M is
the number of matching lines found by the first pass.

> I've looked a little bit into doing a C version of patience diff, and
> one thing Matt from hg mentioned was doing a simple hash match which
> lets you compare integers for the line as a first check (rather than
> doing a complete string comparison).

That would speed things up, but I wonder how much?  I bet most of the
comparisions fail pretty early into the string comparision.

> It seems like with a custom data structure, you could compute both hash
> forms, and both string forms, to help decrease the number of passes over
> the data you need to do.
> pass 1 builds up the hashes and string forms for all lines in the file,
> and probably some dicts to make lookups fast.

Yes, doing lookup by hash probably does improve performance significantly.

> Maybe then the first comparison could check if the line matches both
> with and without whitespace, which might save on some of the later lookups.

Yeah, I dunno.  That might be good.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEt8080F+nu1YWqI0RAqw1AJsG5fODXkyEVYrG86PF5cMZ5c9UjQCeKdTF
wEJSUKCJ/qysiIjQJzC6Gyw=
=1xp3
-----END PGP SIGNATURE-----




More information about the bazaar mailing list