SequenceMatching without indents?
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jul 14 17:58:36 BST 2006
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John Arbash Meinel wrote:
> It seems like we would want to run whatever diff algorithm without
> whitespace, and then a second 'refine matches' which would only look at
> regions that matched, and re-match them with whitespace.
I don't think so; looking for matches all over again is more expensive
than just checking the claimed matches to see if they're real.
> Other than the performance implications of doing diff twice, it sounds
> good to me. In the worst case, only one line is changed in the file (say
> at the end), so you end up doing 2*O(algorithm) which is 2*O(N^2) for
> difflib, and *might* be 2*O(NlogN) for patience, but I've never really
> broken down the algorithm.
Just double-checking the matches should be an O(M) operation, where M is
the number of matching lines found by the first pass.
> I've looked a little bit into doing a C version of patience diff, and
> one thing Matt from hg mentioned was doing a simple hash match which
> lets you compare integers for the line as a first check (rather than
> doing a complete string comparison).
That would speed things up, but I wonder how much? I bet most of the
comparisions fail pretty early into the string comparision.
> It seems like with a custom data structure, you could compute both hash
> forms, and both string forms, to help decrease the number of passes over
> the data you need to do.
> pass 1 builds up the hashes and string forms for all lines in the file,
> and probably some dicts to make lookups fast.
Yes, doing lookup by hash probably does improve performance significantly.
> Maybe then the first comparison could check if the line matches both
> with and without whitespace, which might save on some of the later lookups.
Yeah, I dunno. That might be good.
Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFEt8080F+nu1YWqI0RAqw1AJsG5fODXkyEVYrG86PF5cMZ5c9UjQCeKdTF
wEJSUKCJ/qysiIjQJzC6Gyw=
=1xp3
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list