[RFC/MERGE] Avoid syncing on incorrect lines during diff
John Arbash Meinel
john at arbash-meinel.com
Wed Jan 10 16:57:31 GMT 2007
Martin Pool wrote:
...
>
> s//leading and trailing whitespace//
>
> I wonder - is it really necessary to hardcode that this is about removal
> of whitespace? Would it perhaps be cleaner to take a function that
> conditions the lines on both sides...
We certainly could have a function passed in which provides a munged
form. (Sort of the same as what the SequenceMatcher(junk) parameter is for).
The downside is that "x.strip()" is *much* faster than string.strip(x).
At least in my testing:
% python -m timeit -s "x = ' xyz '" "x.strip()"
1000000 loops, best of 3: 0.604 usec per loop
% python -m timeit -s "x = ' xyz '; import string; strip = string.strip" "strip(x)"
1000000 loops, best of 3: 1.34 usec per loop
Though it turns out you can use the same function with "str.strip"
% python -m timeit -s "x = ' xyz '; strip = str.strip" "strip(x)"
1000000 loops, best of 3: 0.665 usec per loop
And while it is a little bit slower, it is only about 10% slower instead of 2x slower.
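
As a rough sketch of the "pass a function in" idea: the caller hands over any callable that maps a line to its munged form, and str.strip can serve as that callable directly. The names condition_lines and condition are made up here for illustration; they are not existing code in the patch.

```python
def condition_lines(lines, condition=str.strip):
    """Return the conditioned form of each line.

    'condition' is a hypothetical parameter: any callable that takes one
    line and returns the form that should be used for matching.
    """
    return [condition(line) for line in lines]


lines = ['  foo\n', 'bar  \n', '\tfoo\n']
stripped = condition_lines(lines)                # default: str.strip
unchanged = condition_lines(lines, lambda s: s)  # identity: no munging
```

Defaulting to str.strip keeps the common case on the faster unbound-method path measured above, while still letting a caller swap in something else.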
>
> Would it work to have a dictionary mapping from the line to the stripped
> (or conditioned) version of that line? Then you could avoid doing the
> stripping twice for line texts repeated either in both files or within a
> file, which we expect to be reasonably common... Anyhow that can wait
> for it to be profiled.
>
Yeah, we could. I'm not 100% positive what the benefits/deficits would
be. We might be able to save some space, since duplicate lines wouldn't
be doubled in memory. I can write something that does it at least. The
other downside is that we can't write it as a list comprehension because
there is an if/then/else clause, and a side effect of setting
something in a dictionary.
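
For reference, the plain-loop form of the dictionary idea might look something like this. It is only a sketch: condition_with_cache and its arguments are illustrative names, not code from the patch.

```python
def condition_with_cache(lines, func=str.strip):
    """Condition each line, calling func only once per distinct line text."""
    cache = {}   # maps original line -> conditioned line
    result = []
    for line in lines:
        if line not in cache:
            # First time this exact text is seen: do the work and remember it.
            cache[line] = func(line)
        result.append(cache[line])
    return result, cache


result, cache = condition_with_cache(['a\n', 'b\n', 'a\n'])
```

The membership test is explicit here, so a conditioned value that happens to be an empty string is still served from the cache.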
Actually, we *can* do a list comprehension with:
  def insert_and_return(d, key, func):
      tmp = func(key)
      d[key] = tmp
      return tmp

  alt_lines = [((l in d) and d[l]) or insert_and_return(d, l, func)
               for l in lines]
It is a little abusive of language constructs, and it has a problem that
if d[l] evaluates to False then it always calls insert_and_return().
As an example, try the above with 'func=str.strip' and
lines=['\n', '\n'] (where each stripped line is the empty string), add a
print statement to 'insert_and_return', and you will see it gets called
multiple times. That doesn't happen for lines=['a\n', 'a\n'].
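
The difference can be checked by counting calls rather than printing. This is a sketch only: the extra 'calls' argument and the two wrapper functions are additions for the demonstration. The second wrapper uses a conditional expression (available since Python 2.5) in place of the and/or trick, so it tests membership rather than truthiness.

```python
def insert_and_return(d, key, func, calls):
    calls.append(key)  # record each call instead of printing
    tmp = func(key)
    d[key] = tmp
    return tmp


def and_or_version(lines, func=str.strip):
    """The and/or trick: re-runs func whenever the cached value is falsy."""
    d, calls = {}, []
    out = [((line in d) and d[line]) or insert_and_return(d, line, func, calls)
           for line in lines]
    return out, len(calls)


def conditional_version(lines, func=str.strip):
    """Conditional expression: decides on membership, not truthiness."""
    d, calls = {}, []
    out = [d[line] if line in d else insert_and_return(d, line, func, calls)
           for line in lines]
    return out, len(calls)


blank = and_or_version(['\n', '\n'])        # func is called for both lines
text = and_or_version(['a\n', 'a\n'])       # func is called once
fixed = conditional_version(['\n', '\n'])   # func is called once
```

Both versions return the same list of stripped lines; only the number of calls to func differs.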
I'll try and performance test the alternatives, though.
John
=:->