[RFC/MERGE] Avoid syncing on incorrect lines during diff

John Arbash Meinel john at arbash-meinel.com
Wed Jan 10 16:57:31 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
...

> 
> s//leading and trailing whitespace//
> 
> I wonder - is it really necessary to hardcode that this is about removal
> of whitespace?  Would it perhaps be cleaner to take a function that
> conditions the lines on both sides...

We certainly could have a function passed in which provides a munged
form. (Sort of the same as what the isjunk parameter of SequenceMatcher is for).

The downside is that "x.strip()" is *much* faster than string.strip(x).

At least in my testing:

% python -m timeit -s "x = ' xyz '" "x.strip()"
1000000 loops, best of 3: 0.604 usec per loop
% python -m timeit -s "x = ' xyz '; import string; strip = string.strip" "strip(x)"
1000000 loops, best of 3: 1.34 usec per loop

Though it turns out you can get the same method as a plain function with "str.strip":

% python -m timeit -s "x = ' xyz '; strip = str.strip" "strip(x)"
1000000 loops, best of 3: 0.665 usec per loop

And while it is a little bit slower, it is at least only about 10% slower instead of 2x slower.
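
For anyone who wants to repeat the comparison from inside a script, here is a
rough sketch using the timeit module directly. It assumes Python 3, where
string.strip no longer exists, so it only compares the bound method against
str.strip passed around as a function. The loop count is an arbitrary choice
and the numbers will differ from the ones above.

```python
import timeit

x = ' xyz '
strip = str.strip  # unbound method, usable as a plain function
loops = 100000     # arbitrary loop count, chosen only to keep the run short

# Bound method call on the string itself.
bound = timeit.timeit('x.strip()', globals={'x': x}, number=loops)

# The same method called through a name, as a passed-in function would be.
unbound = timeit.timeit('strip(x)', globals={'x': x, 'strip': strip},
                        number=loops)

print('x.strip(): %.3f usec per loop' % (bound / loops * 1e6))
print('strip(x):  %.3f usec per loop' % (unbound / loops * 1e6))
```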

> 
> Would it work to have a dictionary mapping from the line to the stripped
> (or conditioned) version of that line?  Then you could avoid doing the
> stripping twice for line texts repeated either in both files or within a
> file, which we expect to be reasonably common...  Anyhow that can wait
> for it to be profiled.
> 

Yeah, we could. I'm not 100% positive what the benefits/deficits would
be. We might be able to save some space, since duplicate lines wouldn't
be doubled in memory. I can write something that does it at least. The
other downside is that we can't write it as a list comprehension because
there is an if/then/else clause, and a side effect of setting
something in a dictionary.

Actually, we *can* do a list comprehension with:

def insert_and_return(d, key, func):
    # Compute func(key), remember it in d, and hand it back
    tmp = func(key)
    d[key] = tmp
    return tmp

alt_lines = [((l in d) and d[l]) or insert_and_return(d, l, func)
             for l in lines]

It is a little abusive of language constructs, and it has a problem that
if d[l] evaluates to False then it always calls insert_and_return().

As an example, try the above with 'func=str.strip' and
lines=['\n', '\n'], add a print statement to 'insert_and_return', and
you will see it gets called multiple times. That doesn't happen for
lines=['a\n', 'a\n'].
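
For comparison, here is a minimal sketch of the plain-loop version, which
tests dictionary membership instead of truthiness and so does not trip over
lines that strip to ''. The names condition_lines and counting_strip are made
up for illustration and are not part of the patch.

```python
def condition_lines(lines, func=str.strip):
    """Apply func to each line, calling it only once per distinct line."""
    cache = {}
    result = []
    for line in lines:
        # Test membership, not truthiness, so a cached '' still counts as a hit.
        if line not in cache:
            cache[line] = func(line)
        result.append(cache[line])
    return result


calls = []

def counting_strip(line):
    # Record every real call so repeated work is visible.
    calls.append(line)
    return line.strip()

stripped = condition_lines(['\n', '\n', 'a\n', 'a\n'], counting_strip)
```

Here counting_strip should only be called once for '\n' and once for 'a\n',
at the cost of giving up the list comprehension.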

I'll try and performance test the alternatives, though.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFpRr7JdeBCYSNAAMRAs3PAKDU3HpdvwyNjLxDej2RCF9Yvg4iXQCePKHF
ma/Nzl/hnGwVSNTiEO9hkSU=
=BrlH
-----END PGP SIGNATURE-----


