[Merge] Not all whitespace is unicode

Thu Feb 1 13:29:46 GMT 2007

Lukáš Lalinský wrote:
> On Thu, 2007-02-01 at 16:52 +1100, Martin Pool wrote:
[...]
> > 
> > There is something extremely strange going on if there are any
> > characters in string.whitespace that are not convertible to unicode...
> > 
> > Maybe string.whitespace is locale-dependent?  If it is, we could just
> > hardcode the ascii value here.
> > 
> 
> For "some" reason "something" is putting '\xa0' into string.whitespace.
> There is an old thread about this, but the problem was never fixed:
> 
> http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/13487

textwrap.py in the standard library includes this comment:

    # Hardcode the recognized whitespace characters to the US-ASCII
    # whitespace characters.  The main reason for doing this is that in
    # ISO-8859-1, 0xa0 is non-breaking whitespace, so in certain locales
    # that character winds up in string.whitespace.  Respecting
    # string.whitespace in those cases would 1) make textwrap treat 0xa0 the
    # same as any other whitespace char, which is clearly wrong (it's a
    # *non-breaking* space), 2) possibly cause problems with Unicode,
    # since 0xa0 is not in range(128).
    _whitespace = '\t\n\x0b\x0c\r '

The standard library isn't necessarily a good indicator of best practice, but
textwrap.py is a fairly modern module, so I'm inclined to believe this comment
(despite the strange vagueness about precisely which locales can trigger this
problem).  If textwrap.py hardcodes whitespace, it seems reasonable for us to do
the same.
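
To make that concrete, here is a rough sketch of what hardcoding could look
like.  The names ASCII_WHITESPACE, is_ascii_decodable and
strip_ascii_whitespace are hypothetical, made up for illustration and not taken
from our code; the only thing borrowed from textwrap.py is the literal itself.

```python
# Hypothetical constant: the same six US-ASCII whitespace characters that
# textwrap.py hardcodes, so the value never depends on the current locale.
ASCII_WHITESPACE = '\t\n\x0b\x0c\r '


def is_ascii_decodable(raw):
    """Return True if the byte string `raw` decodes cleanly as ASCII."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True


def strip_ascii_whitespace(text):
    """Strip only the hardcoded ASCII whitespace from both ends of `text`."""
    return text.strip(ASCII_WHITESPACE)
```

The byte 0xa0 fails the ASCII check, which is the kind of failure being
discussed here, while the hardcoded set passes it.  Note that a leading
non-breaking space is left alone by strip_ascii_whitespace, which may or may
not be what a given caller wants.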

-Andrew.