[Merge] Not all whitespace is unicode

John Arbash Meinel john at arbash-meinel.com
Thu Feb 1 14:50:43 GMT 2007


Andrew Bennetts wrote:
> Lukáš Lalinský wrote:
>> On Št, 2007-02-01 at 16:52 +1100, Martin Pool wrote:
> [...]
>>> There is something extremely strange going on if there are any
>>> characters in string.whitespace that are not convertible to unicode...
>>>
>>> Maybe string.whitespace is locale-dependent?  If it is, we could just
>>> hardcode the ascii value here.
>>>
>> For "some" reason "something" is putting '\xa0' to string.whitespace.
>> There is an old thread about this, but the problem was never fixed:
>>
>> http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/13487
> 
> textwrap.py in the standard library includes this comment:
> 
>     # Hardcode the recognized whitespace characters to the US-ASCII
>     # whitespace characters.  The main reason for doing this is that in
>     # ISO-8859-1, 0xa0 is non-breaking whitespace, so in certain locales
>     # that character winds up in string.whitespace.  Respecting
>     # string.whitespace in those cases would 1) make textwrap treat 0xa0 the
>     # same as any other whitespace char, which is clearly wrong (it's a
>     # *non-breaking* space), 2) possibly cause problems with Unicode,
>     # since 0xa0 is not in range(128).
>     _whitespace = '\t\n\x0b\x0c\r '
> 
> The standard library isn't necessarily a good indicator of best practice, but
> textwrap.py is a fairly modern module, so I'm inclined to believe this comment
> (despite the strange vagueness about precisely which locales can trigger this
> problem).  If textwrap.py hardcodes whitespace, it seems reasonable for us to do
> the same.
> 
> -Andrew.

I agree. I think the reason it wasn't fixed is that not all platforms
include '\xa0' as part of string.whitespace.

Specifically, I can see that here we have:
whitespace = ' \t\n\r\v\f'

Even weirder, though, is that if I do:

python -c "import string; print repr(string.whitespace)"

I get:
'\t\n\x0b\x0c\r '

Where somehow the 0x20 has moved from the beginning of the string to the
end. I can understand that \v == \x0b and \f == \x0c, but I really
don't understand how the ' ' moved from being at the beginning to being
at the end. And '\r' has moved, too. My best guess is that some other
module (locale?) is overwriting string.whitespace based on the current
locale, which would also explain how '\xa0' shows up.
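
To make the comparison concrete, here is a small sketch (not part of the
patch, and written so it also runs on a current Python). It checks that the
two strings hold the same six characters in a different order, and reports
whether the running interpreter puts '\xa0' in string.whitespace. The names
source_literal and observed are just labels for this illustration. It also
shows that the observed value is the source literal sorted by character
code, which would be consistent with something rebuilding the string by
scanning character codes, though that is only a guess.

```python
import string

# The literal as written in the source, and the repr printed above.
source_literal = ' \t\n\r\v\f'
observed = '\t\n\x0b\x0c\r '

# \v is \x0b and \f is \x0c, so both strings hold the same characters;
# only the order differs.
same_characters = sorted(source_literal) == sorted(observed)
same_order = source_literal == observed

# The observed string is the source literal in ascending character-code order.
observed_is_sorted = ''.join(sorted(source_literal)) == observed

# Whether this interpreter/locale adds the non-breaking space.
has_nbsp = '\xa0' in string.whitespace

print(same_characters, same_order, observed_is_sorted, has_nbsp)
```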

Anyway, we never fixed this because we don't always hit it: some
implementations of Python don't include '\xa0'.
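
For anyone wondering why '\xa0' is a problem at all, here is a rough
illustration, assuming a current Python where byte strings are spelled b'...'.
The byte 0xa0 is outside range(128), so it cannot be decoded as ASCII, while
in ISO-8859-1 it is the non-breaking space. The variable names are made up
for the example.

```python
# Byte 0xa0 is the non-breaking space in ISO-8859-1 but is outside ASCII.
nbsp_byte = b'\xa0'

try:
    nbsp_byte.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    # 0xa0 is not in range(128), so an ASCII decode fails.
    ascii_ok = False

# Decoded as ISO-8859-1 it becomes U+00A0 (NO-BREAK SPACE).
as_latin1 = nbsp_byte.decode('iso-8859-1')

print(ascii_ok, repr(as_latin1))
```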

So I consider the attached patch the "correct" fix. It also means we
don't need to "import string", which costs about 8ms of startup time.
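
The idea is roughly the following. This is only a sketch of the approach, not
the contents of the patch; the names _whitespace_chars and
contains_whitespace are hypothetical here and the real code may differ.

```python
# Hypothetical names; the actual patch may differ.
# Hardcode the US-ASCII whitespace characters instead of importing string,
# so the set never depends on the current locale.
_whitespace_chars = ' \t\n\r\v\f'


def contains_whitespace(s):
    """Return True if s contains any US-ASCII whitespace character."""
    for ch in s:
        if ch in _whitespace_chars:
            return True
    return False
```

With this, 'a b' counts as containing whitespace, while 'a\xa0b' does not,
because the non-breaking space is deliberately left out of the hardcoded set.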

John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hardcode_whitespace_chars.patch
Type: text/x-patch
Size: 2640 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070201/66532911/attachment-0001.bin 

