unicode issue in osutils.contains_whitespace
Lukáš Lalinský
lalinsky at gmail.com
Sun Jun 18 15:40:48 BST 2006
John Arbash Meinel wrote:
> It is the fault of Tailor if it passes in ISO-8859-1 strings rather than
> passing in ASCII or Unicode.
Tailor passes Unicode strings to bzrlib.
> The characters we are currently looking at are:
> ' \t\n\r\v\f'
> The only one not in your list above is '\v' which I believe is 'vertical
> white space'.
I played with this a bit and found the problem, though unfortunately not its
exact source. When I import bzrlib.osutils manually from the Python console,
string.whitespace contains '\t\n\x0b\x0c\r ', but when I run it through Tailor
it contains '\t\n\x0b\x0c\r \xa0'. I found this comment in
site-packages/textwrap.py:
# Hardcode the recognized whitespace characters to the US-ASCII
# whitespace characters. The main reason for doing this is that in
# ISO-8859-1, 0xa0 is non-breaking whitespace, so in certain locales
# that character winds up in string.whitespace. Respecting
# string.whitespace in those cases would 1) make textwrap treat 0xa0 the
# same as any other whitespace char, which is clearly wrong (it's a
# *non-breaking* space), 2) possibly cause problems with Unicode,
# since 0xa0 is not in range(128).
_whitespace = '\t\n\x0b\x0c\r '
So maybe bzrlib could use a hard-coded list of whitespace characters as well,
or, even better, the \s regex.
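
Roughly what I have in mind, as a minimal sketch and not the actual bzrlib
code. It assumes Python 3, where \s on str patterns also matches 0xa0 unless
re.ASCII is given; the names _ASCII_WHITESPACE and contains_whitespace_re are
made up for illustration:

```python
import re

# Hard-coded ASCII whitespace, so the result does not depend on the locale.
# This is the same set of characters textwrap.py uses for _whitespace.
_ASCII_WHITESPACE = '\t\n\x0b\x0c\r '


def contains_whitespace(s):
    """Return True if s contains any ASCII whitespace character."""
    return any(ch in _ASCII_WHITESPACE for ch in s)


# Regex variant. re.ASCII restricts \s to [ \t\n\r\f\v], so the
# non-breaking space (0xa0) is not treated as whitespace.
_whitespace_re = re.compile(r'\s', re.ASCII)


def contains_whitespace_re(s):
    """Same check as contains_whitespace, using the \\s regex."""
    return _whitespace_re.search(s) is not None
```

Either way, 'a b' and 'a\x0bb' count as containing whitespace and 'a\xa0b'
does not, no matter which locale Tailor has set.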
--
Lukáš Lalinský