unicode issue in osutils.contains_whitespace

Sun Jun 18 15:40:48 BST 2006

John Arbash Meinel wrote:
> It is the fault of Tailor if it passes in ISO-8859-1 strings rather than
> passing in ASCII or Unicode.

Tailor passes Unicode strings to bzrlib.

> The characters we are currently looking at are:
> ' \t\n\r\v\f'
> The only one not in your list above is '\v' which I believe is 'vertical
> white space'.

I was playing with this a bit and found the problem, though unfortunately not its
exact source. When I import bzrlib.osutils manually from the Python console,
string.whitespace contains '\t\n\x0b\x0c\r ', but when I run it through Tailor
it contains '\t\n\x0b\x0c\r \xa0'. I found this comment in
site-packages/textwrap.py:

# Hardcode the recognized whitespace characters to the US-ASCII
# whitespace characters.  The main reason for doing this is that in
# ISO-8859-1, 0xa0 is non-breaking whitespace, so in certain locales
# that character winds up in string.whitespace.  Respecting
# string.whitespace in those cases would 1) make textwrap treat 0xa0 the
# same as any other whitespace char, which is clearly wrong (it's a
# *non-breaking* space), 2) possibly cause problems with Unicode,
# since 0xa0 is not in range(128).
_whitespace = '\t\n\x0b\x0c\r '

So maybe bzrlib could use a hard-coded list of whitespace characters as well, or,
even better, the \s regex.
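
For illustration, here is a minimal sketch of both options. It is not the actual
bzrlib code: the names _ASCII_WHITESPACE, contains_ascii_whitespace and
contains_ascii_whitespace_re are made up for this example, and the character
list is simply the one textwrap hard-codes. The regex variant spells the
character class out instead of using \s, on the assumption that we never want
0xa0 to match; \s does match it once the pattern is compiled in Unicode mode
(which is the default for str patterns in Python 3).

```python
import re

# Hypothetical constant: the same US-ASCII whitespace characters that
# textwrap hard-codes. string.whitespace is deliberately not used.
_ASCII_WHITESPACE = u'\t\n\x0b\x0c\r '


def contains_ascii_whitespace(s):
    """Return True if s contains any US-ASCII whitespace character."""
    for ch in _ASCII_WHITESPACE:
        if ch in s:
            return True
    return False


# Explicit character class rather than \s, so that 0xa0 is never treated
# as whitespace, whatever flags or Python version are in effect.
_ascii_whitespace_re = re.compile(u'[\t\n\x0b\x0c\r ]')


def contains_ascii_whitespace_re(s):
    """Regex variant of contains_ascii_whitespace (illustrative only)."""
    return _ascii_whitespace_re.search(s) is not None
```

Neither variant consults string.whitespace, so the result should not depend on
the locale the calling program has set.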

-- 
Lukáš Lalinský