unicode issue in osutils.contains_whitespace

John Arbash Meinel john at arbash-meinel.com
Sun Jun 18 14:38:23 BST 2006


Lukáš Lalinský wrote:
> John Arbash Meinel wrote:
>> I think if we really care we should use a regex, and then do:
>>
>> _whitespace_re = re.compile(r'\s', re.UNICODE)
>>
>> if _whitespace_re.search(s):
>>   return True
>> return False
>>
>> That lets us check for any Unicode whitespace characters (that python
>> recognizes). And it means a single pass over the string, rather than a
>> pass for every possible whitespace character.
> 
> No, this will not work if bzrlib mixes str and unicode strings. The problem is
> that you can't compare unicode to str if the str is not in us-ascii. The current
> implementation is wrong as well, because it compares ISO-8859-1 characters to
> characters in unknown encoding, e.g. \xa0 is a non-breaking space in ISO-8859-1,
> but it's a printable character in ISO-8859-2, and even bigger problems are with
> comparing it to multi-byte encodings like UTF-8.

It is the fault of Tailor if it passes in ISO-8859-1 strings rather than
passing in ASCII or Unicode.

> 
> Until bzrlib uses unicode internally everywhere, I think it would be better to
> check only "standard" ascii whitespace characters:
> 
> def contains_whitespace(s):
>     for ch in ' \t\f\n\r':
>         if ch in s:
>             return True
>     else:
>         return False
> 

The characters we are currently looking at are:
' \t\n\r\v\f'
The only one not in your list above is '\v', which is the vertical tab.

But regardless, you just said that the above loop fails for you. Your
original fix just assumed that all input strings were ISO-8859-1, which
is a much larger fallacy.

I still think the regex is the proper way to go.

And at this point in the code, we should always be using unicode or
ascii. It would be a violation of our API to send in a plain string with
high bits set. (at least at this point).
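
As a minimal sketch of the regex approach, assuming callers pass unicode or
plain ASCII text as described above: the function name mirrors
osutils.contains_whitespace, but this is an illustration only, not the
actual bzrlib implementation.

```python
import re

# With re.UNICODE, \s matches any character Python classifies as
# whitespace, including non-ASCII ones such as the non-breaking space.
_whitespace_re = re.compile(r'\s', re.UNICODE)


def contains_whitespace(s):
    """Return True if the text string s contains any whitespace character.

    Assumes s is unicode (or pure ASCII) text, not bytes in an
    unknown encoding.
    """
    # A single pass over s, rather than one pass per candidate character.
    return _whitespace_re.search(s) is not None
```

This keeps the check to one scan of the string and leaves the definition of
"whitespace" to Python's Unicode tables rather than a hand-maintained list.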

John
=:->
