unicode issue in osutils.contains_whitespace
John Arbash Meinel
john at arbash-meinel.com
Sun Jun 18 14:38:23 BST 2006
Lukáš Lalinský wrote:
> John Arbash Meinel wrote:
>> I think if we really care we should use a regex, and then do:
>>
>> _whitespace_re = re.compile(r'\s', re.UNICODE)
>>
>> if _whitespace_re.search(s):
>>     return True
>> return False
>>
>> That lets us check for any Unicode whitespace characters (that python
>> recognizes). And it means a single pass over the string, rather than a
>> pass for every possible whitespace character.
>
> No, this will not work if bzrlib mixes str and unicode strings. The problem is
> that you can't compare unicode to str if the str is not in us-ascii. The current
> implementation is wrong as well, because it compares ISO-8859-1 characters to
> characters in unknown encoding, e.g. \xa0 is a non-breaking space in ISO-8859-1,
> but it's a printable character in ISO-8859-2, and even bigger problems are with
> comparing it to multi-byte encodings like UTF-8.
It is the fault of Tailor if it passes in ISO-8859-1 strings rather than
passing in ASCII or Unicode.
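To make the byte-level ambiguity concrete, here is a minimal sketch. It assumes Python 3 byte strings, and the names NBSP_BYTE and naive_byte_check are hypothetical, not anything in bzrlib. It shows that the byte 0xA0 is a no-break space when decoded as ISO-8859-1, yet the same byte also turns up inside the UTF-8 encoding of an ordinary letter, so a byte-level test reports whitespace that is not there.

```python
# Hypothetical illustration; none of these names come from bzrlib.
NBSP_BYTE = b"\xa0"

# Decoded as ISO-8859-1, the byte is U+00A0 (no-break space).
nbsp_text = NBSP_BYTE.decode("iso-8859-1")

# U+00E0 (a with grave accent) encodes to two bytes in UTF-8,
# and the second of them is 0xA0.
encoded_letter = u"\xe0".encode("utf-8")


def naive_byte_check(raw):
    """Return True if the raw bytes contain 0xA0, whatever their encoding."""
    return NBSP_BYTE in raw


# The letter contains no whitespace, but the byte-level check fires anyway.
false_positive = naive_byte_check(encoded_letter)
print(false_positive)
```

Decoding to unicode before the check avoids this, which is why the caller's encoding matters here.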
>
> Until bzrlib uses unicode internally everywhere, I think it would be better to
> check only "standard" ascii whitespace characters:
>
> def contains_whitespace(s):
>     for ch in ' \t\f\n\r':
>         if ch in s:
>             return True
>     else:
>         return False
>
The characters we are currently looking at are:
' \t\n\r\v\f'
The only one not in your list above is '\v', which is the vertical tab.
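A quick way to compare the two character lists, as a sketch only (the variable names are made up for illustration, and this assumes Python 3):

```python
import string

# The six characters the current check looks at.
current = set(" \t\n\r\v\f")

# The five characters in the proposed replacement loop.
proposed = set(" \t\f\n\r")

# Characters the proposal would stop checking for.
missing = current - proposed

# '\v' is the same character as '\x0b' (vertical tab).
print(sorted(missing))

# The current list matches Python's own ASCII whitespace set.
print(current == set(string.whitespace))
```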
But regardless, you just said that the above loop fails for you. Your
original fix just assumed that all input strings were ISO-8859-1, which
is a much larger fallacy.
I still think the regex is the proper way to go.
And at this point in the code, we should always be using unicode or
ascii. It would be a violation of our API to send in a plain string with
high bits set. (at least at this point).
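For reference, a minimal sketch of the regex version as a complete function. It assumes the input is already a unicode (or plain ASCII) string, as described above, and it is written to run on Python 3; it is not the actual osutils code.

```python
import re

# Compiled once at import time; with re.UNICODE, \s matches every
# character Python classifies as Unicode whitespace.
_whitespace_re = re.compile(r"\s", re.UNICODE)


def contains_whitespace(s):
    """Return True if the text s contains any whitespace character."""
    # A single scan over the string, instead of one 'in' test
    # per candidate whitespace character.
    return _whitespace_re.search(s) is not None


print(contains_whitespace(u"foo bar"))
print(contains_whitespace(u"foobar"))
print(contains_whitespace(u"foo\u00a0bar"))
```

The no-break space in the last call is reported as whitespace only because the input is text rather than bytes in an unknown encoding.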
John
=:->