unicode issue in osutils.contains_whitespace

Sun Jun 18 09:04:56 BST 2006

John Arbash Meinel wrote:
> I think if we really care we should use a regex, and then do:
> 
> _whitespace_re = re.compile('\s', re.UNICODE)
> 
> if _whitespace_re.search(s):
>   return True
> return False
> 
> That lets us check for any Unicode whitespace characters (that python
> recognizes). And it means a single pass over the string, rather than a
> pass for every possible whitespace character.

No, this will not work if bzrlib mixes str and unicode strings. The problem is
that you can't compare unicode to str if the str is not pure us-ascii. The current
implementation is wrong as well, because it compares ISO-8859-1 characters to
characters in an unknown encoding, e.g. \xa0 is a non-breaking space in ISO-8859-1,
but it's a printable character in some other 8-bit encodings (for example, it is
an accented 'a' in CP437), and the problems are even bigger when comparing it to
multi-byte encodings like UTF-8, where \xa0 is only one byte of a longer sequence.
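
To make the encoding problem concrete, here is a small sketch. It is written for
Python 3, where bytes plays the role of the old str type; the choice of cp437 and
of the letter U+00E0 is mine, purely for illustration, and none of this is bzrlib
code.

```python
# One raw byte, three different meanings depending on the assumed encoding.
raw = b'\xa0'

as_latin1 = raw.decode('iso-8859-1')   # U+00A0, the non-breaking space
as_cp437 = raw.decode('cp437')         # U+00E1, a printable accented letter

# On its own, 0xA0 is not even valid UTF-8 (it is a continuation byte).
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A byte-wise test for 0xA0 misfires on UTF-8 text that contains no space
# at all: U+00E0 encodes to the two bytes C3 A0.
a_grave_utf8 = '\u00e0'.encode('utf-8')
false_positive = b'\xa0' in a_grave_utf8
```

So a byte-level check for \xa0 is only meaningful when the encoding of the byte
string is known.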

Until bzrlib uses unicode internally everywhere, I think it would be better to
check only "standard" ascii whitespace characters:

def contains_whitespace(s):
    for ch in ' \t\f\n\r':
        if ch in s:
            return True
    return False

Or if you want to try to check as many whitespace characters as possible:

import re

# Raw strings keep the backslash in '\s' from being treated as a string escape.
_whitespace_re = re.compile(r'\s')
_unicode_whitespace_re = re.compile(r'\s', re.UNICODE)

def contains_whitespace(s):
    if isinstance(s, unicode):
        if _unicode_whitespace_re.search(s):
            return True
    else:
        if _whitespace_re.search(s):
            return True
    return False
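
For comparison, a rough sketch of the same two-pattern idea in Python 3 terms,
where bytes takes the place of the old str and str the place of unicode. The
name contains_whitespace_sketch is hypothetical and this is not how bzrlib does
it; it only illustrates dispatching on the type of the argument.

```python
import re

# A bytes pattern: \s matches only the ASCII whitespace characters.
_bytes_whitespace_re = re.compile(rb'\s')
# A str pattern: in Python 3, \s matches Unicode whitespace by default.
_text_whitespace_re = re.compile(r'\s')


def contains_whitespace_sketch(s):
    """Return True if s (bytes or str) contains whitespace.

    Hypothetical helper: byte strings are checked for ASCII whitespace
    only, because their encoding is not known at this point.
    """
    if isinstance(s, bytes):
        return _bytes_whitespace_re.search(s) is not None
    return _text_whitespace_re.search(s) is not None
```

With this split, a text string containing U+00A0 counts as containing whitespace,
while the lone byte \xa0 in a byte string does not, since nothing says which
encoding it came from.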