[Bulk] Re: Python 3

Wed Jun 23 23:59:05 BST 2010

On 24 June 2010 03:28, Gordon Tyler <gordon.tyler at gmail.com> wrote:
> On 23/06/2010 11:45 AM, John Arbash Meinel wrote:
>> There are also things like diff headers, where it is a bit more unclear
>> whether it is valid to have them as Unicode. (even though users see them)

(I think the illuminating example there is a UTF-16 encoded diff: it
seems to me the headers should be UTF-16 too, but whether patch can
actually read such a thing is a different question.)

> This is perhaps due to my being a Java programmer for 10 years but it
> seems quite logical to have the internal representation of a string in
> Unicode and encode that to the required encoding when it "exits" the
> application, i.e. written to stdout/file/socket/etc.
>
> Of course, it may be tricky getting an external input into Unicode if
> you're not certain what the encoding is.

Java was designed from the start with a clear separation between
strings, which are Unicode, and byte arrays.  Python has supported both
but in a kind of fuzzy way, with plain string literals defaulting to
byte strings in 2.x.

http://blog.labix.org/2009/07/02/screwing-up-python-compatibility-unicode-str-bytes
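
To make the fuzziness concrete, here is a minimal sketch of my own (not
taken from our code; the header string is just an invented example).
Under 2.x, mixing a unicode string with a byte string is silently
allowed through an implicit ASCII decode, and only fails once non-ASCII
bytes turn up.  Under Python 3 the same mix is refused outright:

```python
# Python 3: text (str) and bytes are separate types and never mix implicitly.
header = "--- a/caf\u00e9.txt"        # text; hypothetical diff header
raw = header.encode("utf-8")          # bytes; the encoding is stated explicitly

try:
    header + raw                      # str + bytes
    mixed_ok = True
except TypeError:
    # Python 3 refuses to guess an encoding, so the mistake surfaces here
    # rather than later, when non-ASCII data happens to appear.
    mixed_ok = False

# Converting back requires naming the encoding again.
round_tripped = raw.decode("utf-8")
```

The point is only that the error shows up at the line that mixes the
types, whatever the data happens to be.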

Our general approach is also to convert to and from byte encodings at
the boundary, but in the 2.x environment either we sometimes couldn't
follow that approach consistently, or there may be latent cases where
we inadvertently don't follow it.  This is one reason why getting the
tests to pass under 2to3, even if we don't want to officially support
that, may find some interesting bugs.
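
As a rough sketch of the convert-at-the-boundary idea (the function
names read_text, write_text and add_prefix are made up for illustration,
and the encodings are assumptions, not anything our code promises):
bytes are decoded once on the way in, everything in the middle handles
only text, and the text is encoded once on the way out.

```python
import io


def read_text(stream, encoding="utf-8"):
    """Decode bytes into text as they enter the program."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding="utf-8"):
    """Encode text into bytes as it leaves the program."""
    stream.write(text.encode(encoding))


def add_prefix(text, prefix="> "):
    """Internal processing sees only text, never raw bytes."""
    return "".join(prefix + line for line in text.splitlines(True))


# In-memory byte streams stand in for a file, socket or stdout.
source = io.BytesIO("na\u00efve\nline two\n".encode("utf-8"))
sink = io.BytesIO()

# Input arrives as UTF-8; output leaves as UTF-16.  The code in the middle
# does not need to know about either encoding.
write_text(sink, add_prefix(read_text(source)), encoding="utf-16")
```

Any place where the middle layer would have to know the encoding is a
candidate for one of the latent cases mentioned above.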

-- 
Martin