format string should be unicode instead of byte string

Mon Sep 7 23:31:07 BST 2009

2009/9/8 John Arbash Meinel <john at arbash-meinel.com>:
> Actually, I think he has it backwards. If you do:
>
> "path: %s" % (path,)
>
> Then if 'path' is unicode then it will upcast the string to Unicode. If
> path is 'bytes' and contains non-ascii characters, it stays bytes.

I think this is where the problem comes in.  If we have a non-ascii
byte string, it will probably/typically be utf-8.  Therefore we end up
with the result of the format interpolation being utf-8.  However,
this is not appropriate to print to all terminals.
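
A small sketch of that mismatch, assuming a terminal that expects
latin-1. The path value and variable names are made up for
illustration:

```python
# Hypothetical path containing U+00B5 (micro sign), held as UTF-8 bytes.
utf8_bytes = u'nonascii-\xb5path'.encode('utf-8')

# Combining it with a byte-string prefix leaves the UTF-8 bytes untouched.
message = b'path: ' + utf8_bytes

# A latin-1 terminal interprets each byte on its own, so the two-byte
# UTF-8 sequence for U+00B5 is displayed as two separate characters.
seen_on_latin1_terminal = message.decode('latin-1')
```

Here seen_on_latin1_terminal contains a stray U+00C2 in front of the
micro sign, which is the usual mojibake.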

If we only took his second suggestion from
<https://bugs.edge.launchpad.net/bzr/+bug/404740>, i.e.

>> 2) sys.stderr.write(unicode_string) encodes unicode_string to default encoding (ascii). So bzr should wrap sys.stderr with codecs.StreamWriter and suitable encoding.

then we could possibly say that at the output layer, if we're trying
to print a non-ascii byte string, we should interpret it as utf-8 then
reencode it into the output encoding, if that's different.  But this
seems pretty kludgy, almost as if we'd be reinventing the
defaultencoding concept.
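
To make the two ideas concrete, here is a rough sketch, not what bzr
actually does. wrap_stream and reencode_utf8 are hypothetical names,
io.BytesIO stands in for the terminal, and latin-1 is an assumed
output encoding:

```python
import codecs
import io


def wrap_stream(byte_stream, encoding, errors='replace'):
    """Return a writer that accepts unicode and encodes it for byte_stream."""
    return codecs.getwriter(encoding)(byte_stream, errors)


# Suggestion 2: unicode goes in, bytes in the output encoding come out.
fake_terminal = io.BytesIO()
out = wrap_stream(fake_terminal, 'latin-1')
out.write(u'path: nonascii-\xb5path\n')


def reencode_utf8(byte_string, output_encoding):
    """The kludge: guess that the bytes are UTF-8, then re-encode them."""
    return byte_string.decode('utf-8').encode(output_encoding, 'replace')
```

The guess in reencode_utf8 only works when the bytes really are utf-8;
a latin-1 byte string such as 'nonascii-\xb5path' makes the decode
step raise UnicodeDecodeError.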

> However if you do:
>
> u"path: %s" % (path,)
>
> If 'path' is Unicode, things are fine, and if 'path' is ascii things are
> fine (auto-upcasting ascii => unicode). However if 'path' is non-ascii
> characters you get a failure.
>
> >>> 'path: %s' % ('ascii-path',)
> 'path: ascii-path'
> >>> 'path: %s' % (u'unicode-path',)
> u'path: unicode-path'
> >>> 'path: %s' % ('nonascii-\xb5path',)
> 'path: nonascii-\xb5path'
> >>> u'path: %s' % ('ascii-path',)
> u'path: ascii-path'
> >>> u'path: %s' % (u'unicode-path',)
> u'path: unicode-path'
> >>> u'path: %s' % ('nonascii-\xb5path',)
> UnicodeDecodeError

Right, this is because most Python installations run with
defaultencoding set to 'ascii' -- though I believe it does vary by
platform -- so you can't do implicit conversions of non-ascii byte
strings.
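
Spelled out explicitly, the implicit conversion behaves roughly like
the sketch below. implicit_upcast is a hypothetical helper, and the
'ascii' default is hard-coded as an assumption rather than read from
sys.getdefaultencoding():

```python
def implicit_upcast(byte_string, default_encoding='ascii'):
    """Roughly what happens when a byte string meets a unicode format string."""
    return byte_string.decode(default_encoding)


# Pure-ascii bytes upcast without trouble.
ascii_result = u'path: %s' % (implicit_upcast(b'ascii-path'),)

# Non-ascii bytes cannot be decoded with the ascii default.
try:
    implicit_upcast(b'nonascii-\xb5path')
    failed = False
except UnicodeDecodeError:
    failed = True
```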

And in fact your examples, using presumably latin-1 paths, show that
we can't count on any particular encoding.

My reasoning is this:

- We can't safely treat user data as 'just bytes' without considering
encoding because, amongst other things, we have our own files defined
to be utf-8, and we have platforms where the terminal and filesystem
encoding differ.  We must treat it as real unicode.

- The best way to treat it as unicode correctly is to decode it early
and encode it late, so that in memory it's a unicode object, not a
byte string with implicit encoding.  In particular if we're combining
strings and they're not completely guaranteed to be either ascii or
utf-8, they should be unicode.

- As a special case, bulk data coming from our own files, e.g. dirstate,
probably needs to stay in utf-8 for speed, but this is a special
compromise.
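
A minimal sketch of "decode early, encode late", assuming we know the
filesystem encoding is latin-1. decode_early and encode_late are
hypothetical helpers for illustration, not existing bzr functions:

```python
def decode_early(raw_path, fs_encoding):
    """Turn bytes from the outside world into unicode as soon as possible."""
    return raw_path.decode(fs_encoding)


def encode_late(text, terminal_encoding):
    """Encode only at the output boundary, replacing unencodable characters."""
    return text.encode(terminal_encoding, 'replace')


raw = b'nonascii-\xb5path'            # bytes as read from a latin-1 filesystem
path = decode_early(raw, 'latin-1')   # unicode object in memory

# All string handling in between works on unicode objects only.
message = u'path: %s' % (path,)

for_utf8_terminal = encode_late(message, 'utf-8')
for_ascii_terminal = encode_late(message, 'ascii')
```

The same in-memory message can then be written to terminals with
different encodings; on an ascii terminal the micro sign degrades to
'?' instead of raising an error.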

Therefore I think Inada's suggestion with Robert's modification to not
define all static strings as unicode objects probably still makes
sense.

-- 
Martin <http://launchpad.net/~mbp/>