format string should be unicode instead of byte string

Martin Pool mbp at canonical.com
Mon Sep 7 07:50:15 BST 2009


2009/9/7 INADA Naoki <songofacandy at gmail.com>:
> Related to: https://bugs.launchpad.net/bzr/+bug/404740
> Human-readable format strings should be unicode even when they contain only ASCII.
>
> When the code below is executed::
>
>  "path: %s" % (path,)
>
> If path is a unicode string, it may cause a UnicodeEncodeError.
> But consider the next code::
>
>  u"path: %s" % (path,)
>
> It works fine whether path is unicode or bytes.

That seems to make sense.  There may be some cases where we're using
format strings to produce something for a file required to be in a
particular encoding and we would prefer to get a UnicodeError, but
they seem rare.
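
For those rare cases, a minimal sketch of what I mean: format in unicode, then encode strictly at the point where the bytes are produced, so an unrepresentable character still raises. The helper name render_line is hypothetical, not something in bzrlib, and the code is intended to run on either Python 2 or 3.

```python
def render_line(path, encoding):
    # Hypothetical helper (not a bzrlib API): build the text in unicode first.
    text = u"path: %s\n" % (path,)
    # 'strict' makes encode() raise UnicodeEncodeError for characters
    # that the target encoding cannot represent.
    return text.encode(encoding, 'strict')


# A pure-ASCII path encodes without trouble.
ascii_line = render_line(u"README", 'ascii')

# A non-ASCII path fails loudly when the file must be ASCII.
try:
    render_line(u"caf\u00e9", 'ascii')
    failed = False
except UnicodeEncodeError:
    failed = True
```

The error then comes from the explicit encode step rather than from the format operation itself.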

> Next example.
>
> class Foo:
>     def __init__(self, path):
>         self.path = unicode(path)
>     def __str__(self):
>         return self.path.encode('utf-8')
>     def __unicode__(self):
>         return self.path
>
> foo = Foo(path) # path may be unicode.
> b = "foo: %s" % (foo,)
> u = u"foo: %s" % (foo,)
>
> When a byte format string is used, __str__() is called, and any chance to
> encode to a suitable encoding is lost.
> When a unicode format string is used, __unicode__() is called.
>
> Best practice is:
> * Use unicode literals for all human-readable strings.
> * Encoding/decoding should be done at I/O boundaries, with unicode used internally.
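
The two points above could be sketched roughly as follows. This is only an illustration under assumptions: to_unicode, describe and write_message are hypothetical names, UTF-8 is an assumed encoding, and the code is intended to run on either Python 2 or 3.

```python
import io


def to_unicode(value, encoding='utf-8'):
    # Hypothetical helper: decode byte strings once, at the input boundary.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def describe(path):
    # Internally everything is unicode, including the format string.
    return u"path: %s" % (to_unicode(path),)


def write_message(stream, message, encoding='utf-8'):
    # Encode only at the output boundary.
    stream.write(message.encode(encoding))


out = io.BytesIO()
write_message(out, describe(b"caf\xc3\xa9"))  # path given as UTF-8 bytes
write_message(out, describe(u"caf\u00e9"))    # path given as unicode
```

Both calls produce the same bytes on output, because the decode happens on the way in and the encode on the way out.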

Thanks for pointing this out.  If no one points out a problem and we
do make this the standard practice then we should at least

1- update the developer guide to say this
2- update the error strings, as you're doing in
https://code.edge.launchpad.net/~songofacandy/bzr/error-encoding
3- maybe add a test_source that we always do things this way - but it
may be hard to catch just the right cases
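
For point 3, a very rough sketch of the kind of check I have in mind. This is not the existing test_source code; find_byte_format_strings is a hypothetical name, and the check is purely textual: it flags a string literal without a u prefix that is immediately followed by the % operator.

```python
import io
import tokenize


def find_byte_format_strings(source):
    """Return line numbers where an unprefixed literal is followed by '%'."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    hits = []
    for tok, nxt in zip(tokens, tokens[1:]):
        if tok[0] != tokenize.STRING:
            continue
        literal = tok[1]
        # Everything before the first quote character is the prefix (u, r, b...).
        quote_pos = min(
            i for i in (literal.find('"'), literal.find("'")) if i >= 0)
        prefix = literal[:quote_pos].lower()
        if 'u' not in prefix and nxt[0] == tokenize.OP and nxt[1] == '%':
            hits.append(tok[2][0])  # tok[2] is the (row, col) start position
    return hits


sample = (
    u'a = "path: %s" % (path,)\n'
    u'b = u"path: %s" % (path,)\n'
    u'c = "no formatting here"\n'
)
flagged = find_byte_format_strings(sample)
```

It would miss format strings held in variables or built by concatenation, and it would flag byte formats that are intentional, which is the "hard to catch just the right cases" problem.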

-- 
Martin <http://launchpad.net/~mbp/>


