format string should be unicode instead of byte string
Martin Pool
mbp at canonical.com
Mon Sep 7 07:50:15 BST 2009
2009/9/7 INADA Naoki <songofacandy at gmail.com>:
> Related to: https://bugs.launchpad.net/bzr/+bug/404740
> Human-readable format strings should be unicode even when they contain only ASCII.
>
> When the code below is executed::
>
>     "path: %s" % (path,)
>
> If path is a unicode string, it may cause a UnicodeEncodeError.
> But the following code::
>
>     u"path: %s" % (path,)
>
> It works fine whether path is unicode or bytes.
That seems to make sense. There may be some cases where we're using
format strings to produce something for a file required to be in a
particular encoding and we would prefer to get a UnicodeError, but
they seem rare.
> Next example.
>
> class Foo:
>     def __init__(self, path):
>         self.path = unicode(path)
>     def __str__(self):
>         return self.path.encode('utf-8')
>     def __unicode__(self):
>         return self.path
>
> foo = Foo(path) # path may be unicode.
> b = "foo: %s" % (foo,)
> u = u"foo: %s" % (foo,)
>
> When a byte format string is used, __str__() is called, and any chance to
> encode with a suitable encoding is lost.
> When a unicode format string is used, __unicode__() is called.
>
> Best practice is:
> * Use unicode literals for all human-readable strings.
> * Encoding/decoding should be done at I/O boundaries, with unicode used internally.
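A minimal sketch of the second point, assuming a Python 3 interpreter (where plain string literals are already unicode) so that it runs today. The names format_path_message and write_message are hypothetical and are not bzr APIs, and utf-8 as the decoding of byte paths is an assumption made only for the example. It also shows the rare case mentioned above: if the output must be in a particular encoding, a strict encode at the boundary still raises UnicodeEncodeError.

```python
import io


def format_path_message(path):
    # Hypothetical helper: keep the message as text (unicode) internally.
    if isinstance(path, bytes):
        # Decode on the way in; utf-8 is an assumption for this sketch.
        path = path.decode("utf-8")
    return "path: %s" % (path,)


def write_message(stream, message, encoding="utf-8"):
    # Encode only when the text leaves the program.
    # A strict codec raises UnicodeEncodeError on unrepresentable text.
    stream.write(message.encode(encoding))


out = io.BytesIO()
write_message(out, format_path_message("d\u00e9p\u00f4t/\u3042"))
encoded = out.getvalue()
```

Because the encoding is chosen in one place, the caller that knows the target (terminal, log file, wire protocol) decides how to encode, rather than the code that built the message.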
Thanks for pointing this out. If no one points out a problem and we
do make this the standard practice then we should at least
1- update the developer guide to say this
2- update the error strings, as you're doing in
https://code.edge.launchpad.net/~songofacandy/bzr/error-encoding
3- maybe add a test_source that we always do things this way - but it
may be hard to catch just the right cases
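For point 3, one possible shape for such a check, as a rough sketch only: it uses the standard tokenize module to flag string literals without a u prefix that are immediately followed by the % operator. The function name find_byte_format_strings is hypothetical and this is not an existing bzr test. It is written to run on Python 3, and it only sees literals directly before %, so format strings held in variables or class attributes would be missed.

```python
import io
import tokenize


def find_byte_format_strings(source):
    """Return (line, literal) pairs for non-u string literals followed by '%'."""
    hits = []
    previous = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            # Ignore layout-only tokens between the literal and the operator.
            continue
        if (tok.type == tokenize.OP and tok.string == "%"
                and previous is not None
                and previous.type == tokenize.STRING):
            literal = previous.string
            # The prefix is everything before the first quote character.
            quote_at = min(i for i in (literal.find('"'), literal.find("'"))
                           if i >= 0)
            if "u" not in literal[:quote_at].lower():
                hits.append((previous.start[0], literal))
        previous = tok
    return hits


sample = (
    'a = "path: %s" % (path,)\n'
    'b = u"path: %s" % (path,)\n'
    'c = 5 % 3\n'
)
hits = find_byte_format_strings(sample)
```

Here hits contains only the literal on line 1. A real check would also need a way to exempt format strings that are deliberately bytes, which is where catching just the right cases gets hard.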
--
Martin <http://launchpad.net/~mbp/>