Encoding woes
John Arbash Meinel
john at arbash-meinel.com
Sat Dec 24 21:03:55 GMT 2005
Well, I decided to get myself into debugging the encoding issues, and we
definitely have some.
Specifically, 'mutter()' expects everything to be a valid Python
string: a plain string has to be ASCII, and anything else should be
unicode. No big deal there.
But then we have the issue that 'run_bzr' expects the strings to be
encoded in 'bzrlib.user_encoding', which is generally valid. The
arguments come in as plain strings, and to pass funky characters, you
need the decode step.
The problem is how things interact. I tried passing 'µ' (Greek letter
mu) in from a test case, and I found that
self.run_bzr() logs the arguments which are used, and then calls
bzrlib.commands.run_bzr_catch_errors(), which runs
bzrlib.commands.run_bzr().
Well, if you call TestCase.run_bzr() with a unicode string, then the
log() works, but the call to commands.run_bzr() then tries to decode a
unicode string (which means Python first implicitly encodes it as
ASCII), and we get an assert. But if we encode ahead of time, so that
we call TestCase.run_bzr() with encoded strings, then the log() call
fails, because it writes to ~/.bzr.log and wants to encode an already
encoded string.
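To make the two failure modes concrete, here is a minimal sketch. It is
written for Python 3 and only models the implicit ASCII conversions that
Python 2 performs; py2_decode, py2_encode, outcome and USER_ENCODING are
hypothetical names for illustration, not bzrlib code, and "utf-8" is an
assumed stand-in for bzrlib.user_encoding.

```python
# Python 3 model of the Python 2 behaviour described above.
USER_ENCODING = "utf-8"  # assumption: stand-in for bzrlib.user_encoding


def py2_decode(value, encoding=USER_ENCODING):
    """Model Python 2 .decode(): text is implicitly encoded as ASCII first."""
    if isinstance(value, str):
        value = value.encode("ascii")
    return value.decode(encoding)


def py2_encode(value, encoding=USER_ENCODING):
    """Model Python 2 .encode(): bytes are implicitly decoded as ASCII first."""
    if isinstance(value, bytes):
        value = value.decode("ascii")
    return value.encode(encoding)


def outcome(func, value):
    """Return 'ok', or the name of the Unicode error that func raises."""
    try:
        func(value)
        return "ok"
    except UnicodeError as exc:
        return type(exc).__name__


text = "\u00b5"                        # the test character
encoded = text.encode(USER_ENCODING)   # the same character, pre-encoded

# The log step encodes its input; the run_bzr step decodes it.
results = {
    ("unicode", "log"): outcome(py2_encode, text),
    ("unicode", "run_bzr"): outcome(py2_decode, text),
    ("encoded", "log"): outcome(py2_encode, encoded),
    ("encoded", "run_bzr"): outcome(py2_decode, encoded),
}
```

In this model neither input type gets through both steps: the unicode
input fails at the decode step, and the encoded input fails at the log
step.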
I was thinking that we should make it so that all of the strings inside
the library are unicode, or at least valid strings. So rather than doing
.decode() inside run_bzr, we should do it in 'main()'
And then TestCase.run_bzr() would take unicode strings.
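A rough sketch of that first option, decoding once at the boundary so
that everything below it only sees unicode. This is Python 3 and purely
illustrative: sketch_main, sketch_run_bzr and decode_argv are
hypothetical names, not the real bzrlib functions, and the "utf-8"
default is an assumed stand-in for bzrlib.user_encoding.

```python
def decode_argv(argv, encoding):
    """Decode any byte-string arguments; leave text arguments untouched."""
    return [
        arg.decode(encoding) if isinstance(arg, bytes) else arg
        for arg in argv
    ]


def sketch_run_bzr(args):
    """Library-level entry point (hypothetical): accepts text only."""
    for arg in args:
        if not isinstance(arg, str):
            raise TypeError("expected text, got %r" % (arg,))
    return list(args)


def sketch_main(raw_argv, encoding="utf-8"):
    """Outermost entry point (hypothetical): the only place that decodes."""
    return sketch_run_bzr(decode_argv(raw_argv, encoding))


# A test case can now hand text straight to the library-level function,
# while the command line path goes through sketch_main().
from_test = sketch_run_bzr(["commit", "-m", "\u00b5"])
from_cli = sketch_main([b"commit", b"-m", "\u00b5".encode("utf-8")])
```

With this split, logging inside the library always receives text, so it
never has to guess whether a string was already encoded.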
The alternative is to change run_bzr() so that it always takes encoded
strings, just like the command line does, and fix up the internals there
so that the logging it does won't fail.
Another thing I would consider is that mutter() should never fail. I
don't know if there is a way to tell it to use decode(errors='replace'),
but I don't think decoding errors mean quite the same thing. Also, if
mutter() is failing, it is a sign that our code is incorrect, so it may
be okay to have it fail.
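For what a never-failing logger might look like, here is a small
Python 3 sketch using the 'replace' error handler. safe_mutter is a
hypothetical helper, not the bzrlib mutter(), and the "utf-8" log
encoding is an assumption. Note the trade-off mentioned above: bad
input is silently turned into replacement characters, which can hide a
real bug in the caller.

```python
import io


def safe_mutter(stream, message, encoding="utf-8"):
    """Write message to a binary stream without raising Unicode errors."""
    if isinstance(message, bytes):
        # Undecodable bytes become U+FFFD instead of raising.
        message = message.decode(encoding, errors="replace")
    # Unencodable characters become '?' instead of raising.
    stream.write(message.encode(encoding, errors="replace"))
    stream.write(b"\n")


log = io.BytesIO()                   # stand-in for the ~/.bzr.log file
safe_mutter(log, "unicode \u00b5")   # text input
safe_mutter(log, b"bad byte \xb5")   # not valid UTF-8 on its own
logged = log.getvalue().decode("utf-8")
```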
So to summarize:
1) What should TestCase.run_bzr() expect: encoded strings, or Unicode
strings?
2) What should bzrlib.commands.run_bzr() expect: encoded strings, or
Unicode strings?
3) Should mutter() fail if encoding/decoding would fail? (Should it only
be passed valid strings?)
My feeling is that (1) should be Unicode, (2) should be Unicode, and for
(3) mutter() should never fail. Though for now the failure is useful as
we debug our code.
In the meantime, I'm doing the work at:
http://bzr.arbash-meinel.com/branches/bzr/encoding/
John
=:->