Encoding woes

Sat Dec 24 21:03:55 GMT 2005

Well, I decided to get myself into debugging the encoding issues, and we 
definitely have some.

Specifically, 'mutter()' expects everything to be a valid Python 
string. So if a string is a plain string, it has to be ASCII; otherwise 
it should be unicode. No big deal there.

But then we have the issue that 'run_bzr' expects the strings to be 
encoded in 'bzrlib.user_encoding', which is generally a valid 
assumption. The arguments come in as plain strings, and to pass funky 
characters through, you need the decode step.

The problem is how things interact. I tried passing 'µ' (Greek letter 
mu) in from a test case, and I found that self.run_bzr() logs the 
arguments which are used, and then calls 
bzrlib.commands.run_bzr_catch_errors(), which runs 
bzrlib.commands.run_bzr().

Well, if you call TestCase.run_bzr() with a unicode string, then the 
log() works, but the call to commands.run_bzr() then tries to decode a 
unicode string (which means Python first assumes it is actually ASCII), 
and we get an assert. But if we encode ahead of time, so that we call 
TestCase.run_bzr() with encoded strings, then the log() call fails, 
because it writes to ~/.bzr.log and wants to encode an already encoded 
string.
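
To make the two failure modes concrete, here is a rough sketch. It is 
not bzrlib code: it is written for Python 3, where the implicit ASCII 
conversion no longer exists, so py2_decode() and py2_log() are 
hypothetical helpers that emulate that conversion by hand. 
USER_ENCODING is an assumed stand-in for 'bzrlib.user_encoding'.

```python
# Python 3 sketch that emulates Python 2's implicit ASCII coercion.
# All names here are illustrative; none of them come from bzrlib.
USER_ENCODING = "utf-8"  # assumed stand-in for bzrlib.user_encoding


def py2_decode(value, encoding):
    """Emulate Python 2 `value.decode(encoding)`.

    Called on a unicode object, Python 2 first encoded it as ASCII.
    """
    if isinstance(value, str):
        value = value.encode("ascii")  # the implicit step that fails
    return value.decode(encoding)


def py2_log(value):
    """Emulate writing to a log file that encodes what it is given.

    Handed a byte string, Python 2 first decoded it as ASCII.
    """
    if isinstance(value, bytes):
        value = value.decode("ascii")  # the implicit step that fails
    return value.encode("utf-8")


def try_call(func, *args):
    """Return 'ok', or the name of the Unicode error that was raised."""
    try:
        func(*args)
    except UnicodeError as exc:
        return type(exc).__name__
    return "ok"


arg_unicode = "\u00b5"
arg_encoded = arg_unicode.encode(USER_ENCODING)

results = {
    "log(unicode)": try_call(py2_log, arg_unicode),
    "decode(unicode)": try_call(py2_decode, arg_unicode, USER_ENCODING),
    "log(encoded)": try_call(py2_log, arg_encoded),
    "decode(encoded)": try_call(py2_decode, arg_encoded, USER_ENCODING),
}
```

Neither form of the argument gets through both steps: the unicode 
value passes the log but not the decode, and the encoded value passes 
the decode but not the log.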

I was thinking that we should make it so that all of the strings inside 
the library are unicode, or at least valid strings. So rather than doing 
.decode() inside run_bzr, we should do it in 'main()'.

And then TestCase.run_bzr() would take unicode strings.
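
As a minimal sketch of that layout (Python 3 syntax, with simplified 
stand-ins for main() and run_bzr() rather than the real bzrlib 
functions, and USER_ENCODING assumed in place of 
'bzrlib.user_encoding'), the decode happens exactly once at the outer 
edge:

```python
USER_ENCODING = "utf-8"  # assumed stand-in for bzrlib.user_encoding


def run_bzr(argv):
    """Library entry point: accepts only unicode (text) arguments."""
    for arg in argv:
        if not isinstance(arg, str):
            raise TypeError("run_bzr expects unicode, got %r" % (arg,))
    # Placeholder for the real command dispatch: just echo the arguments.
    return list(argv)


def main(raw_argv):
    """Outermost entry point: decode the raw arguments once, then hand off."""
    argv = [arg.decode(USER_ENCODING) for arg in raw_argv]
    return run_bzr(argv)
```

With this split a test case can call run_bzr() directly with unicode 
strings, and only main() ever needs to know about the user's encoding.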

The alternative is to change run_bzr() so that it always takes encoded 
strings, just like the command line does, and fix up the internals there 
so that the logging it does won't fail.

Another thing I would consider is that mutter() should never fail. I 
don't know if there is a way to tell it to use decode(errors='replace'), 
but I don't think decoding errors mean quite the same thing. Also, if 
mutter() is failing, it is a sign that our code is incorrect, so it may 
be okay to have it fail.
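
For what it's worth, a never-failing variant could look roughly like 
the sketch below. It is written for Python 3, and safe_mutter() is a 
hypothetical helper, not the real mutter(). It relies on the standard 
'replace' error handler, which substitutes a placeholder character 
instead of raising.

```python
import io


def safe_mutter(log_file, message, encoding="utf-8"):
    """Write message to a binary log file without ever raising a
    Unicode error. Bad bytes or unencodable characters are replaced."""
    if isinstance(message, bytes):
        # Already-encoded input: decode leniently instead of failing.
        message = message.decode(encoding, errors="replace")
    log_file.write(message.encode(encoding, errors="replace") + b"\n")


log = io.BytesIO()
safe_mutter(log, "\u00b5")   # unicode text
safe_mutter(log, b"\xff")    # bytes that are not valid UTF-8
```

The tradeoff is the one mentioned above: the bad input is silently 
replaced in the log, so the underlying bug no longer announces itself.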

So to summarize:

1) What should TestCase.run_bzr() expect: encoded strings, or Unicode 
strings?

2) What should bzrlib.commands.run_bzr() expect: encoded strings, or 
Unicode strings?

3) Should mutter() fail if encoding/decoding would fail? (Or should it 
only ever be passed valid strings?)

My feeling is that (1) should be Unicode, (2) should be Unicode, and for 
(3) mutter() should never fail. Though for now having it fail is useful 
while we debug our code.

In the meantime, I'm doing the work at:
http://bzr.arbash-meinel.com/branches/bzr/encoding/

John
=:->