Encoding woes

Sun Dec 25 01:24:18 GMT 2005

On Sat, 2005-12-24 at 15:03 -0600, John Arbash Meinel wrote:
> Well, I decided to get myself into debugging the encoding issues, and we 
> definitely have some.
> 
> Specifically, 'mutter()' expects everything to be in valid python 
> strings. So if a string is a plain string, it has to be ascii, otherwise 
> it should be unicode. No big deal there.
> 
> But then we have the issue that 'run_bzr' expects the strings to be 
> encoded in 'bzrlib.user_encoding', which is generally valid. The 
> arguments come in as plain strings, and to pass funky characters, you 
> need the decode step.
> 
> The problem is how things interact. I tried passing 'µ' (greek letter 
> mu) in from a test case, and I found that
> self.run_bzr() logs the arguments which are used, and then calls 
> bzrlib.commands.run_bzr_catch_errors(), which runs 
> bzrlib.commands.run_bzr().
> 
> Well if you try to call TestCase.run_bzr() with a unicode string, then 
> the log() works, but then the call to commands.run_bzr() tries to decode 
> a unicode string (which means it assumes it is actually ASCII), and we 
> get an assert. But if we encode ahead of time, so that we call 
> TestCase.run_bzr() with encoded strings, then the log() call fails, 
> because it writes to ~/.bzr.log and wants to encode an already encoded 
> string.
> 
> I was thinking that we should make it so that all of the strings inside 
> the library are unicode, or at least valid strings. So rather than doing 
> .decode() inside run_bzr, we should do it in 'main()'
> 
> And then TestCase.run_bzr() would take unicode strings.
> 
> The alternative is to change run_bzr() so that it always takes encoded 
> strings, just like the command line does, and fix up the internals there 
> so that the logging it does won't fail.
> 
> Another thing I would consider, is that mutter() should never fail. I 
> don't know if there is a way to tell it to use decode(errors='replace'), 
> but I don't think decoding errors mean quite the same thing. Also, if 
> mutter() is failing, it is a sign that our code is incorrect, so it may 
> be okay having it fail.
> 
> So to summarize:
> 
> 1) What should TestCase.run_bzr() expect. Encoded strings, or Unicode 
> strings?
> 
> 2) What should bzrlib.commands.run_bzr() expect. Encoded strings, or 
> Unicode strings?
> 
> 3) Should mutter() fail if encoding/decoding would fail? (Should it only 
> be passed valid strings)
> 
> My feeling is that (1) should be Unicode, (2) should be Unicode, and (3) 
> should never fail. Though for now it is useful as we debug our code.
> 
> In the meantime, I'm doing the work at:
> http://bzr.arbash-meinel.com/branches/bzr/encoding/

I think that our internal code should generally use plain strings: even
if we were to require u'' everywhere, other library users will not
realise this, and chaos will ensue. And requiring isinstance(foo, unicode)
everywhere would be just nasty.

So code that uses the public APIs should *always* be safe when passing
in plain ASCII strings from Python.

For mutter, which can fail, we should indeed pre-encode ourselves (or
whatever else it takes) to ensure that it never fails. But if it does
have to do this to avoid a failure, it should log that it would have
failed, if that makes sense.
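
To make that concrete, here is a minimal sketch of a logger that never
raises but records when it had to paper over bad input. It is written
for a current Python, where `bytes` plays the role of the plain string
in this thread and `str` the role of `unicode`. The name safe_mutter,
its arguments, and the wording of the marker line are all made up for
illustration; this is not bzrlib's mutter().

```python
import io


def safe_mutter(log_file, message, encoding="utf-8"):
    """Write message to a binary log file without raising on bad text.

    Hypothetical helper: a plain (byte) string is expected to be ASCII.
    If it is not, the offending bytes are replaced and a marker line is
    written first, so the log shows that a strict write would have failed.
    """
    if isinstance(message, bytes):
        try:
            text = message.decode("ascii")
        except UnicodeDecodeError:
            # Fall back to a lossy decode, but leave a trace of the problem.
            text = message.decode("ascii", errors="replace")
            log_file.write(b"[mutter: non-ascii plain string, replaced]\n")
    else:
        text = message
    # errors="replace" keeps the final encode from raising as well.
    log_file.write(text.encode(encoding, errors="replace") + b"\n")


if __name__ == "__main__":
    log = io.BytesIO()
    safe_mutter(log, "plain text")
    safe_mutter(log, b"caf\xc3\xa9")  # already-encoded input
    print(log.getvalue())
```

The marker line keeps the caller's bug visible in the log without
turning a logging call into a crash.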

With respect to tests that need to give user input in options or
commands, I think it's reasonable to have a variation on run_bzr that
takes unicode strings, and the plain one we use should then decode to
unicode and use that one.

main() then is just an alternative user of the plain one, that provides
a specific encoding to decode with.
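
A rough sketch of that layering, again for a current Python where
`bytes` stands in for the plain string and `str` for `unicode`. The
function names, signatures, and the stand-in return value are
assumptions for illustration only, not the actual bzrlib.commands code.

```python
def run_bzr_unicode(argv):
    """Hypothetical core entry point: every argument is already text."""
    for arg in argv:
        if not isinstance(arg, str):
            raise TypeError("expected a text argument, got %r" % (arg,))
    # Stand-in for real command dispatch: just hand the arguments back.
    return list(argv)


def run_bzr(argv, encoding="ascii"):
    """Plain-string entry point: decode exactly once, then delegate."""
    decoded = [a.decode(encoding) if isinstance(a, bytes) else a
               for a in argv]
    return run_bzr_unicode(decoded)


def main(raw_argv, user_encoding):
    """Command-line user of run_bzr that supplies the user's encoding."""
    return run_bzr(raw_argv, encoding=user_encoding)


if __name__ == "__main__":
    print(run_bzr([b"commit", b"-m", b"msg"]))
    print(main([b"add", "\u00b5".encode("latin-1")], "latin-1"))
```

With this split the decode step happens in one place only, so tests can
call the unicode variant directly and never hit a second decode.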

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.