Encoding woes

John A Meinel john at arbash-meinel.com
Mon Dec 26 22:04:44 GMT 2005


Robert Collins wrote:
> On Sat, 2005-12-24 at 15:03 -0600, John Arbash Meinel wrote:
>> Well, I decided to get myself into debugging the encoding issues, and we 
>> definitely have some.
>>
>> Specifically, 'mutter()' expects everything to be in valid python 
>> strings. So if a string is a plain string, it has to be ascii, otherwise 
>> it should be unicode. No big deal there.
>>
>> But then we have the issue that 'run_bzr' expects the strings to be 
>> encoded in 'bzrlib.user_encoding', which is generally valid. The 
>> arguments come in as plain strings, and to pass funky characters, you 
>> need the decode step.
>>
>> The problem is how things interact. I tried passing 'µ' (greek letter 
>> mu) in from a test case, and I found that
>> self.run_bzr() logs the arguments which are used, and then calls 
>> bzrlib.commands.run_bzr_catch_errors(), which runs 
>> bzrlib.commands.run_bzr().
>>
>> Well if you try to call TestCase.run_bzr() with a unicode string, then 
>> the log() works, but then the call to commands.run_bzr() tries to decode 
>> a unicode string (which means it assumes it is actually ASCII), and we 
>> get an assert. But if we encode ahead of time, so that we call 
>> TestCase.run_bzr() with encoded strings, then the log() call fails, 
>> because it writes to ~/.bzr.log and wants to encode an already encoded 
>> string.
>>
>> I was thinking that we should make it so that all of the strings inside 
>> the library are unicode, or at least valid strings. So rather than doing 
>> .decode() inside run_bzr, we should do it in 'main()'
>>
>> And then TestCase.run_bzr() would take unicode strings.
>>
>> The alternative is to change run_bzr() so that it always takes encoded 
>> strings, just like the command line does, and fix up the internals there 
>> so that the logging it does won't fail.
>>
>> Another thing I would consider, is that mutter() should never fail. I 
>> don't know if there is a way to tell it to use decode(errors='replace'), 
>> but I don't think decoding errors mean quite the same thing. Also, if 
>> mutter() is failing, it is a sign that our code is incorrect, so it may 
>> be okay having it fail.
>>
>> So to summarize:
>>
>> 1) What should TestCase.run_bzr() expect. Encoded strings, or Unicode 
>> strings?
>>
>> 2) What should bzrlib.commands.run_bzr() expect. Encoded strings, or 
>> Unicode strings?
>>
>> 3) Should mutter() fail if encoding/decoding would fail? (Should it only 
>> be passed valid strings)
>>
>> My feeling is that (1) should be Unicode, (2) should be Unicode, and (3) 
>> should never fail. Though for now it is useful as we debug our code.
>>
>> In the meantime, I'm doing the work at:
>> http://bzr.arbash-meinel.com/branches/bzr/encoding/
> 
> 
> I think that our internal code should be generally plain strings: Even
> if we were to require u'' everywhere, other library users will not realise
> this, and chaos will ensue. And requiring isinstance(foo, unicode)
> everywhere would be just nasty.
> 
> So code that uses public apis should *always* be safe if passing in
> ascii strings inside python.
> 
> For mutter, which can fail, we should indeed pre-encode ourselves or
> whatever to ensure that it never fails - but if it does have to do this
> to avoid failure, it should log that it would have failed.. if that
> makes sense.
> 
> With respect to tests that need to give user input in options or
> commands, I think it's reasonable to have a variation on run_bzr that
> takes unicode strings, and the plain one we use should then decode to
> unicode and use that one.
> 
> main() then is just an alternative user of the plain one, that provides
> a specific encoding to code with.
> 
> Rob
> 

I think there should be 3 types of strings inside bzrlib:

1) Plain ascii strings. These are isinstance(x, str) and should not
contain characters outside the ascii set (so x.decode() should always
work).
2) Unicode strings. Anything outside of ascii should be a unicode
string.
3) Text blobs. These are just arrays of bytes. Stuff that we would never
try to encode/decode. This is stuff like file contents, etc. The only
thing we might do with these strings is split them on newlines.
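
A rough sketch of how the three kinds behave differently. It is written
for Python 3, where bytes stands in for the plain str described above and
str stands in for unicode. The helper is_plain_ascii is made up for
illustration and is not part of bzrlib.

```python
def is_plain_ascii(raw):
    """Return True if the byte string is pure ASCII, so decoding cannot fail."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True


# 1) Plain ascii string: decoding always works.
plain = b'commit'

# 2) Unicode string: anything outside of ascii is already text.
text = '\u00b5'

# 3) Text blob: raw bytes such as file contents, never encoded or decoded.
blob = b'line one\nline \xb5 two\n'

print(is_plain_ascii(plain))
print(is_plain_ascii(blob))

# The only operation applied to a blob is splitting it on newlines.
print(blob.splitlines(True))
```

Note that kinds 1 and 3 share the same type, so telling them apart is a
matter of convention in the code rather than something a type check can
enforce.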

Anything read from stdin or from the argument list needs to be
converted into one of those 3 types.
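
As a sketch of what that conversion could look like at the boundary,
assuming the arguments arrive as encoded bytes and the encoding is known
(the role bzrlib.user_encoding plays). Again this is Python 3, and
decode_args is a hypothetical helper, not an existing bzrlib function.

```python
def decode_args(raw_args, encoding):
    """Decode encoded command-line arguments once, at the boundary.

    Everything past this point sees text only, so nothing deeper in
    the library needs to decode again.
    """
    return [arg.decode(encoding) for arg in raw_args]


# The same character arrives as different bytes depending on the
# terminal encoding, but decodes to the same text.
from_latin1 = decode_args([b'commit', b'-m', b'\xb5'], 'latin-1')
from_utf8 = decode_args([b'commit', b'-m', b'\xc2\xb5'], 'utf-8')

print(from_latin1 == from_utf8)
```

With the decode done in one place, a test could pass unicode straight to
the library and skip this step entirely.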

John
=:->
