Encoding woes

John A Meinel john at arbash-meinel.com
Fri Dec 30 21:15:50 GMT 2005


Jan Hudec wrote:
> On Thu, Dec 29, 2005 at 10:26:34 -0600, John A Meinel wrote:
>> Jan Hudec wrote:
>>> On Wed, Dec 28, 2005 at 10:52:28 -0600, John A Meinel wrote:
>>>> Technically right now we always have to use the latter form, because we
>>>> need to use unicode internally for filesystem operations, but we should
>>>> turn them into unicode internally if we need that. We should not
>>>> *require* that unicode strings be passed in.
>>>>
>>>> Perhaps a better example is:
>>>>
>>>> t = b.working_tree()
>>>> t.commit('text message')
>>>>
>>>> That text message only has to be unicode if you are using characters
>>>> outside of the ascii subset.
>>> How do I tell an ascii string originating from the source from an ebcdic
>>> string coming from an input stream? They are the same type.
>> Because the code which is reading the ebcdic string should know that it
>> is reading from the input stream and that it needs to be decoded. That
>> is actually my point. Code at the interface layer (such as coming in
>> from sys.argv or reading from stdin, etc) needs to do the translation
>> before it gets deeper inside bzrlib.
> 
> Which is exactly the reason unicode should be required. Because otherwise I,
> as a front-end writer, can easily forget to decode the input and not notice
> until much later.

What I would prefer, rather than "isinstance(param, unicode)" is to have:

param = unicode(param)

Which means that internally the string would be converted into a unicode
string. If it was a plain string containing non-ascii bytes, python will
fail to decode it (with the default ascii encoding you get a
UnicodeDecodeError).
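
As a rough sketch of that coercion, written for Python 3 (where the
unicode builtin no longer exists, so the ascii decode has to be spelled
out). The name to_text is a hypothetical helper for illustration, not
something in bzrlib:

```python
def to_text(param):
    """Return param as text, accepting ascii-only byte strings.

    Hypothetical helper: approximates what unicode(param) does in
    Python 2 when the default encoding is ascii.
    """
    if isinstance(param, str):
        # Already decoded text, nothing to do.
        return param
    if isinstance(param, bytes):
        # Strict ascii decode: raises UnicodeDecodeError on any byte >= 0x80.
        return param.decode('ascii')
    raise TypeError('expected text or ascii bytes, got %s'
                    % type(param).__name__)
```

With this, to_text(b'text message') and to_text('text message') both give
back the same text, while a latin-1 encoded byte string with an accented
character raises UnicodeDecodeError instead of slipping through.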

There are a lot of places (especially in test code) where it is far
easier to just write ascii strings.
I really would prefer not to require adding a u prefix everywhere.

> 
>>> Also the argument is likely to be from some text entry widget and if it is
>>> a plain string, it is likely a locale-encoded string quite possibly
>>> containing non-ascii characters. Or that case is forbidden?
>> Then the thing which handles the text entry widget needs to translate it
>> before sending it into bzrlib. bzrlib cannot be responsible for knowing
>> how widget foo works. It *can* tell people that "I expect this argument
>> to be in this form, if it is not, then it is your responsibility to make
>> it so".
> 
> Yes. And actually requiring the object to be of unicode type helps people
> know immediately that they forgot it.
> 
>> The other possibility would be to create a new string type, which tracks
>> what encoding it is in. But really it is much easier to require the next
>> layer up to handle encode/decode.
> 
> If you need to tell it its encoding, you can decode it to unicode instead.
> 
> What I would prefer is if I could set encoding of an IO stream and it
> actually returned unicode strings then -- as it does in perl. But in python
> encoding is set by default only on std{in,out,err} and only if they are on
> a terminal (so python foo.py and python foo.py | cat are NOT equivalent,
> which I consider just plain BOGUS). And they can't be set later on. Yes,
> I know I can wrap the streams in a recoding object -- though I don't
> understand why I have to when the stream already has recoding capability
> built in.
> 
>>> Ok, so it's actually forbidden to pass in locale-encoded string except for
>>> blob? That'd work, but I am not sure it will be any easier to use. Because
>>> most of the developers don't regularly use non-ascii characters, it will be
>>> easy to forget to encode something. And that may be rather hard to hunt down
>>> later.
>> Well, actually you need them to "decode" it so that you have a full
>> unicode string. But yes, the idea is that inside bzrlib you would have
>> unencoded strings. If they are plain ascii, then they can be the plain
>> 'string' type. Anything more, and they need to be unicode.
> 
> Yes. And what I fear is that it will be hard to debug if someone forgets to
> decode, because 90% of time, the input will be in ascii by coincidence.
> If it did not accept plain string, I'd be sure I have not forgotten to decode
> anywhere at the cost of writing u before all string constants. Personally
> I would make that trade off.
> 
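
For reference, the stream wrapping Jan mentions above (decoding at the IO
boundary so that reads hand back unicode) might look roughly like this in
Python 3. The helper name open_decoded and the latin-1 sample data are
assumptions made for the example; io.TextIOWrapper is the standard library
class doing the work:

```python
import io


def open_decoded(raw_stream, encoding):
    """Wrap a binary stream so that reads return decoded text.

    Hypothetical helper: the caller has to know the encoding, which is
    the interface layer's job in this discussion.
    """
    return io.TextIOWrapper(raw_stream, encoding=encoding, errors='strict')


# Stand-in for a pipe or file that delivers latin-1 encoded bytes.
raw = io.BytesIO('caf\u00e9\n'.encode('latin-1'))
reader = open_decoded(raw, 'latin-1')

# The line comes back as decoded text, not as bytes.
line = reader.readline()
```

Everything past this point sees only decoded text, so the deeper layers
never have to guess where the bytes came from.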

I don't feel quite as willing to make the trade off. Though I don't hold
that position as strongly as I used to.
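
To make the trade off concrete, here is a small Python 3 sketch of the two
policies side by side. Both function names are hypothetical and neither is
bzrlib code; they only illustrate a strict type check versus accepting
ascii byte strings:

```python
def commit_strict(message):
    """Hypothetical strict entry point: refuse anything that is not text."""
    if not isinstance(message, str):
        raise TypeError('message must be decoded text, got %s'
                        % type(message).__name__)
    return message


def commit_lenient(message):
    """Hypothetical lenient entry point: also accept ascii-only bytes."""
    if isinstance(message, bytes):
        # Fails only when a non-ascii byte actually shows up.
        return message.decode('ascii')
    return message


# A front end that forgot to decode, with input that happens to be ascii.
undecoded = b'fix typo'

# The lenient version accepts it, so the missing decode goes unnoticed.
accepted = commit_lenient(undecoded)

# The strict version reports the mistake on the very first call.
try:
    commit_strict(undecoded)
    strict_rejected = False
except TypeError:
    strict_rejected = True
```

The lenient form keeps test code short, and the strict form surfaces a
forgotten decode immediately rather than on the first non-ascii input.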

John
=:->

