Encoding woes

Fri Dec 30 11:08:02 GMT 2005

On Thu, Dec 29, 2005 at 10:26:34 -0600, John A Meinel wrote:
> Jan Hudec wrote:
> > On Wed, Dec 28, 2005 at 10:52:28 -0600, John A Meinel wrote:
> >> Technically right now we always have to use the latter form, because we
> >> need to use unicode internally for filesystem operations, but we should
> >> turn them into unicode internally if we need that. We should not
> >> *require* that unicode strings be passed in.
> >>
> >> Perhaps a better example is:
> >>
> >> t = b.working_tree()
> >> t.commit('text message')
> >>
> >> That text message only has to be unicode if you are using characters
> >> outside of the ascii subset.
> > 
> > How do I tell an ascii string originating from the source from an ebcdic
> > string coming from an input stream? They are the same type.
> 
> Because the code which is reading the ebcdic string should know that it
> is reading from the input stream and that it needs to be decoded. That
> is actually my point. Code at the interface layer (such as coming in
> from sys.argv or reading from stdin, etc) needs to do the translation
> before it gets deeper inside bzrlib.

Which is exactly the reason unicode should be required: otherwise I, as
a front-end writer, can easily forget to decode the input and not notice
until much later.
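
To make the "same type" problem concrete, here is a minimal sketch in
present-day Python 3 terms, where bytes plays the role of the plain string
and str the role of unicode. The helper read_message and the use of cp500
as the EBCDIC codec are assumptions for illustration, not bzrlib code.

```python
import io


def read_message(stream, encoding):
    """Hypothetical front-end helper: decode raw input at the boundary."""
    raw = stream.read()          # bytes; the type says nothing about encoding
    return raw.decode(encoding)  # text from here on


ascii_input = io.BytesIO("fix typo".encode("ascii"))
ebcdic_input = io.BytesIO("fix typo".encode("cp500"))

# Both payloads have the same type, so only the code that reads them
# can know which encoding applies.
same_type = type(ascii_input.getvalue()) is type(ebcdic_input.getvalue())

msg_a = read_message(ascii_input, "ascii")
msg_e = read_message(ebcdic_input, "cp500")
```

Once decoded, both messages compare equal; before decoding, nothing in the
type distinguishes them.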

> > Also the argument is likely to be from some text entry widget and if it is
> > a plain string, it is likely a locale-encoded string quite possibly
> > containing non-ascii characters. Or is that case forbidden?
> 
> Then the thing which handles the text entry widget needs to translate it
> before sending it into bzrlib. bzrlib cannot be responsible for knowing
> how widget foo works. It *can* tell people that "I expect this argument
> to be in this form, if it is not, then it is your responsibility to make
> it so".

Yes. And actually requiring the object to be of unicode type lets people
know immediately that they forgot to do so.
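
A rough sketch of what such a requirement could look like, again in Python 3
terms (str is the unicode type, bytes the plain string). The function commit
here is a hypothetical stand-in, not the real bzrlib method.

```python
def commit(message):
    """Hypothetical stand-in for an API that insists on decoded text."""
    if not isinstance(message, str):
        raise TypeError(
            "message must be text, got %s" % type(message).__name__
        )
    return message


commit("text message")  # decoded text is accepted

try:
    # An undecoded string is rejected even though it is pure ascii.
    commit("text message".encode("ascii"))
    rejected = False
except TypeError:
    rejected = True
```

The failure shows up on the very first call with an undecoded string,
whatever characters it happens to contain.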

> The other possibility would be to create a new string type, which tracks
> what encoding it is in. But really it is much easier to require the next
> layer up to handle encode/decode.

If you need to tell it its encoding, you can decode it to unicode instead.

What I would prefer is to be able to set the encoding of an IO stream and
have it actually return unicode strings -- as it does in perl. But in python
encoding is set by default only on std{in,out,err} and only if they are on
a terminal (so python foo.py and python foo.py | cat are NOT equivalent,
which I consider just plain BOGUS). And they can't be set later on. Yes,
I know I can wrap the streams in a recoding object -- though I don't
understand why I have to when the stream already has recoding capability
built in.
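
For reference, the wrapping approach can be sketched with the standard codecs
module. The in-memory byte stream and the utf-8 encoding are assumptions that
stand in for whatever stream and locale a real front end would have.

```python
import codecs
import io

# Stand-in for a byte stream such as a pipe or a file opened in binary mode.
raw = io.BytesIO("Příliš žluťoučký kůň\n".encode("utf-8"))

# codecs.getreader returns a StreamReader class for the named encoding;
# instantiating it around the stream yields an object whose read methods
# return decoded text instead of raw bytes.
reader = codecs.getreader("utf-8")(raw)
text = reader.read()
```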

> > Ok, so it's actually forbidden to pass in locale-encoded string except for
> > blob? That'd work, but I am not sure it will be any easier to use. Because
> > most of the developers don't regularly use non-ascii characters, it will be
> > easy to forget to encode something. And that may be rather hard to hunt down
> > later.
> 
> Well, actually you need them to "decode" it so that you have a full
> unicode string. But yes, the idea is that inside bzrlib you would have
> unencoded strings. If they are plain ascii, then they can be the plain
> 'string' type. Anything more, and they need to be unicode.

Yes. And what I fear is that it will be hard to debug if someone forgets to
decode, because 90% of the time the input will be in ascii by coincidence.
If it did not accept plain strings, I'd be sure I have not forgotten to
decode anywhere, at the cost of writing u before all string constants.
Personally I would make that trade-off.
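
The failure mode I mean can be sketched like this, in Python 3 terms. The
function lenient_commit is hypothetical and only imitates an API that accepts
plain strings and treats them as ascii.

```python
def lenient_commit(message):
    """Hypothetical API that accepts plain strings and assumes ascii."""
    if isinstance(message, bytes):
        message = message.decode("ascii")  # silent for ascii input only
    return message


# The caller forgot to decode, but the input is ascii, so nobody notices.
ok = lenient_commit(b"fix typo")

try:
    # The same forgotten decode blows up once real input arrives.
    lenient_commit("oprava překlepu".encode("utf-8"))
    failed = False
except UnicodeDecodeError:
    failed = True
```

The bug is in the caller in both cases, but it only becomes visible with
the second input.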

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>