Encoding woes

Jan Hudec bulb at ucw.cz
Thu Dec 29 09:17:40 GMT 2005


On Wed, Dec 28, 2005 at 10:52:28 -0600, John A Meinel wrote:
> Jan Hudec wrote:
> > On Mon, Dec 26, 2005 at 16:04:44 -0600, John A Meinel wrote:
> >> I think there should be 3 types of strings inside bzrlib:
> >>
> >> 1) Plain ascii strings, these are isinstance(x, string), these should
> >> not have characters outside the ascii set. (so x.decode() should always
> >> work)
> >> 2) Unicode strings, for anything outside of ascii, it should be a
> >> unicode string.
> > 
> > Why do these need to be two types of strings? Ascii is a subset of unicode.
> 
> Short answer, they already exist, and we are just not forbidding library
> users from passing them in. The point is that I can call:
> 
> b = Branch.open('.')
> 
> Which is an ascii string, and I don't have to always call
> b = Branch.open(u'.')

One of the things I really don't like about Python is that I can't tell it to
imply u for all string literals. I am used to that from both Perl and Tcl.

Anyway, I did not mean to forbid _that_. I meant that it should be forbidden
for that argument to remain undecoded inside, e.g. if it were stored directly
in an attribute.
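
To make that concrete, here is a minimal sketch of decoding once at the
boundary so that nothing undecoded is ever stored. It assumes Python 3, where
bytes stands in for the plain string type and str for unicode. The names
as_text and Location are hypothetical and are not part of bzrlib.

```python
# Sketch only: as_text and Location are hypothetical names, not bzrlib API.

def as_text(value, encoding="ascii"):
    """Return value as text, decoding byte strings strictly."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        # Strict decoding raises UnicodeDecodeError on any byte that is
        # not valid in the given encoding.
        return value.decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


class Location:
    def __init__(self, path):
        # Decode once at the boundary; the attribute never holds raw bytes.
        self.path = as_text(path)


loc_from_bytes = Location(b".")
loc_from_text = Location(".")
```

Callers can still pass either form, but every attribute holds decoded text,
and a non-ASCII byte string fails at the call site rather than somewhere
later.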

> Technically right now we always have to use the latter form, because we
> need to use unicode internally for filesystem operations, but we should
> turn them into unicode internally if we need that. We should not
> *require* that unicode strings be passed in.
> 
> Perhaps a better example is:
> 
> t = b.working_tree()
> t.commit('text message')
> 
> That text message only has to be unicode if you are using characters
> outside of the ascii subset.

How do I tell an ASCII string originating from the source from an EBCDIC
string coming from an input stream? They are the same type.

Also, the argument is likely to come from some text entry widget, and if it is
a plain string, it is likely a locale-encoded string, quite possibly
containing non-ASCII characters. Or is that case forbidden?
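
A small illustration of both problems, again assuming Python 3 with bytes
playing the role of the plain string. The cp037 code page is used here as one
example of EBCDIC, and ISO 8859-2 as an assumed locale encoding for the widget
input. Neither choice comes from the thread.

```python
# The same character, once as an ASCII literal and once as it would
# arrive from an EBCDIC (cp037) input stream.
ascii_dot = ".".encode("ascii")
ebcdic_dot = ".".encode("cp037")

# Both values have the same type, so the type says nothing about origin.
same_type = type(ascii_dot) is type(ebcdic_dot)

# The EBCDIC byte for '.' (0x4B) lies inside the ASCII range, so decoding
# it as ASCII succeeds but silently yields a different character.
misread = ebcdic_dot.decode("ascii")

# Locale-encoded input containing non-ASCII characters
# (ISO 8859-2 is an assumption made for this example).
widget_bytes = "Příliš".encode("iso8859_2")
try:
    widget_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Only someone who knows the real encoding can recover the text.
recovered = widget_bytes.decode("iso8859_2")
```

The first case is the worse one: nothing fails, the value is simply wrong.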

> >> 3) Text blobs. These are just arrays of bytes. Stuff that we would never
> >> try to encode/decode. This is stuff like file contents, etc. The only
> >> thing we might do with these strings is split them on newlines.
> > 
> > Hm, I believe there should be a special class made for them. So they could
> > always be told from case 1. Also if all ascii strings are made unicode (which
> > I think they can), then the plain string type can be outlawed except in the
> > external interface (only the part for front-ends) so forgetting to classify
> > the input would be immediately obvious.
> 
> I think Robert was specifically against forbidding plain ascii strings
> because it makes the library harder to use. And I agree with him on that
> point. Which is where I'm saying that if we have an object which is a
> plain string type, it should be either a text blob which we aren't
> planning on interpreting, or it must be ascii only.
>
> I think we can do okay by just properly naming our variables and
> parameters. If it ends in 'text' or 'lines', it is a text blob, in all
> other cases (committer, message, revision_id, etc) it needs to be either
> a valid ascii string, or unicode.

OK, so it's actually forbidden to pass in a locale-encoded string except as a
blob? That would work, but I am not sure it will be any easier to use. Because
most of the developers don't regularly use non-ASCII characters, it will be
easy to forget to encode something, and that may be rather hard to hunt down
later.
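
A rough sketch of how the naming rule could be checked at the interface, and
of why ASCII-only test data would never notice a forgotten conversion. It
assumes Python 3 (bytes as the plain string, str as unicode). The helper
check_param and the parameter names passed to it are hypothetical.

```python
# Hypothetical helper; the suffix rule follows the convention quoted above.
BLOB_SUFFIXES = ("text", "lines")


def check_param(name, value):
    """Pass blobs through untouched; require ASCII or text otherwise."""
    if name.endswith(BLOB_SUFFIXES):
        # Blobs are never interpreted, so any bytes are acceptable.
        return value
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the caller passed locale-encoded
        # bytes containing non-ASCII characters.
        return value.decode("ascii")
    return value


# ASCII-only data passes, so a missing conversion stays invisible.
message = check_param("message", b"fix typo")

# A blob parameter may carry arbitrary bytes.
file_text = check_param("file_text", b"\xc5\x99\n")

# Only a non-ASCII sample exposes the problem.
try:
    check_param("message", "Příliš".encode("utf-8"))
    caught = False
except UnicodeDecodeError:
    caught = True
```

This suggests keeping at least one non-ASCII sample in the tests for every
non-blob parameter, since developers who work only in ASCII would otherwise
never hit the failing branch.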

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>