Encoding woes

Mon Jan 2 20:50:44 GMT 2006

On Wed, 2005-12-28 at 10:52 -0600, John A Meinel wrote:
> Jan Hudec wrote:
> > On Mon, Dec 26, 2005 at 16:04:44 -0600, John A Meinel wrote:
> >> I think there should be 3 types of strings inside bzrlib:
> >>
> >> 1) Plain ascii strings, these are isinstance(x, str), these should
> >> not have characters outside the ascii set. (so x.decode() should always
> >> work)
> >> 2) Unicode strings, for anything outside of ascii, it should be a
> >> unicode string.
> > 
> > Why do these need to be two types of strings? Ascii is a subset of unicode.
> 
> Short answer, they already exist, and we are just not forbidding library
> users from passing them in. The point is that I can call:
> 
> b = Branch.open('.')
> 
> Which is an ascii string, and I don't have to always call
> b = Branch.open(u'.')
> 
> Technically right now we always have to use the latter form, because we
> need to use unicode internally for filesystem operations, but we should
> turn them into unicode internally if we need that. We should not
> *require* that unicode strings be passed in.
> 
> Perhaps a better example is:
> 
> t = b.working_tree()
> t.commit('text message')
> 
> That text message only has to be unicode if you are using characters
> outside of the ascii subset.
> 
> > 
> >> 3) Text blobs. These are just arrays of bytes. Stuff that we would never
> >> try to encode/decode. This is stuff like file contents, etc. The only
> >> thing we might do with these strings is split them on newlines.
> > 
> > Hm, I believe there should be a special class made for them. So they could
> > always be told from case 1. Also if all ascii strings are made unicode (which
> > I think they can), then the plain string type can be outlawed except in the
> > external interface (only the part for front-ends) so forgetting to classify
> > the input would be immediately obvious.
> 
> I think Robert was specifically against forbidding plain ascii strings
> because it makes the library harder to use. And I agree with him on that
> point. Which is where I'm saying that if we have an object which is a
> plain string type, it should be either a text blob which we aren't
> planning on interpreting, or it must be ascii only.
> 
> I think we can do okay by just properly naming our variables and
> parameters. If it ends in 'text' or 'lines', it is a text blob, in all
> other cases (committer, message, revision_id, etc) it needs to be either
> a valid ascii string, or unicode.

Yay Hungarian. :)

So I am thinking that something like the following happens in apis:

def initialize(klass, path):
    """blah."""
    path = safe_unicode(path)

def safe_unicode(a_string):
    """Coerce a_string into unicode.

    If a_string is already unicode, it is returned.
    If it is a plain (non-unicode) string, it is decoded as if it were utf8.
    If the decoding fails, the exception is wrapped as a
    BzrBadParameter exception.
    """

This will allow library users to use '.', u'.' and u'\uffff' and file
system paths in unicode safely.
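
To make the idea concrete, here is a rough sketch of how the body of
safe_unicode could look. It is only an illustration under some
assumptions, not the actual bzrlib code: it is written for Python 3,
where bytes plays the role of the plain string and str is the unicode
type, and the BzrBadParameter class defined here is a hypothetical
stand-in for the exception named in the docstring.

```python
class BzrBadParameter(Exception):
    """Hypothetical stand-in for the BzrBadParameter error (assumption)."""


def safe_unicode(a_string):
    """Coerce a_string into unicode (str in Python 3).

    Sketch only: assumes the input is either str or bytes.
    """
    # Already unicode: hand it back untouched.
    if isinstance(a_string, str):
        return a_string
    try:
        # Plain string: decode as utf8. Pure ascii input always succeeds,
        # because ascii is a subset of utf8.
        return a_string.decode('utf-8')
    except UnicodeDecodeError as e:
        # Wrap the decoding failure so callers see a single error type.
        raise BzrBadParameter(
            'not a valid utf8 string: %r' % (a_string,)) from e


if __name__ == '__main__':
    # Both spellings of the path end up as the same unicode string.
    print(safe_unicode(b'.') == safe_unicode('.'))
```

With something like this in place, an api such as initialize only needs
the one call at the top and can use unicode everywhere after it, while a
caller who passes bytes that are not valid utf8 gets BzrBadParameter
instead of a bare decode error from somewhere deeper in the library.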

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.