[rfc] bencode unicode strings

Tue Jun 16 15:15:17 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> Vincent Ladeuil writes:
>>>>>>> "bialix" == Alexander Belchenko <bialix at ukr.net> writes:
>>
>>     bialix> Actually this is what I'm planning to implement for QBzr.
>>     bialix> My first mail was (bad) attempt to ask is there interest
>>     bialix> for such thing in the core, because of Vincent' suggestions
>>     bialix> about different common things. If not -- then not.
>>
>> There is interest.
>>
>> But as you said yourself:
>> - bzr-gtk uses utf-8,
>> - bzr itself uses utf-8,
>> - Qt requires Unicode.
>>
>> So a shared version will need to encode its data in utf-8 and
>> qbzr being an interface between Qt and bzr will need to handle
>> the Unicode <-> utf-8 conversions.
>>
>> This is especially important for persistent data but also for shared
>> data.
> 
> /me gives up

So Alexander, maybe a different tactic...

bencode *the protocol* doesn't know about Unicode strings. All it has is
"3:foo" to indicate that we have a byte-string following.

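As a rough illustration of that length-prefixed form, here is a minimal
sketch. It assumes Python 3, where bytes and text are separate types.
The helper names bencode_bytes and bdecode_bytes are made up for this
example; they are not bzrlib functions.

```python
def bencode_bytes(data: bytes) -> bytes:
    """Encode one byte string as <length>:<payload>."""
    return str(len(data)).encode("ascii") + b":" + data


def bdecode_bytes(stream: bytes) -> bytes:
    """Decode a stream that holds exactly one bencoded byte string."""
    prefix, sep, rest = stream.partition(b":")
    if not sep or not prefix.isdigit():
        raise ValueError("not a bencoded string")
    if len(rest) != int(prefix):
        raise ValueError("length prefix does not match payload")
    # The payload comes back as raw bytes; no encoding is implied.
    return rest


print(bencode_bytes(b"foo"))
print(bdecode_bytes(b"3:foo"))
```

Nothing in the stream says whether the payload is text, which is the
whole difficulty.
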
Because of that, if you want to have a serialization that can *mix*
8-bit strings and Unicode strings, then you need to add some sort of new
flag into the stream. Such as:

"u3:foo" which would tell the decoder that it should call
.decode('utf-8') on the string. This would clearly break
compatibility, because existing versions of bdecode would fail to decode
the 'u' type.
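
To make the compatibility break concrete, here is a sketch of what such
a flag could look like, assuming Python 3. The 'u' marker is only the
idea floated above, not part of bencode, and decode_string and
legacy_decode_string are hypothetical names for this example.

```python
def decode_string(stream: bytes):
    """Decode one string item; a leading b'u' marks UTF-8 text (hypothetical)."""
    is_text = stream.startswith(b"u")
    body = stream[1:] if is_text else stream
    prefix, sep, rest = body.partition(b":")
    if not sep or not prefix.isdigit():
        raise ValueError("not a bencoded string")
    payload = rest[:int(prefix)]
    return payload.decode("utf-8") if is_text else payload


def legacy_decode_string(stream: bytes) -> bytes:
    """Stand-in for a decoder that predates the flag: it expects a digit first."""
    prefix, sep, rest = stream.partition(b":")
    if not sep or not prefix.isdigit():
        raise ValueError("unknown type marker: %r" % stream[:1])
    return rest[:int(prefix)]


print(repr(decode_string(b"u3:foo")))   # text
print(repr(decode_string(b"3:foo")))    # bytes
try:
    legacy_decode_string(b"u3:foo")
except ValueError as exc:
    print("legacy decoder fails:", exc)
```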

The other possibility is to have *every* string be considered as UTF-8
data. So that "3:foo" will be decoded as 'foo'.decode('utf-8') == u'foo'.

The potential issues are:

1) Mixing user data with logical data, that is, mixing Unicode strings
with 8-bit strings that *aren't* UTF-8. ('1:\xff' == boom)

2) Getting Unicode strings for things that are supposed to be ascii
text. (Dictionary keys come to mind.)

3) How would you handle bencode(['8-bit', u'unicod\xe9']) or even:
   bencode(['\xc2\xb5tf-8', u'unicod\xe9'])
   (Do you just assume that an 8-bit string is already utf-8, do you
    always try to cast it back up to Unicode and then back down? etc)
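
One possible answer to (1) and (3), sketched in Python 3: encode text as
UTF-8, and accept 8-bit strings only if they already are valid UTF-8.
This policy is an assumption made for illustration, not something the
thread settled on, and to_utf8 and bencode_list are hypothetical names.

```python
def to_utf8(value) -> bytes:
    """Assumed policy: text is encoded, bytes must already be valid UTF-8."""
    if isinstance(value, str):
        return value.encode("utf-8")
    value.decode("utf-8")  # raises UnicodeDecodeError for non-UTF-8 bytes
    return value


def bencode_list(values) -> bytes:
    """Encode a flat list of strings (text or bytes) as a bencode list."""
    parts = [b"l"]
    for value in values:
        payload = to_utf8(value)
        parts.append(str(len(payload)).encode("ascii") + b":" + payload)
    parts.append(b"e")
    return b"".join(parts)


print(bencode_list([b"8-bit", "unicod\xe9"]))
print(bencode_list([b"\xc2\xb5tf-8", "unicod\xe9"]))
try:
    bencode_list([b"\xff"])
except UnicodeDecodeError:
    print("boom: b'\\xff' is not UTF-8")
```

After a round trip both kinds of input come back as text, so the
distinction between '8-bit' and u'unicode' is lost.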

As far as adding a function like:
  bencodeu() and bdecodeu()

I don't have a problem with that. It is the same as bdecode_as_tuple.
Actually, calling it "bdecode_as_unicode" may be a good way to put it.

I would guess that 'bencode_from_unicode' is going to be quite a bit
harder to write, because it hits a lot of edge cases.
'bdecode_as_unicode' is going to be fairly straightforward if you just
wrap a '.decode('utf-8')' around everything.
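
A rough sketch of the decode side, assuming Python 3. The name
bdecode_as_unicode comes from the suggestion above, but the body is only
an illustration with its own small parser (the hypothetical helper
_decode), not the bzrlib implementation.

```python
def _decode(data: bytes, pos: int):
    """Return (value, next_pos) for the item starting at data[pos]."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead in (b"l", b"d"):
        items = []
        pos += 1
        while data[pos:pos + 1] != b"e":
            item, pos = _decode(data, pos)
            items.append(item)
        if lead == b"l":
            return items, pos + 1
        # A dict is stored as alternating keys and values.
        return dict(zip(items[0::2], items[1::2])), pos + 1
    # Otherwise a string: slice by the byte length first, then decode.
    colon = data.index(b":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length].decode("utf-8"), start + length


def bdecode_as_unicode(data: bytes):
    """Decode a bencoded stream, treating every string as UTF-8 text."""
    value, pos = _decode(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


print(bdecode_as_unicode(b"d3:key8:unicod\xc3\xa9e"))
```

Note that dictionary keys come back as text as well, which is issue (2)
above.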

In an initial implementation, I might say that all 'string' inputs
*must* be Unicode, but that runs into problems with things like dict
keys, as mentioned before. (bencode doesn't support things like integer
keys.)

I'll also note that len(unicode) != len(unicode.encode('utf-8')) (except
for ASCII), so it isn't as simple as just decoding the whole stream: the
length prefixes count bytes, so they would be wrong for the decoded text.
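
A small Python 3 sketch of that length mismatch. The helper bencode_text
is a hypothetical name used only here.

```python
def bencode_text(text: str) -> bytes:
    """Encode text as UTF-8 with a length prefix that counts bytes."""
    payload = text.encode("utf-8")
    return str(len(payload)).encode("ascii") + b":" + payload


# Two items back to back: 7 characters (8 bytes), then b"foo".
stream = bencode_text("unicod\xe9") + b"3:foo"

# Wrong: decode the whole stream, then slice by the prefix.
decoded = stream.decode("utf-8")
prefix, _, rest = decoded.partition(":")
wrong = rest[:int(prefix)]          # runs one character into the next item

# Right: slice the bytes by the prefix, then decode the slice.
bprefix, _, brest = stream.partition(b":")
right = brest[:int(bprefix)].decode("utf-8")

print(len("unicod\xe9"), len("unicod\xe9".encode("utf-8")))
print(repr(wrong), repr(right))
```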

Honestly, if we really wanted to do something that was properly Unicode
aware, bencode was not the best pick. For good or bad, that is one thing
"rio" did choose: all Values are Unicode strings.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAko3qPUACgkQJdeBCYSNAAOIKQCfaB816nbQG6Vo5cOEP7AnnfEd
f1cAn2wz/X3oYaUDQ3c1ZfJL9Y1teESf
=oLoa
-----END PGP SIGNATURE-----