UTF-16 versus UCS2
John Arbash Meinel
john at arbash-meinel.com
Wed May 23 11:29:29 BST 2007
Dennis Benzinger wrote:
> On Wed, 23 May 2007 10:40:15 +0200
> John Arbash Meinel <john at arbash-meinel.com> wrote:
>
>> I just checked around and found this page:
>> http://en.wikipedia.org/wiki/UTF-16
>>
>> Which basically says that UTF-16 == UCS-2 for everything that isn't
>> in the "extended" character set.
>>
>> Basically, Unicode grew beyond what 16 bits can hold, so you need
>> more than 65,536 values in your serialized data. UCS-2 *doesn't*
>> support those. UTF-16 supports them by encoding some code points
>> with 4 bytes instead of just 2. (Like how UTF-8 uses 1 to 4 bytes.)
>>
>> Anyway, I would guess that you could write a trivial UTF-16 => UCS2
>> converter. If you want to be really safe, it should check for the
>> special codes, and either complain or just flatten them.
>>
>> Either that, or just accept some small chance for inaccuracy when
>> passing UTF-16 to code that is expecting a UCS-2 string. (The extra
>> codes are probably extremely rare, and UCS2 can't handle them anyway).
>>
>> John
>> =:->
>
>
> A quote from the Basic Questions part of the Unicode FAQ
> <http://www.unicode.org/faq/basic_q.html#25>:
>
> "In particular, for the purposes of data exchange, UCS-2 and UTF-16 are
> identical formats. Both are 16-bit, and have exactly the same code unit
> representation."
>
> You didn't give much context info in your mail so I don't really know
> if that applies to your use case.
>
>
> Dennis Benzinger
>
The specific discussion is that C# (and .NET in general, I believe) uses
UCS-2 as its internal Unicode representation, while Python may use 2 or
4 bytes per character, depending on how it was built. I don't know
whether that makes the internal Python representation UCS-2 or UTF-16.
They were trying to pass data between Python and C# (through the C API),
and trying to figure out if they needed to do anything special (like
always going down to UTF-8 first).
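
For what it's worth, one quick way to tell which kind of Python build you
have is to look at sys.maxunicode. This is only a minimal sketch; the
helper name python_build_width is made up for illustration and is not part
of any Python or .NET API:

```python
import sys

def python_build_width():
    """Return 'narrow' or 'wide' for this interpreter (hypothetical helper)."""
    # A narrow build reports 0xFFFF here, meaning strings are stored as
    # 16-bit code units. Wide builds (and Python 3.3+) report 0x10FFFF.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(python_build_width())
```

On a narrow build, a supplementary character shows up as two 16-bit units
inside the string, which is the case that matters when handing data to
code that expects UCS-2.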
AFAIK UCS-2 and UTF-16 are identical for everything but the supplementary
characters (code points above U+FFFF). To grab a more complete quote from
that FAQ entry:

    When interpreting what people have meant by "UCS-2" in past usage, it is
    best thought of as not a data format, but as an indication that an
    implementation does not interpret any supplementary characters. In
    particular, for the purposes of data exchange, UCS-2 and UTF-16 are
    identical formats. Both are 16-bit, and have exactly the same code unit
    representation.

    The effective difference between UCS-2 and UTF-16 lies at a different
    level, when one is interpreting a sequence code units as code points or
    as characters. In that case, a UCS-2 implementation would not handle
    processing like character properties, codepoint boundaries, collation,
    etc. for supplementary characters. [MD] & [KW]

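Note the "a UCS-2 implementation would not handle ..." part. There are
things you can represent in Unicode and UTF-16 that cannot be represented
in UCS-2, namely the supplementary characters. So my previous statement
stands: the two are identical for everything else. Don't worry about the
differences yet. :)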
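
If you do want to be really safe, the "trivial converter" I mentioned
could look roughly like the sketch below. It assumes little-endian UTF-16
bytes with no BOM, and the function name, the flatten flag and the choice
of U+FFFD as the replacement are all my own assumptions, not anything
taken from the Python or .NET APIs:

```python
import struct

def utf16_to_ucs2(data, flatten=False, replacement=0xFFFD):
    """Return bytes that are safe to hand to UCS-2-only code.

    Hypothetical helper: 'data' is assumed to be little-endian UTF-16
    without a BOM. Surrogate code units either raise ValueError or,
    with flatten=True, are replaced by 'replacement'.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDFFF:
            # Surrogate range: part of a supplementary character.
            if not flatten:
                raise ValueError(
                    "surrogate code unit 0x%04X at index %d" % (u, i))
            # A well-formed high/low pair collapses to one replacement.
            if (u <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                i += 1
            out.append(replacement)
        else:
            # Everything else is the same 16-bit unit in UCS-2 and UTF-16.
            out.append(u)
        i += 1
    return struct.pack("<%dH" % len(out), *out)
```

Data without supplementary characters passes through byte for byte, which
is the "identical formats" case from the FAQ. Only the surrogate range
needs a decision between complaining and flattening.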
John
=:->