UTF-16 versus UCS2

Dennis Benzinger Dennis.Benzinger at gmx.net
Wed May 23 11:13:10 BST 2007


On Wed, 23 May 2007 10:40:15 +0200,
John Arbash Meinel <john at arbash-meinel.com> wrote:

> I just checked around and found this page:
> http://en.wikipedia.org/wiki/UTF-16
> 
> Which basically says that UTF-16 == UCS-2 for every character inside
> the Basic Multilingual Plane (code points up to U+FFFF).
> 
> Basically, Unicode expanded the number of code points beyond what
> fits in 16 bits, so you need more than 65,536 values in your
> serialized data. UCS-2 *doesn't* support those. UTF-16 supports them
> by encoding some code points with 4 bytes instead of just 2. (Like
> how UTF-8 uses a variable number of bytes, up to 4 under RFC 3629.)
> 
> Anyway, I would guess that you could write a trivial UTF-16 => UCS2 
> converter. If you want to be really safe, it should check for the 
> special codes, and either complain or just flatten them.
> 
> Either that, or just accept some small chance for inaccuracy when 
> passing UTF-16 to code that is expecting a UCS-2 string. (The extra 
> codes are probably extremely rare, and UCS2 can't handle them anyway).
> 
> John
> =:->
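
The converter John describes could be sketched roughly as follows. This
is only an illustration in Python 3, not code from bzr: the function name
utf16le_to_ucs2le, the strict flag, and the choice of U+FFFD as the
"flattened" value are all assumptions, and it handles only little-endian
input without a byte order mark.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (an assumed choice)


def utf16le_to_ucs2le(data, strict=True):
    """Return UCS-2 bytes for little-endian UTF-16 input (hypothetical helper).

    Code units outside the surrogate range 0xD800-0xDFFF are copied
    unchanged. Surrogates raise ValueError when strict is true; otherwise
    each surrogate pair (or lone surrogate) is flattened to U+FFFD.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDFFF:
            if strict:
                raise ValueError(
                    "code unit %#06x at index %d is a surrogate, "
                    "which UCS-2 cannot represent" % (unit, i))
            # A high surrogate followed by a low surrogate is one
            # character, so consume both units and emit one replacement.
            if (unit <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                i += 1
            out.append(REPLACEMENT)
        else:
            out.append(unit)
        i += 1
    return struct.pack("<%dH" % len(out), *out)
```

With strict=True this is the "complain" variant; with strict=False it is
the "flatten" variant. For text that contains no surrogates the output
bytes are identical to the input.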


A quote from the Basic Questions part of the Unicode FAQ
<http://www.unicode.org/faq/basic_q.html#25>:

"In particular, for the purposes of data exchange, UCS-2 and UTF-16 are
identical formats. Both are 16-bit, and have exactly the same code unit
representation."

You didn't give much context in your mail, so I don't really know
whether that applies to your use case.
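
A small Python 3 check of what the FAQ statement means in practice may
help. It is only a sketch; the helper name utf16_code_units is made up
for this example. For characters up to U+FFFF, each UTF-16 code unit is
simply the code point, which is exactly what UCS-2 would store. Only a
character above U+FFFF produces a surrogate pair.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    encoded = text.encode("utf-16-be")  # this codec writes no byte order mark
    return [int.from_bytes(encoded[i:i + 2], "big")
            for i in range(0, len(encoded), 2)]


# Every character here is at or below U+FFFF, so the UTF-16 code units
# equal the code points one for one.
bmp_text = "Gr\u00fc\u00dfe \u20ac"
assert utf16_code_units(bmp_text) == [ord(ch) for ch in bmp_text]

# U+1D11E (MUSICAL SYMBOL G CLEF) lies above U+FFFF, so UTF-16 needs two
# code units (a surrogate pair). UCS-2 has no way to express it.
clef = "\U0001D11E"
assert utf16_code_units(clef) == [0xD834, 0xDD1E]
```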


Dennis Benzinger


