UTF-16 versus UCS2

John Arbash Meinel john at arbash-meinel.com
Wed May 23 09:40:15 BST 2007


I just checked around and found this page:
http://en.wikipedia.org/wiki/UTF-16

Which basically says that UTF-16 == UCS-2 for everything that isn't in
the "extended" character set (code points above U+FFFF, outside the
Basic Multilingual Plane).

Basically, Unicode extended how many code points there were going to be,
so you need more than 65,536 values in your serialized data. UCS-2
*doesn't* support those. UTF-16 supports them by encoding some code
points with 4 bytes (a surrogate pair) instead of just 2. (Like how
UTF-8 uses up to 4 bytes per code point; the original UTF-8 design
allowed up to 6.)
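
To make the size difference concrete, here is a quick sketch in Python 3.
The two sample characters are my own picks for illustration, not anything
from the Wikipedia page: one inside the 16-bit range and one outside it.

```python
# A character inside the 16-bit range: one code unit, same in UCS-2 and UTF-16.
bmp = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
# A character outside the 16-bit range: UTF-16 needs a surrogate pair.
astral = "\U0001D11E"   # MUSICAL SYMBOL G CLEF

bmp_utf16 = bmp.encode("utf-16-le")        # 2 bytes
astral_utf16 = astral.encode("utf-16-le")  # 4 bytes (two 16-bit code units)
astral_utf8 = astral.encode("utf-8")       # 4 bytes

print(len(bmp_utf16), len(astral_utf16), len(astral_utf8))
```

The "-le" codec variant is used only so that no byte-order mark is
prepended, which would otherwise add 2 bytes to each count.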

Anyway, I would guess that you could write a trivial UTF-16 => UCS-2
converter. If you want to be really safe, it should check for the
special codes (the surrogate range, 0xD800-0xDFFF), and either complain
or just flatten them.
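
Something along these lines might do as a first cut. This is only a
sketch: the function name utf16_to_ucs2, the errors= parameter, and the
choice of U+FFFD as the flattened value are all made up for illustration,
and it assumes little-endian input with no byte-order mark.

```python
import struct


def utf16_to_ucs2(data, errors="strict", replacement=0xFFFD):
    """Convert UTF-16-LE bytes (no BOM) to UCS-2-LE bytes.

    Hypothetical helper. Surrogate code units (0xD800-0xDFFF) are the
    part of UTF-16 that UCS-2 cannot represent. With errors="strict"
    they raise ValueError; with errors="replace" each surrogate pair
    (or lone surrogate) is flattened to one replacement code unit.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDFFF:
            if errors == "strict":
                raise ValueError(
                    "code unit %d is a surrogate (0x%04X)" % (i, unit))
            # A high surrogate followed by a low one is a single character,
            # so consume both and emit just one replacement.
            if (unit <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                i += 1
            out.append(replacement)
        else:
            out.append(unit)
        i += 1
    return struct.pack("<%dH" % len(out), *out)
```

Plain 16-bit text passes through unchanged, so
utf16_to_ucs2("abc".encode("utf-16-le")) returns the same bytes it was
given. Anything with a surrogate either raises or comes back with U+FFFD
in its place, depending on errors=.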

Either that, or just accept some small chance of inaccuracy when
passing UTF-16 to code that is expecting a UCS-2 string. (The extra
codes are probably extremely rare, and UCS-2 can't handle them anyway.)

John
=:->


