UTF-16 versus UCS2
John Arbash Meinel
john at arbash-meinel.com
Wed May 23 09:40:15 BST 2007
I just checked around and found this page:
http://en.wikipedia.org/wiki/UTF-16
Which basically says that UTF-16 == UCS-2 for everything that isn't in
the "extended" character set (that is, for every code point up to
U+FFFF).
Basically, Unicode extended how many codes there were going to be, so
you need more than 65,536 values in your serialized data. UCS-2
*doesn't* support those. UTF-16 supports them by encoding some codes as
a pair of 16-bit units (a "surrogate pair"), so 4 bytes instead of just
2. (Like how UTF-8 uses a variable number of bytes, up to 4 per
character.)
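To make the 2-byte versus 4-byte difference concrete, here is a small
Python sketch. The helper name utf16_code_units is made up for this
example, and the two sample characters are arbitrary picks, one from
inside the 16-bit range and one from outside it.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


bmp = "\u00e9"         # U+00E9, fits in 16 bits
astral = "\U0001d11e"  # U+1D11E, does not fit in 16 bits

print([hex(u) for u in utf16_code_units(bmp)])     # ['0xe9']
print([hex(u) for u in utf16_code_units(astral)])  # ['0xd834', '0xdd1e']
print(len(astral.encode("utf-8")))                 # 4
```

The first character is one 16-bit unit, identical in UCS-2 and UTF-16.
The second needs two units in UTF-16, both in the 0xD800-0xDFFF range,
and has no UCS-2 representation at all.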
Anyway, I would guess that you could write a trivial UTF-16 => UCS-2
converter. If you want to be really safe, it should check for the
special (surrogate) codes, 0xD800 through 0xDFFF, and either complain
or just flatten them.
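A rough sketch of such a converter in Python, not a tested
implementation. The function name utf16_to_ucs2 and the flatten flag are
invented for this example, the input is assumed to be little-endian
UTF-16 bytes with no BOM, and "flatten" is taken to mean "replace with
U+FFFD", which is only one possible reading.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed flatten target)


def utf16_to_ucs2(data, flatten=False):
    """Convert UTF-16-LE bytes (no BOM) to UCS-2-LE bytes.

    With flatten=False, any surrogate code unit raises ValueError.
    With flatten=True, each surrogate pair (or stray surrogate) becomes
    a single U+FFFD.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDFFF:
            if not flatten:
                raise ValueError(
                    "code unit 0x%04X at index %d is not valid UCS-2"
                    % (unit, i))
            # A well-formed pair is a high surrogate followed by a low
            # one; skip the second half so the pair yields one U+FFFD.
            if (unit <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                i += 1
            unit = REPLACEMENT
        out.append(unit)
        i += 1
    return struct.pack("<%dH" % len(out), *out)
```

For text with no surrogates the output is byte-for-byte the input, which
is the UTF-16 == UCS-2 case. Something like
utf16_to_ucs2("a\U0001d11eb".encode("utf-16-le"), flatten=True) should
give the same bytes as "a\ufffdb".encode("utf-16-le").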
Either that, or just accept some small chance for inaccuracy when
passing UTF-16 to code that is expecting a UCS-2 string. (The extra
codes are probably extremely rare, and UCS-2 can't handle them anyway.)
John
=:->