UTF-16 versus UCS2

Andrew Bennetts andrew at canonical.com
Thu May 24 04:20:28 BST 2007


John Arbash Meinel wrote:
[...]
> The specific discussion is that C# (and .NET in general, I believe) uses
> UCS2 as its internal Unicode representation, while python may use 2 or
> 4-bytes per character. I don't know whether that makes the internal
> python representation UCS2 or UTF-16. They were trying to pass data
> between python and C# (through the C-api), and trying to figure out if
> they needed to do anything special (like always go down to UTF-8 first).

To be clear, the CPython ABI can vary from system to system: some builds of
CPython use "UCS2" as the internal representation and others use "UCS4", and
this internal detail leaks into the ABI.  (If you compile against the C API,
the detail is hidden from you by the header macros and pyconfig.h.)

So the problem is that .NET can only use the CPython library through its ABI,
not its API.  From there you need to determine which set of symbols is present
(the PyUnicodeUCS2_* or the PyUnicodeUCS4_* ones) and convert appropriately, or
otherwise find ways to avoid the problematic symbols (nothing else in the
CPython ABI has this problem, just the PyUnicode_* functions).
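
As a rough illustration of "determine which ABIs are present", here is a
minimal sketch in Python using ctypes as a stand-in for whatever symbol lookup
the .NET side uses.  The helper names (detect_unicode_abi, codec_for) and the
choice of FromUnicode as the probe symbol are assumptions made for this
example, not part of CPython.  It also assumes that a "UCS2" build's buffers
can be treated as native-endian UTF-16 code units, and a "UCS4" build's as
native-endian UTF-32.

```python
import ctypes  # only needed for the usage line at the bottom
import sys

# Arbitrary probe symbol for this sketch; any PyUnicode_* function that the
# headers rename per build width would do.
PROBE = "FromUnicode"


def detect_unicode_abi(lib):
    """Return "UCS2", "UCS4", or None, depending on which symbols lib exports.

    lib is any object that raises AttributeError for a missing symbol,
    which is how a ctypes library handle behaves.
    """
    for width in ("UCS2", "UCS4"):
        try:
            getattr(lib, "PyUnicode%s_%s" % (width, PROBE))
        except AttributeError:
            continue
        return width
    return None


def codec_for(abi):
    """Pick a codec name for raw buffers from the given ABI (hypothetical helper).

    Assumes code units are stored in the machine's native byte order.
    """
    base = {"UCS2": "utf-16", "UCS4": "utf-32"}[abi]
    suffix = "le" if sys.byteorder == "little" else "be"
    return "%s-%s" % (base, suffix)


# Usage against the running interpreter's own library would look like:
#     abi = detect_unicode_abi(ctypes.pythonapi)
# A result of None means neither family of symbols was found.
```

Probing for one symbol from each family keeps the check cheap; a caller that
needs several PyUnicode_* functions would build each name from the detected
width in the same way.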

-Andrew.



