[RFC] more encodings tests [was: bzr handles unicode]

Sun Jan 15 11:56:55 GMT 2006

On Fri, 2006-01-13 at 01:14 -0600, John A Meinel wrote:
> Jan Hudec wrote:
> > On Unix, the filenames are NUL terminated octet-streams with ascii meaning of
> > '/' and '.'. Python uses locale setting when you pass in (and expect in
> > return) unicode filenames. There does NOT seem to be a way to tell it
> > otherwise and the system does not need to have any utf-8 locale generated.
> > 
> 
> Which, as I posted elsewhere, has some interesting implications. It turns
> out that räksmörgås can be represented in unicode in more than one way. I
> have heard of this problem, but this is the first time I have seen it. 'ä'
> has an explicit code point, but can also be created by using the 'a' code
> point followed by the 'put two dots above the previous character'
> code point. Decoding 'iso-8859-1' to unicode produces the first form,
> and Mac produces the second one in the filesystem.

You might be interested in reading about the Unicode Normalization Forms:
http://www.unicode.org/reports/tr15/

Arguably, since bzr inventories are stored in XML, they should use
Normalization Form C.
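
To make the two spellings concrete, here is a rough sketch in present-day
Python using the standard unicodedata module. It is not bzr code; the
helper name normalize_filename is made up for illustration, and whether
bzr should normalize at this point is exactly the open question.

```python
import unicodedata

# Precomposed spelling: each accented letter is a single code point
# (what decoding iso-8859-1 bytes gives you).
composed = "r\u00e4ksm\u00f6rg\u00e5s"

# Decomposed spelling: base letter followed by a combining mark
# (what the Mac filesystem hands back).
decomposed = "ra\u0308ksmo\u0308rga\u030as"


def normalize_filename(name):
    """Hypothetical helper: fold a unicode filename to Normalization Form C."""
    return unicodedata.normalize("NFC", name)


# The two strings render the same but do not compare equal as-is.
print(composed == decomposed)
# After normalizing both to NFC they compare equal.
print(normalize_filename(composed) == normalize_filename(decomposed))
```

Normalizing both sides before comparing (or before writing the inventory)
is what would make the two platforms agree on a name.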

Welcome to the happy world of really nasty Unicode problems.
-- 
                                                            -- ddaa