[RFC] more encodings tests [was: bzr handles unicode]

Sun Jan 8 19:59:58 GMT 2006

On Sun, Jan 08, 2006 at 13:27:32 -0600, John Arbash Meinel wrote:
> Jan Hudec wrote:
> The problem is that on windows, the encoding of the terminal is
> frequently something like cp1251, which has no way to represent all
> characters. It is not a unicode compatible charset. So with cp1251,
> there is no way to represent 'Erik Bågfors'. So if you have a file with
> that name (which is valid on a windows filesystem), 'bzr ls' should
> always fail, because it cannot validly represent the names on the
> filesystem on stdout. My feeling is that you are using the exact paths
> to do something, so you should never be given an incorrect path. bzr
> status, on the other hand, should still succeed, just with a munged path
> name. status is just informing the user; ls is used by scripts.
>
> I can certainly force a specific encoding in the tests. (Force the
> artificial stdout to always be utf-8, or do a few tests, such as utf-8,
> cp1251, latin-1, etc, where we know what will succeed and what will fail).

Yes. That is what I mean.
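
A minimal sketch of such a table of known outcomes, written for present-day
Python 3 rather than the Python used in this thread. The helper name
can_encode and the table layout are made up for illustration; this is not
bzr test code.

```python
# Sample strings: the name from this thread and the Czech test sentence.
NAME = 'Erik B\xe5gfors'
CZECH = ('\u017dlu\u0165ou\u010dk\xfd k\u016f\u0148 \xfap\u011bl '
         '\u010f\xe1belsk\xe9 k\xf3dy')


def can_encode(text, encoding):
    """Return True if every character of text exists in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# (text, encoding) -> whether encoding is expected to succeed.
EXPECTED = {
    (NAME, 'utf-8'): True,
    (NAME, 'latin-1'): True,
    (NAME, 'cp1251'): False,       # Cyrillic code page, no a-with-ring
    (CZECH, 'utf-8'): True,
    (CZECH, 'iso-8859-2'): True,
    (CZECH, 'iso-8859-1'): False,  # no Z-with-caron and friends
}

# Collect every case where the real codec disagrees with the table.
mismatches = [key for key, ok in EXPECTED.items()
              if can_encode(*key) != ok]
```

A test would then assert that mismatches is empty, and a real 'bzr ls' test
could write through a stream wrapped in each encoding instead of calling
encode() directly.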

> I thought it would also be nice to test the exact encoding that the user
> is using, so that we know that bzr is working properly (as best it can)
> on that system.

For that to be actually useful, you'd also need a way to conjure a string,
even a nonsensical one, in that encoding. For all iso-8859-* you can just take
a fixed few octets. But that does not work for utf-8 or for Shift-JIS and other
variable-length encodings.
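
To illustrate the point in present-day Python 3: the three octets below are
an arbitrary choice of mine, and decodes is a made-up helper. They decode
fine in the iso-8859-* parts tried here, but are not a valid sequence in
utf-8 or Shift-JIS.

```python
# Three high octets, all assigned in the iso-8859-* parts listed below.
SAMPLE = bytes([0xE0, 0xE1, 0xE9])


def decodes(raw, encoding):
    """Return True if raw is a valid byte sequence in encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


SINGLE_BYTE = ['iso-8859-1', 'iso-8859-2', 'iso-8859-5', 'iso-8859-15']
MULTI_BYTE = ['utf-8', 'shift_jis']

# Every single-byte charset accepts the octets as they are ...
single_ok = all(decodes(SAMPLE, enc) for enc in SINGLE_BYTE)
# ... while the variable-length encodings reject them.
multi_ok = any(decodes(SAMPLE, enc) for enc in MULTI_BYTE)
```

So for variable-length encodings the sample would have to be built the other
way round, by encoding known text rather than picking raw octets.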

> > Btw, here is a sentence in Czech; should encode in iso-8859-2:
> > u'\u017dlu\u0165ou\u010dk\xfd k\u016f\u0148 \xfap\u011bl \u010f\xe1belsk\xe9 k\xf3dy'
> > ('Žluťoučký kůň úpěl ďábelské kódy')
> > 
> 
> Thanks, can you provide a short translation? It is nice to have a
> phrase which won't decode into iso-8859-1. (Though Wouter van Heyst sent
> me some Kanji which should do that as well)

It's what is usually used for showing how fonts look, because it contains
most accented characters, i.e. in places where English speakers use 'The quick
brown fox jumps over the lazy dog'. The literal translation of the Czech
version would be something like 'Yellow horse groaned devilish codes'.
Originally the last word was 'ódy' (odes). The 'k' was added as a pun when
using the sentence to check whether the encoding is set properly.

> I'll have to rethink how I do my tests, to make it easier to check
> multiple encodings.
> 
> Also, what filesystem on what platform do we have which won't support
> unicode encoding? (I suppose there is some way on linux to have a
> filesystem which is not utf-8. But how would python know what encoding
> it was in?)

On Unix, filenames are NUL-terminated octet strings in which only '/' and
'.' have their ASCII meaning. Python uses the locale setting when you pass in
(and expect in return) unicode filenames. There does NOT seem to be a way to
tell it otherwise, and the system need not have any utf-8 locale generated.
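
For comparison, a sketch of how the octet view and the decoded view of the
same directory relate in present-day Python 3, which is an assumption on my
part and not the Python discussed above. The function name
list_raw_and_decoded is made up; os.fsencode, os.fsdecode and
sys.getfilesystemencoding are standard library calls.

```python
import os
import sys
import tempfile


def list_raw_and_decoded(path):
    """Pair each raw (bytes) filename with its decoded str form."""
    # Passing bytes to os.listdir returns the names as bytes.
    raw_names = sorted(os.listdir(os.fsencode(path)))
    # os.fsdecode applies the encoding Python chose for filenames.
    return [(raw, os.fsdecode(raw)) for raw in raw_names]


# The encoding Python uses to convert between str and bytes filenames.
fs_encoding = sys.getfilesystemencoding()

with tempfile.TemporaryDirectory() as tmp:
    # An ASCII-only name keeps the demonstration locale-independent.
    open(os.path.join(tmp, 'plain.txt'), 'w').close()
    pairs = list_raw_and_decoded(tmp)
```

A test that wants to exercise non-ASCII names would first check fs_encoding
to see whether the name can be represented at all on that system.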

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>