[RFC] more encodings tests [was: bzr handles unicode]

Sun Jan 8 19:27:32 GMT 2006

Jan Hudec wrote:
> On Sun, Jan 08, 2006 at 09:10:26 -0600, John Arbash Meinel wrote:
> 
>>Alexander Belchenko wrote:
>>
>>>John Arbash Meinel wrote:
>>>
>>>
>>>>>Sure. Here is list of not 'OK' tests in blackbox (for r1539):
>>>>
>>>>
>>>>Wow, it looks like they all fail. Can you give me a single traceback, so
>>>>I can figure out where it is failing?
>>>
>>>
>>>Here zip archive with test.log when I run:
>>>
>>>python bzr --no-plugins selftest blackbox -v > test.log
>>>
>>>-- 
>>>Alexander
>>
>>Well, it looks like the ones that fail are the ones which expect your
>>bzrlib.user_encoding to be able to handle european characters, which we
>>already know it won't (since you can't handle Erik's name.)
>>
>>I'm trying to figure out what the best solution is.
>>I could try a few character sets (right now I have Swedish, Arabic,
>>Kanji, and Russian).
>>And do a couple different tests to evaluate what the current encoding is
>>able to handle, and then just use those characters in the rest of the test.
>>
>>Does that seem like it is still a valid test? On platforms which support
>>more (like a utf-8 platform), it could try to use all of the different
>>character sets.
> 
> 
> I think they should always use all of the different character sets. Python
> should always support the recoding, so it should be possible to force
> user_encoding to the respective encodings the test samples decode to. The
> only problem would be filesystem encoding. On windows it is fixed to mbcs, so
> test with that. On unix if you make sure the base path is ascii-only, you
> could probably force it to any ascii-compatible encoding (which should be all
> of them except utf-16).

The problem is that on windows, the encoding of the terminal is
frequently something like cp1251, which cannot represent every unicode
character. In particular, cp1251 has no way to represent 'Erik Bågfors'.
So if you have a file with that name (which is valid on a windows
filesystem), 'bzr ls' should always fail, because it cannot validly
represent the names on the filesystem on stdout. My feeling is that if
you are using the exact paths to do something, you should never be given
an incorrect path. 'bzr status', on the other hand, should still succeed,
just with a munged path name: status is just informing the user, while ls
is used by scripts.
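
A minimal sketch of that distinction, written for present-day Python 3
rather than taken from the bzr code base. The names `render_for_ls` and
`render_for_status` are hypothetical, not bzrlib functions; the only
point is strict encoding versus encoding with replacement.

```python
def render_for_ls(path, encoding):
    """Encode a path exactly; raise UnicodeEncodeError if that is impossible.

    Hypothetical helper: output meant for scripts must never be altered.
    """
    return path.encode(encoding, "strict")


def render_for_status(path, encoding):
    """Encode a path for display; unencodable characters become '?'.

    Hypothetical helper: output meant for a human may be munged.
    """
    return path.encode(encoding, "replace")


name = "Erik B\xe5gfors"  # 'Erik Bågfors'

# Strict encoding to cp1251 fails, because cp1251 has no 'å'.
try:
    render_for_ls(name, "cp1251")
    ls_failed = False
except UnicodeEncodeError:
    ls_failed = True

# The display variant succeeds, with the offending character replaced.
munged = render_for_status(name, "cp1251")
```

With these helpers, a script consuming the strict output either gets the
real path or an error, never a silently altered name.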

I can certainly force a specific encoding in the tests. (Force the
artificial stdout to always be utf-8, or do a few tests, such as utf-8,
cp1251, latin-1, etc, where we know what will succeed and what will fail).
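
One way such a fixed set of tests could be laid out, as a sketch in
present-day Python 3 and not the actual bzr test suite. `SAMPLES`,
`ENCODINGS`, `can_encode` and `build_matrix` are names invented for this
illustration, and the Russian sample is an arbitrary word.

```python
# Sample strings keyed by an informal label (all hypothetical test data,
# except the Czech phrase, which is the one quoted in this thread).
SAMPLES = {
    "swedish": "Erik B\xe5gfors",
    "czech": ("\u017dlu\u0165ou\u010dk\xfd k\u016f\u0148 \xfap\u011bl "
              "\u010f\xe1belsk\xe9 k\xf3dy"),
    "russian": "\u043f\u0438\u0448\u0435\u0442",
}

ENCODINGS = ["utf-8", "cp1251", "latin-1", "iso-8859-2"]


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def build_matrix(samples, encodings):
    """Map (sample label, encoding) to whether encoding succeeds."""
    return {
        (label, enc): can_encode(text, enc)
        for label, text in samples.items()
        for enc in encodings
    }


matrix = build_matrix(SAMPLES, ENCODINGS)
```

A test could then compare `matrix` against a hand-written table of
expected successes and failures, so the result does not depend on the
machine running it.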

I thought it would also be nice to test the exact encoding that the user
is using, so that we know that bzr is working properly (as best it can)
on that system.
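
A rough sketch of probing the user's own encoding, again in present-day
Python 3. It uses `locale.getpreferredencoding` as a stand-in for
`bzrlib.user_encoding`, which is an assumption; `supported_samples` and
`SAMPLES` are hypothetical names.

```python
import locale

# Hypothetical sample strings, one per character set under test.
SAMPLES = {
    "swedish": "Erik B\xe5gfors",
    "czech": "\u017dlu\u0165ou\u010dk\xfd k\u016f\u0148",
    "russian": "\u043f\u0438\u0448\u0435\u0442",
}


def supported_samples(encoding, samples=SAMPLES):
    """Return only the samples that the given encoding can represent."""
    supported = {}
    for label, text in samples.items():
        try:
            text.encode(encoding)
        except (UnicodeEncodeError, LookupError):
            # Either a character is missing or the codec name is unknown.
            continue
        supported[label] = text
    return supported


# Assumption: the locale's preferred encoding approximates what the user
# actually sees on their terminal.
user_encoding = locale.getpreferredencoding(False)
usable = supported_samples(user_encoding)
```

The remaining tests would then use only the strings in `usable`, so a
cp1251 terminal exercises Russian names while a utf-8 terminal exercises
all of them.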

> 
> Btw, here is a sentence in Czech; should decode in iso-8859-2:
> u'\u017dlu\u0165ou\u010dk\xfd k\u016f\u0148 \xfap\u011bl \u010f\xe1belsk\xe9 k\xf3dy'
> ('Žluťoučký kůň úpěl ďábelské kódy')
> 

Thanks, can you provide a short translation? It is nice to have a
phrase which won't decode into iso-8859-1. (Though Wouter van Heyst sent
me some Kanji which should do that as well.)

I'll have to rethink how I do my tests, to make it easier to check
multiple encodings.

Also, what filesystem on what platform do we have which won't support
unicode encoding? (I suppose there is some way on linux to have a
filesystem which is not utf-8. But how would python know what encoding
it was in?)
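
For reference, a small sketch of asking Python what it assumes, using
present-day Python 3. `sys.getfilesystemencoding` is a real standard
library call; as far as I know, on unix its value comes from the
process's locale settings rather than from the filesystem itself, so it
is a guess and not a guarantee. `can_store` is a hypothetical helper.

```python
import codecs
import sys


def filesystem_encoding_name():
    """Return the normalised name of the encoding Python uses for paths."""
    return codecs.lookup(sys.getfilesystemencoding()).name


def can_store(filename, encoding):
    """Return True if the filename is representable in the encoding.

    Hypothetical helper: it checks the encoding only, not the real disk.
    """
    try:
        filename.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


fs_encoding = filesystem_encoding_name()
storable = can_store("Erik B\xe5gfors", fs_encoding)
```

A test could skip the non-ascii filename cases whenever `storable` is
false, instead of failing on a platform that cannot hold the name.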

John
=:->
