[RFC] more encodings tests [was: bzr handles unicode]

John A Meinel john at arbash-meinel.com
Fri Jan 13 07:14:53 GMT 2006


Jan Hudec wrote:
> On Sun, Jan 08, 2006 at 13:27:32 -0600, John Arbash Meinel wrote:
>> Jan Hudec wrote:
>> The problem is that on windows, the encoding of the terminal is
>> frequently something like cp1251, which has no way to represent all
>> characters. It is not a unicode compatible charset. So with cp1251,
>> there is no way to represent 'Erik Bågfors'. So if you have a file with
>> that name (which is valid on a windows filesystem), 'bzr ls' should
>> always fail, because it cannot validly represent the names on the
>> filesystem on stdout. My feeling is that you are using the exact paths
>> to do something, so you should never be given an incorrect path. bzr
>> status, on the other hand, should still succeed, just with a munged
>> path name: status is just informing the user; ls is used by scripts.
>>
>> I can certainly force a specific encoding in the tests. (Force the
>> artificial stdout to always be utf-8, or do a few tests, such as utf-8,
>> cp1251, latin-1, etc, where we know what will succeed and what will fail).
> 
> Yes. That is what I mean.

Well, I've done that. Now I'm using the same adapter pattern that Robert
set up for the transport tests.
In the process I'm finding some oddities in Unicode (since I'm now
developing on a Mac for a little while).

The only thing that might get us into trouble is a filesystem that will
not encode the test filenames properly.
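A minimal sketch of the failure mode discussed above (illustrative only, not bzr code): strictly encoding 'Erik Bågfors' for a cp1251 terminal must fail, while a status-style munged fallback still succeeds.

```python
# -*- coding: utf-8 -*-
# Sketch: why 'ls'-style output must fail on a cp1251 terminal,
# and what a 'status'-style munged fallback looks like.
name = u'Erik B\xe5gfors'  # 'Erik Bågfors'

# Strict encoding fails: cp1251 (Cyrillic) has no code point for 'å'.
try:
    name.encode('cp1251')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A 'status'-style command can munge the name instead of failing.
munged = name.encode('cp1251', 'replace').decode('cp1251')

print(strict_ok)  # False
print(munged)     # Erik B?gfors
```

The same distinction falls out naturally: scripts reading 'ls' get an exception rather than a silently wrong path, while 'status' degrades gracefully for human eyes.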

> 
>> I thought it would also be nice to test the exact encoding that the user
>> is using, so that we know that bzr is working properly (as best it can)
>> on that system.
> 
> For that to be actually useful, you'd also need a way to conjure a string,
> even nonsensical, in that encoding. For all iso-8859-* you can just take
> a few fixed octets. But that does not work for utf-8 or for Shift-JIS and
> other variable-length encodings.

That's why I've been collecting strings. But you are completely right.
I'm thinking the best thing to do is to have a test which just checks
whether the user's encoding is in our list of tested encodings, and
fails if it isn't.
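Such a check might look like the following sketch (the list of encodings and the function name are illustrative, not bzr's actual set; codec names are canonicalized so that aliases like 'Latin-1' match):

```python
# Sketch: fail early if the user's encoding has no test strings.
import codecs
import locale

# Illustrative list only; the real suite would list whatever it covers.
TESTED_ENCODINGS = ['utf-8', 'iso8859-1', 'iso8859-2', 'cp1251', 'ascii']


def encoding_is_tested(name):
    """Canonicalize a codec name and check it against the tested set."""
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        return False  # unknown codec certainly is not tested
    return canonical in [codecs.lookup(e).name for e in TESTED_ENCODINGS]


user_encoding = locale.getpreferredencoding()
print(encoding_is_tested(user_encoding))
```

Canonicalizing through `codecs.lookup()` matters because users see names like 'Latin-1' or 'UTF8' that alias the same codec.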

> 
>>> Btw, here is a sentence in Czech; it should decode in iso-8859-2:
>>> u'\u017dlu\u0165ou\u010dk\xfd k\u016f\u0148 \xfap\u011bl \u010f\xe1belsk\xe9 k\xf3dy'
>>> ('Žluťoučký kůň úpěl ďábelské kódy')
>>>
>> Thanks, can you provide a short translation? It is nice to have a
>> phrase which won't decode into iso-8859-1. (Though Wouter van Heyst sent
>> me some Kanji which should do that as well)
> 
> It's what is usually used for showing how fonts look, because it contains
> most accented characters, i.e. it is used where English speakers use 'The
> quick brown fox jumps over the lazy dog'. A literal translation of the
> Czech version would be something like 'A yellowish horse groaned devilish
> codes'. Originally the last word was 'ódy' (odes); the 'k' was added as a
> pun when using the sentence to check whether one's encoding is set properly.
> 
>> I'll have to rethink how I do my tests, to make it easier to check
>> multiple encodings.
>>
>> Also, what filesystem on what platform do we have which won't support
>> unicode encoding. (I suppose there is some way on linux to have a
>> filesystem which is not utf-8. But how would python know what encoding
>> it was in?)
> 
> On Unix, filenames are plain octet strings (anything but NUL), with only
> '/' and '.' having a fixed ASCII meaning. Python uses the locale setting
> when you pass in (and expect in return) unicode filenames. There does NOT
> seem to be a way to tell it otherwise, and the system need not have any
> utf-8 locale generated.
> 

Which, as I posted elsewhere, has some interesting implications. It turns
out that 'räksmörgås' can be mapped to Unicode in more than one way. I had
heard of this problem, but this is the first time I have seen it. 'ä' has
a precomposed code point (U+00E4), but it can also be written as the 'a'
code point followed by the combining 'put two dots above the previous
character' code point (U+0308). Decoding 'iso-8859-1' to Unicode produces
the first form, while the Mac filesystem returns the second.
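The two forms can be shown with the standard `unicodedata` module (a sketch of the phenomenon, not what bzr does): the precomposed and decomposed spellings compare unequal as code-point sequences, but normalizing one into the other recovers equality.

```python
# -*- coding: utf-8 -*-
# Sketch: precomposed (NFC) vs. decomposed (NFD) forms of 'räksmörgås'.
import unicodedata

word = u'r\xe4ksm\xf6rg\xe5s'  # 'räksmörgås', precomposed code points

# Decomposed form: base letters followed by combining marks, which is
# the form the Mac filesystem hands back for this name.
decomposed = unicodedata.normalize('NFD', word)

print(word == decomposed)          # False: different code point sequences
print(len(word), len(decomposed))  # 10 13: three combining marks added
print(unicodedata.normalize('NFC', decomposed) == word)  # True
```

So any code that compares filenames across platforms has to normalize first, or two "identical" names will never match.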

Anyway, I'm getting more complete coverage, and at least uncovering
potential problems.
John
=:->
