test_nonascii: two unicode a's

John Arbash Meinel john at arbash-meinel.com
Sun Jul 2 15:26:09 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> Alexander Belchenko wrote:
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File
>> "E:\work\MyCode\bzr\devel\jam.win32\bzrlib\tests\test_nonascii.py",
>> line 80, in test_platform
>>     self.assertEqual(expected, present)
>> AssertionError: [u'\xe4', u'\xe5', u'\u017d'] != [u'\xe4', u'\u017d']
> 
> 
> I'm very surprised that another test passes:
> test_nonascii.UnicodeFilename.test_access
> 
> I ran selftest with --keep-output to look into the test directories. And
> this is the *funniest* thing ever! There are actually 2 files in the
> directory instead of 3. And test_access is *passing*. Yes-yes-yes: it passes.
> 
> So I ran the Python interpreter in this directory and saw this:
> 
>>>> import os
>>>> os.getcwd()
> 'E:\\work\\MyCode\\bzr\\devel\\jam.win32\\test0000.tmp\\test_nonascii.UnicodeFilename.test_access'
> 
>>>> os.listdir('.')
> ['a', 'Z']
>>>> os.listdir(u'.')
> [u'\xe4', u'\u017d']
>>>> file(u'\xe4').read()
> 'contents of \xc3\xa4\n'
>>>> file(u'\xe5').read()
> 'contents of \xc3\xa4\n'

So both of them have lost their extended characteristics. I'm *really*
surprised that \xe5 mapped to \xe4; I would have guessed it would have
mapped to just plain 'a'.

Well the 2 a's are:
ä
and
å
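
To make the difference concrete, here is a small sketch that prints the
code point, Unicode name and UTF-8 bytes of each test character. It is
written in Python 3 syntax (my assumption; the session above is Python 2),
and describe() is just an illustrative helper name. It also decodes the
bytes that both read() calls returned above, which suggests that both
calls opened the file created for \xe4.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes) for one character."""
    return ('U+%04X' % ord(ch), unicodedata.name(ch), ch.encode('utf-8'))

# The three filenames the test expects to find.
for ch in (u'\xe4', u'\xe5', u'\u017d'):
    print(describe(ch))

# Bytes that both file(u'\xe4').read() and file(u'\xe5').read() returned above.
content = b'contents of \xc3\xa4\n'
decoded = content.decode('utf-8')
print(repr(decoded))
```

If the \xe5 file had really been opened, the content would presumably have
ended in \xc3\xa5 rather than \xc3\xa4.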

Can you open Windows Explorer and just copy and paste those characters
into a filename?

> 
> It looks like Windows or Python converts unicode filenames to plain
> string form and then back. I don't understand how or why. I have never
> seen this before.
> 
> One note: neither of these unicode a's (with dots and with a circle) can
> be represented in the Russian character set. So maybe this is the root of
> the problem?
> 
> Could this problem be related to the other failed unicode_paths tests?

Yes, I believe it is. Can you give me the output of:
python -c "import sys; print(sys.getfilesystemencoding())"

On my win32 machine this gives me 'mbcs', which is Python's name for the
current Windows ANSI code page (for example cp1252 on a Western European
system or cp1251 on a Russian one). It is not a Unicode encoding, so
characters outside that code page cannot be represented in a plain
(byte-string) filename.
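
As a rough way to test the "Russian character set" theory, the sketch below
checks whether each test filename can be encoded in cp1251 and cp1252. That
the Russian machine uses cp1251 as its filesystem code page is an
assumption on my part. It also strips accents via NFKD decomposition, which
is only an approximation of whatever "best fit" conversion Windows applies,
not the actual Windows algorithm. The helper names can_encode() and
strip_accents() are made up for illustration, and the code is in Python 3
syntax.

```python
import sys
import unicodedata

def can_encode(name, codec):
    """Return True if every character of name exists in the given codec."""
    try:
        name.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def strip_accents(name):
    """Approximate a lossy conversion by dropping combining marks."""
    decomposed = unicodedata.normalize('NFKD', name)
    return u''.join(c for c in decomposed if not unicodedata.combining(c))

print('filesystem encoding:', sys.getfilesystemencoding())

names = [u'\xe4', u'\xe5', u'\u017d']
for name in names:
    print(repr(name),
          'cp1251:', can_encode(name, 'cp1251'),
          'cp1252:', can_encode(name, 'cp1252'),
          'stripped:', strip_accents(name))

# Names that become identical once the accents are gone.
stripped = [strip_accents(n) for n in names]
collisions = sorted(set(s for s in stripped if stripped.count(s) > 1))
print('collisions:', collisions)
```

Under these assumptions both a's collapse to a plain 'a', which would be
consistent with os.listdir('.') returning ['a', 'Z'] and with only two
files existing on disk. It does not by itself explain why
os.listdir(u'.') reports \xe4.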

> 
> -- 
> Alexander

Also, what version of Windows are you using?

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEp9eBJdeBCYSNAAMRAvoMAJ44nEwgN+QJ10YCWgt2/hcO6bOvUwCgt6iO
fTZyvRTQkFiKvKiPZ5ah+Nk=
=m3ZZ
-----END PGP SIGNATURE-----



