UnicodeEncodeError in add_action_print with non ascii files names
John A Meinel
john at arbash-meinel.com
Sun Feb 5 04:50:59 GMT 2006
Nir Soffer wrote:
>
> On 5 Feb, 2006, at 1:24, John A Meinel wrote:
>
>> The tricky part with filenames is that Mac OSX (which I use)
>> normalizes unicode filenames in an odd way. So we need to be able to
>> re-normalize them internally.
>
> So if foo is a file name, foo may not be equal to
> foo.decode('utf-8').encode('utf-8')?
>
> Using unicode file names seems to work here on 10.3.9:
>
>>>> hebrew_name = '\327\251\327\234\327\225\327\235'.decode('utf-8')
>>>> file(hebrew_name, 'w').write('')
>>>> os.listdir(u'.')
> [u'.DS_Store', u'\u05e9\u05dc\u05d5\u05dd']
>
> Strangely os.path thinks it does not :-)
>
>>>> os.path.supports_unicode_filenames
> False
It isn't that Mac doesn't support unicode filenames, but that it
normalizes them. Probably this doesn't matter for Hebrew characters,
because they don't have combiners. But the European character 'å' has
2 forms: u'a\u030a' and u'\xe5'. The former is 'a + circle' (a plain 'a'
followed by a combining ring), the latter is 'a with circle' (a single
precomposed character).
The issue is that XML states that the latter should be used, while Mac
OS X creates files with the former normalization.
So if you go to Mac and do:
python
>>> import os
>>> open(u'\xe5', 'wb').write('hello')
>>> os.listdir(u'.')
[u'a\u030a']
>>> print open(u'\xe5', 'rb').read()
hello
>>> print open(u'a\u030a', 'rb').read()
hello
Mac will let you access the file with either method, as it treats them
the same.
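
To make the two spellings concrete, here is a minimal sketch using the
standard unicodedata module. It is written for Python 3 (an assumption on
my part; the sessions in this mail are Python 2), and the variable names
are only for illustration. It needs no filesystem, so it behaves the same
on Mac, Linux and Windows.

```python
import unicodedata

composed = "\u00e5"     # 'a with circle': a single code point
decomposed = "a\u030a"  # 'a + circle': 'a' followed by a combining ring

# As plain strings the two spellings are different.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So any code that compares filenames as raw strings sees two names where
Mac sees one file.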
The problem for bzr is that on Linux, you might create the file
'\xe5.txt', and then bzr will record that filename. Then if you check
that project out on Mac, it will create what it thinks is '\xe5.txt',
but when it tries to list the directory, that file has disappeared, and
this unknown 'a\u030a.txt' file has appeared.
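
A rough sketch of that mismatch, without touching a real filesystem.
compare_names() is a hypothetical helper, not bzr code, and the directory
listing is hard-coded to what Mac would return. Python 3 is assumed, and
NFC is just one possible choice of common form.

```python
import unicodedata

def compare_names(recorded, listed, normalize=False):
    """Return (missing, unknown) between recorded and listed filenames.

    Hypothetical helper for illustration only. With normalize=True both
    sides are converted to NFC before they are compared.
    """
    if normalize:
        recorded = {unicodedata.normalize("NFC", n) for n in recorded}
        listed = {unicodedata.normalize("NFC", n) for n in listed}
    else:
        recorded, listed = set(recorded), set(listed)
    return sorted(recorded - listed), sorted(listed - recorded)

recorded = ["\u00e5.txt"]   # name recorded on Linux
listed = ["a\u030a.txt"]    # what a directory listing on Mac returns

# Raw comparison: one file "missing" and one "unknown".
print(compare_names(recorded, listed))

# Comparing in a common normalization form: nothing missing or unknown.
print(compare_names(recorded, listed, normalize=True))
```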
Anyway, right now Mac OS X is the only platform whose filesystem seems to
do this. Windows & Linux leave the normalization alone. That means on
Linux you can have 2 files which *look* like the same filename. Windows
doesn't seem to understand \u030a, and just puts a box for the unknown
character.
We discussed the issue, and decided that it made the most sense to
always normalize filenames internally, and to complain if the user tries
to add a non-normalized filename. (On Mac you can't create one.)
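
Roughly, the check on add could look like the sketch below. This is not
the actual bzr code: check_add() and NonNormalizedName are made-up names,
the choice of NFC as the internal form is an assumption for the example,
and Python 3 is assumed.

```python
import unicodedata

class NonNormalizedName(ValueError):
    """Hypothetical error for a filename that is not in normalized form."""

def check_add(name, form="NFC"):
    """Return name if it is already normalized, otherwise complain.

    Illustration only: the normalization form is an assumption.
    """
    normalized = unicodedata.normalize(form, name)
    if normalized != name:
        raise NonNormalizedName(
            "%r is not normalized, expected %r" % (name, normalized))
    return name

check_add("\u00e5.txt")        # already composed, accepted

try:
    check_add("a\u030a.txt")   # decomposed spelling, rejected
except NonNormalizedName as e:
    print("refusing to add:", e)
```

Rejecting the name instead of silently renaming it keeps the recorded
name and the name on disk identical on platforms that leave
normalization alone.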
>
> I guess that using PyObjC will solve such problems:
>
>>>> from Foundation import *
>>>> NSString.stringWithString_(hebrew_name).fileSystemRepresentation()
> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>
> Although it seems to be the same as Python utf-8 encoding:
>
>>>> hebrew_name.encode('utf-8')
> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>
> There is also Carbon.File, included in the standard library:
>
>>>> from Carbon import File
>>>> File.FSRef(hebrew_name).as_pathname()
> '/Volumes/Home/nir/Desktop/utest/\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>
> BTW, Carbon.File.FSRef().as_pathname() is used by MoinMoin to get the
> real name of files, which solves annoying problems with PageName and
> pagename: both seem to exist according to os.path.exists(), although only
> one of them can exist on an HFS[+] file system.
Well, I believe there is a way to make HFS+ be case sensitive, but they
warn that it may break existing programs.
But yes, normalization is an issue.
By the way, it is nice to have some Hebrew characters. Do you have a
specific meaning for 'שלום'? I've been collecting non-English words, and
I prefer to have a translation with them.
John
=:->
>
>
> Best Regards,
>
> Nir Soffer
>
>