[BUG] Unicode string must be always used with encodings

Andrew Bennetts andrew at canonical.com
Tue Sep 27 02:14:31 BST 2005


On Mon, Sep 26, 2005 at 02:11:23PM -0500, John A Meinel wrote:
> Alexander Belchenko wrote:
> >I suppose that:
> >
> >* control files should always use utf-8,
> >* input from the user should use the input encoding 
> >(sys.stdin.encoding or user_encoding),
> >* output to the user should use the output encoding 
> >(sys.stdout.encoding or user_encoding),
> >* decoding filenames to unicode strings should use user_encoding
> 
> I'm not sure about this last one. For instance, most Linux systems use 
> utf-8 as the encoding. And Windows uses UTF-16 (which python doesn't 
> seem to be able to read).
> 
> I'm not sure about some characters, but I know that I'm not able to read 
> arabic filenames in python (native or cygwin). Now, arabic is extra 
> crazy because of right-to-left vs left-to-right, so there might be 
> better support for other languages. (But IDLE can print arabic 
> characters correctly, and still os.listdir() shows the files as "??????")

os.listdir has no way to know what encoding the filenames have.  It just
returns byte strings (str type), not unicode strings (unicode type).
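A minimal sketch of that behaviour in modern Python (Python 3 semantics, where str is the unicode type; in the Python 2 of this thread, os.listdir on an ordinary str path always returned byte strings): the return type simply follows the type of the path you pass in, because the OS itself only stores bytes.

```python
import os
import tempfile

# Sketch (Python 3 semantics): os.listdir echoes the type of its path
# argument -- unicode in, unicode out; bytes in, bytes out.
d = tempfile.mkdtemp()
open(os.path.join(d, 'hello.txt'), 'w').close()

print(os.listdir(d))               # unicode entries: ['hello.txt']
print(os.listdir(os.fsencode(d)))  # byte-string entries: [b'hello.txt']
```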

If you know of a way to fix this on all platforms, python-dev would be
interested to hear about it ;)

There's nothing preventing a directory, on Linux at least, from holding
files with names in different encodings, so I'm not sure it's easily
solvable at all, and ideally tools like bzr need to be able to cope with
that :(
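To make that concrete, here's a small sketch (assuming a Linux filesystem, where the kernel treats filenames as opaque bytes): one directory legally holds two spellings of 'café' written under different encodings, and no single decoding covers both.

```python
import os
import tempfile

# Sketch (Linux assumption): filenames are opaque bytes to the kernel,
# so names written under different encodings can share one directory.
d = os.fsencode(tempfile.mkdtemp())
for raw in (b'caf\xc3\xa9', b'caf\xe9'):   # 'café' as UTF-8 vs Latin-1
    open(os.path.join(d, raw), 'wb').close()

for raw in sorted(os.listdir(d)):
    try:
        print(raw.decode('utf-8'))          # the UTF-8 name decodes fine
    except UnicodeDecodeError:
        print(raw, 'is not valid UTF-8')    # the Latin-1 name does not
```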

-Andrew.

More information about the bazaar mailing list