is utf-8 the standard filename encoding?

Steve Langasek steve.langasek at ubuntu.com
Wed Dec 21 17:42:03 UTC 2011


On Wed, Dec 21, 2011 at 01:51:56PM +1100, Martin Pool wrote:
> We have a question in <https://bugs.launchpad.net/bugs/794353> and
> <http://bugs.python.org/issue13643> about what encoding bzr and Python
> ought to assume for file names if there is no locale configured.

> As a specific example, if you run a Python program from cron, it has
> no locale by default.

Regardless of any other issues, this is either an Ubuntu bug or a very
strange local misconfiguration.  /etc/pam.d/cron is set up to pull in
/etc/default/locale by default, and on an Ubuntu system, barring extreme
measures on the part of the system admin, my understanding is that this
should always define a UTF-8 locale.

It's possible I'm mistaken about the default behavior on Ubuntu Server,
though - someone please correct me if I'm wrong.  Maybe this is another
reason why we need to get the C.UTF-8 locale going everywhere.
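
For anyone who wants to check what a cron job actually sees, here is a minimal Python 3 sketch that reports the locale-related environment variables next to the encodings Python derives from them. The helper name `describe_encodings` is made up for illustration, and the values reported will vary with the Python version and the system configuration.

```python
import locale
import os
import sys


def describe_encodings(environ=None):
    """Collect the locale variables and the encodings Python derived from them.

    `environ` defaults to the real process environment; passing a dict only
    changes which variables are reported, not Python's own settings.
    """
    env = os.environ if environ is None else environ
    return {
        # The variables consulted for the character set, in priority order.
        "locale_vars": {k: env.get(k) for k in ("LC_ALL", "LC_CTYPE", "LANG")},
        # Encoding Python assumes for text data by default.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding Python uses to convert file names between bytes and str.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(key, value)
```

Running the same script once from an interactive shell and once from a crontab entry, then comparing the two outputs, should show whether /etc/default/locale is being picked up.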

On Wed, Dec 21, 2011 at 03:36:50PM +0000, Colin Watson wrote:
> Python's notion of a "file system encoding" is fundamentally
> wrong-headed on Unix.  Far from using UTF-8 names, Unix file systems are
> (perhaps unfortunately) encoding-agnostic.  Unix file names are byte
> sequences with the only forbidden octets being NUL and '/'; there's
> nothing else you can assume.  In practice file names will typically be
> in the locale encoding of the process that created them; Ubuntu has
> defaulted to UTF-8 for all new installations since 5.04, but real-world
> exceptions include people's music collections and source trees that
> either predate the widespread shift of Unix users to UTF-8 or that
> started life on some other operating system.  It is perfectly possible
> and indeed realistic for the same file system to contain files in a
> variety of encodings.

Notwithstanding the above (which indeed also explains why using the locale's
charset value is a poor heuristic for interpreting filenames on the Linux
filesystem), it's my understanding that the GNOME vfs stack has refused for
several years now to work with any filenames that aren't UTF-8.  So desktop
users with non-UTF-8 filenames are going to have a hard time of it.
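
To make the point about byte-sequence file names concrete, here is a small Python 3 sketch. It assumes UTF-8 as the nominal encoding and uses the `surrogateescape` error handler, so that a name which is not valid UTF-8 can still be carried through a `str` and written back unchanged. The helper names are hypothetical; `os.fsdecode()` and `os.fsencode()` apply the same idea on Unix using the process's file system encoding.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw file name happens to be well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def roundtrip(raw: bytes) -> bytes:
    """Decode a raw file name to str and encode it back without loss."""
    # Undecodable bytes are mapped to lone surrogate code points ...
    text = raw.decode("utf-8", errors="surrogateescape")
    # ... and mapped back to the original bytes on the way out.
    return text.encode("utf-8", errors="surrogateescape")


# The same visible name, as created under two different locale encodings.
latin1_name = "caf\u00e9.txt".encode("latin-1")
utf8_name = "caf\u00e9.txt".encode("utf-8")

for name in (latin1_name, utf8_name):
    print(name, is_valid_utf8(name), roundtrip(name) == name)
```

Both names survive the round trip, but only one of them is valid UTF-8, which is the situation a tool that insists on UTF-8 file names cannot handle.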

-- 
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
slangasek at ubuntu.com                                     vorlon at debian.org

