is utf-8 the standard filename encoding?

Wed Dec 21 15:36:50 UTC 2011

On Wed, Dec 21, 2011 at 01:51:56PM +1100, Martin Pool wrote:
> We have a question in <https://bugs.launchpad.net/bugs/794353> and
> <http://bugs.python.org/issue13643> about what encoding bzr and Python
> ought to assume for file names if there is no locale configured.
> 
> As a specific example, if you run a Python program from cron, it has
> no locale by default.  It tries to decode filenames as ascii.  If it
> encounters a non-ascii filename, it will likely crash.  People hit
> this kind of thing a lot with bzr; we have put in a workaround but it
> seems it would be better to fix it in Python.
> 
> My impression is the vast majority of filesystems use utf-8 names, and
> that other Ubuntu software (Nautilus? U1?) assumes this will generally
> be true.  Does Ubuntu have any policy that filenames ought to be in
> UTF-8?

No, because it would in practice be impossible to enforce such a policy.

Python's notion of a "file system encoding" is fundamentally
wrong-headed on Unix.  Far from using UTF-8 names, Unix file systems are
(perhaps unfortunately) encoding-agnostic.  Unix file names are byte
sequences with the only forbidden octets being NUL and '/'; there's
nothing else you can assume.  In practice file names will typically be
in the locale encoding of the process that created them; Ubuntu has
defaulted to UTF-8 for all new installations since 5.04, but real-world
exceptions include people's music collections and source trees that
either predate the widespread shift of Unix users to UTF-8 or that
started life on some other operating system.  It is perfectly possible
and indeed realistic for the same file system to contain files in a
variety of encodings.
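
As a rough illustration of the point above, here is a minimal sketch, assuming Python 3.2 or later on a Unix-like system, of treating file names as bytes rather than text. The helper names (`is_legal_component`, `name_to_text`, `text_to_name`) are hypothetical and chosen only for this example. The `surrogateescape` error handler is one way to carry undecodable bytes through a `str` without losing them; it does not tell you what encoding the name was actually written in.

```python
import os


def is_legal_component(raw: bytes) -> bool:
    """Check the only hard rule: no NUL and no '/' in a name component.

    An empty component is also rejected here, since it cannot name a file.
    """
    return bool(raw) and b"\x00" not in raw and b"/" not in raw


def name_to_text(raw: bytes) -> str:
    """Decode for display or processing without losing any bytes.

    Bytes that are not valid UTF-8 become lone surrogate code points.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_name(text: str) -> bytes:
    """Reverse name_to_text, restoring the original byte sequence."""
    return text.encode("utf-8", errors="surrogateescape")


def list_raw_names(directory: bytes = b".") -> list:
    """Passing bytes to os.listdir returns the names as undecoded bytes."""
    return sorted(os.listdir(directory))


# The same visible name in two encodings: both are legal Unix file names,
# and both could sit side by side in one directory.
latin1_name = "caf\u00e9".encode("latin-1")  # b'caf\xe9', not valid UTF-8
utf8_name = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

for raw in (latin1_name, utf8_name):
    assert is_legal_component(raw)
    assert text_to_name(name_to_text(raw)) == raw
```

Keeping the bytes intact means the program can still open, rename, or delete the file even when it cannot display the name correctly.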

UTF-8 is relatively easy to distinguish heuristically from other
encodings if you have enough text to work with, and in such cases I
think it's reasonable to try UTF-8 first and then fall back to something
else (for example, man-db does this for the contents of manual pages).
It is not clear that this is viable for file names, because the amount
of text involved is small and so ambiguities are more likely, but it
might be worth trying.  However, my feeling is that this is the sort of
decision you have to make application-by-application rather than at the
language level, as the consequences of a mistake will be different.
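
The "try UTF-8 first, then fall back" heuristic described above might look something like the following sketch. The function name `guess_decode` and the choice of Latin-1 as the fallback are assumptions made for illustration, not anything bzr, Python, or man-db is known to do in this form. An application would pick its own fallback and decide how much a wrong guess matters.

```python
def guess_decode(raw: bytes, fallback: str = "latin-1"):
    """Return (text, encoding_used), preferring UTF-8 when the bytes allow it.

    The fallback encoding is an assumption; Latin-1 accepts every byte
    sequence, so this function never raises, but it may guess wrongly.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), fallback


# A Latin-1 name that is not valid UTF-8 takes the fallback path.
print(guess_decode(b"caf\xe9"))

# A UTF-8 name is recognised as such.
print(guess_decode(b"caf\xc3\xa9"))

# The ambiguity with short names: these two bytes are valid UTF-8 (one
# character) and also valid Latin-1 (two characters). The heuristic
# always picks UTF-8, which may not be what the file's creator meant.
ambiguous = "\u00c3\u00a9".encode("latin-1")
print(ambiguous, guess_decode(ambiguous))
```

The last case shows why this is hard to settle once for a whole language: a tool that only displays names can tolerate an occasional wrong guess, while one that stores or re-encodes them may not.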

-- 
Colin Watson                                       [cjwatson at ubuntu.com]