is utf-8 the standard filename encoding?

Thu Dec 22 01:16:01 UTC 2011

On 22 December 2011 04:42, Steve Langasek <steve.langasek at ubuntu.com> wrote:
> On Wed, Dec 21, 2011 at 01:51:56PM +1100, Martin Pool wrote:
>> We have a question in <https://bugs.launchpad.net/bugs/794353> and
>> <http://bugs.python.org/issue13643> about what encoding bzr and Python
>> ought to assume for file names if there is no locale configured.
>
>> As a specific example, if you run a Python program from cron, it has
>> no locale by default.
>
> Regardless of any other issues, this is either an Ubuntu bug or a very
> strange local misconfiguration.  /etc/pam.d/cron is set up to pull in
> /etc/default/locale by default, and on an Ubuntu system, barring extreme
> measures on the part of the system admin, my understanding is that this
> should always define a UTF-8 locale.
> It's possible I'm mistaken about the default behavior on Ubuntu Server,
> though - someone please correct me if I'm wrong.

On my laptop, that does indeed work, which is good.  But on a couple
of Canonical servers I looked at, there is no /etc/default/locale.  So
perhaps the people hitting this kind of problem are either on old
Ubuntu releases (if there was ever a time without the pam hook), or
have explicitly specified plain C, or have a weird setup, or are on a
different operating system.
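
One way to see which case a given machine falls into is to run a small
script both interactively and from cron and compare the results.  This
is only a sketch for Python 3; the function name describe_encodings is
made up for illustration, and it reports what Python assumes rather
than anything bzr-specific.

```python
import locale
import os
import sys


def describe_encodings():
    """Return the locale-related settings this process sees, as strings."""
    return {
        # Environment variables that normally select the locale.
        "LANG": os.environ.get("LANG", "(unset)"),
        "LC_ALL": os.environ.get("LC_ALL", "(unset)"),
        # Encoding Python uses to convert file names to and from str.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding Python uses by default for text data.
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key}: {value}")
```

Redirecting the output to a file from a crontab entry and comparing it
with a run from a login shell should show whether cron is picking up a
locale at all.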

> Maybe this is another reason why we need to get the C.UTF-8 locale going everywhere.

I think that would be good: it would fix the problem fairly well,
without ruling out people using different encodings if they want.

(Other Unices use iso-8859-1 for their default POSIX/C locale, so
perhaps it is not out of reach that Ubuntu C could eventually be
UTF-8, but probably that would be too hard to change now.)

> On Wed, Dec 21, 2011 at 03:36:50PM +0000, Colin Watson wrote:

>> In practice file names will typically be
>> in the locale encoding of the process that created them; Ubuntu has
>> defaulted to UTF-8 for all new installations since 5.04, but real-world
>> exceptions include people's music collections and source trees that
>> either predate the widespread shift of Unix users to UTF-8 or that
>> started life on some other operating system.  It is perfectly possible
>> and indeed realistic for the same file system to contain files in a
>> variety of encodings.

I know.  Most programs I've worked on that deal with files have
eventually had a bug report about names that can't be decoded in the
current locale - arguably a user misconfiguration, but still a
waste of everyone's time, and it would be nice to eventually get away
from it.
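
For what it's worth, here is a rough sketch of how a Python 3 program
can carry such names around without failing, using the surrogateescape
error handler from PEP 383.  The helper names are hypothetical and
UTF-8 is just an assumed default; this is not what bzr currently does.

```python
def decode_name(raw, encoding="utf-8"):
    """Decode a file name from bytes; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_name(name, encoding="utf-8"):
    """Reverse decode_name, restoring the original bytes exactly."""
    return name.encode(encoding, "surrogateescape")


def printable(name):
    """Return a lossy copy that is safe to print; escaped bytes become '?'."""
    return name.encode("utf-8", "replace").decode("utf-8")


if __name__ == "__main__":
    raw = b"caf\xe9.txt"  # a Latin-1 name on an otherwise UTF-8 system
    name = decode_name(raw)
    assert encode_name(name) == raw  # the bytes survive the round trip
    print(printable(name))
```

The point is that the name round-trips to the same bytes even though it
is not valid in the assumed encoding, so the file can still be opened
or renamed; only the display form is lossy.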

-- 
Martin