default fsenc patch

Mon Jan 23 22:11:59 UTC 2012

On Tue, Jan 24, 2012 at 07:17:31AM +1100, Martin Pool wrote:
> Hi,
> 
> Around the end of 2011 (which seems like a long time ago) there was
> some discussion in the bzr and python bug trackers about how to handle
> filename encoding when the locale is not set or is null.  I think this
> ended up with bzr taking the patch to set the fsenc, but Python
> rejecting it.
> 
> I wonder if we should actually follow the course of upstream Python
> and just insist the user set eg LANG=C.UTF-8 if they have UTF-8 names,
> and just focus on giving clear messages when names can't be decoded.
> 
I would advise against this but since I'm not sure of the current
implementation (not sure what fsenc does, for instance) it may be that your
proposal is no worse than the current status.

On Linux, filenames are strings of bytes.  Those bytes are interpreted
according to the user's locale to produce displayable characters.  If the
user changes their locale, the characters are displayed using the new
encoding.  However, despite being displayable as characters, the filenames
remain strings of bytes.  This means that an entirely valid filename may
consist of bytes that are meant to be interpreted in different encodings
(or even bytes that are not intended to be interpreted as characters at
all).
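
As a rough illustration of that point (a Python sketch of my own, not
anything taken from bzr; try_decode and the sample filename are made up for
the example), the same bytes read as different characters, or as nothing
valid at all, depending on the encoding used to interpret them:

```python
def try_decode(raw, encoding):
    """Return raw decoded with encoding, or None if the bytes are invalid in it."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical filename, written to disk by someone using a Latin-1 locale.
raw_name = b"caf\xe9.txt"

# The bytes never change; only the interpretation does.
for encoding in ("latin-1", "koi8-r", "utf-8"):
    print(encoding, "->", try_decode(raw_name, encoding))
```

The utf-8 case comes back as None because the byte 0xe9 followed by "." is
not a valid UTF-8 sequence, yet the filename itself is perfectly legal on
disk.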

Usually a user will have one locale setting and not change it.  This way,
they are able to see all of the files that they have created interpreted
using the character set that they wanted.  However, Linux is not
a single-user operating system, and users are not the only ones who lay
down files on disk.  Files shipped by the operating system vendor, files
downloaded from the internet, and files passed between people on removable
storage may all have filenames that were written using a different encoding.
So software that deals with filenames must have some strategy for dealing
with files in different encodings.

There are several ways to deal with this.  You could treat filenames as byte
strings just like the operating system does.  Then you will save the same
identifier (the sequence of bytes) as the OS does.  The user may get
filenames that are garbled in their locale settings, but there won't be
anything actually wrong with the filenames, and you can be sure that all the
data you've brought from the filesystem into your repository can also be
put back down as filenames on disk in another location.
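
A minimal sketch of the byte-string approach in Python, assuming a POSIX
filesystem (such as the usual Linux ones) that accepts arbitrary bytes in
filenames.  snapshot_names and restore_names are hypothetical helpers for
this example and have nothing to do with how bzr stores its data:

```python
import os
import tempfile


def snapshot_names(directory):
    """List a directory's entries as raw bytes, exactly as the OS stores them."""
    # Passing a bytes path makes os.listdir return bytes names.
    return sorted(os.listdir(directory))


def restore_names(names, directory):
    """Create an empty file for each bytes name inside directory."""
    for name in names:
        with open(os.path.join(directory, name), "wb"):
            pass


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    src_b, dst_b = os.fsencode(src), os.fsencode(dst)
    # Two names that would both display as "cafe" with an accent, one written
    # in Latin-1 and one in UTF-8, side by side in the same directory.
    restore_names([b"caf\xe9.txt", b"caf\xc3\xa9.txt"], src_b)
    saved = snapshot_names(src_b)
    restore_names(saved, dst_b)
    round_trip_ok = snapshot_names(dst_b) == saved

print("names survived the round trip:", round_trip_ok)
```

No decoding happens anywhere, so there is nothing that can fail to decode;
the cost is that the names may not display sensibly for every user.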

You could, instead, attempt to translate the sequence of bytes into
characters and save the character representation in your data store.  When
you need to extract the data, you translate the characters into a sequence
of bytes appropriate for that user's locale settings.  There are numerous
corner cases with this strategy, however.  Some may be reasonable to discard
(for instance, a working tree in a repository may not be sharable between
users who have different locale settings) but others may not (a user with
a locale setting that does not cope with the entirety of unicode may not be
able to operate on a repository that contains filename characters outside of
their chosen locale).
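
To make the second corner case concrete, here is a small Python sketch
(to_disk_name and the sample names are invented for the example, and the
encodings stand in for whatever a user's locale happens to be).  A name
stored as characters turns into different bytes for different users, and
for some users it cannot be written out at all:

```python
def to_disk_name(stored_name, locale_encoding):
    """Encode a stored character name for one user's locale, or None if impossible."""
    try:
        return stored_name.encode(locale_encoding)
    except UnicodeEncodeError:
        return None


cafe = "caf\u00e9.txt"                # e with an acute accent
nihongo = "\u65e5\u672c\u8a9e.txt"    # three kanji

# Same stored characters, different bytes on disk per locale.
print(to_disk_name(cafe, "utf-8"))
print(to_disk_name(cafe, "latin-1"))

# A Latin-1 user has no way to represent this name; an EUC-JP user does.
print(to_disk_name(nihongo, "latin-1"))
print(to_disk_name(nihongo, "euc-jp"))
```

What to do when the result is None (refuse the checkout, escape the name,
something else) is exactly the policy decision that this strategy forces.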

To those of us who use a UTF-8 locale, it may seem entirely reasonable
to simply ask that those people switch to UTF-8.  However, in countries
where non-UTF-8 character sets are the norm, people are unlikely to want to
go through the trouble of A) switching charsets just to use bzr or
B) figuring out how to rename all of their files from their native charset
to UTF-8 (which is less efficient for their languages) and then keeping them
that way as they get new files from friends, colleagues, and the internet.

All that said, I am a UTF-8 user so I'm unlikely to be affected by your
decision one way or the other.  So I'll just send this note about the
drawbacks (which may already be partially present in bzr's current
implementation: fsenc sounds like you're already doing translation and
have just hardcoded a fallback to UTF-8) and try not to be drawn into
feeling strongly about which way you leap :-)

-Toshio