user_encoding fix

John A Meinel john at arbash-meinel.com
Mon Feb 20 18:17:05 GMT 2006


Nir Soffer wrote:
> 
> On 20/02/2006, at 18:04, John A Meinel wrote:
> 
>> Users need to set LANG on Mac OSX anyway. Otherwise 'ls' and friends
>> won't do the right thing. I came across that before I had problems with
>> bzr. (I've never used bzr to control unicode filenames in a real
>> application, but I have created unicode filenames for personal stuff).
> 
> Requiring users to set LANG is a problem; bzr should simply work out of
> the box for the common case. But if it really works, maybe it's a better
> solution.
> 
> I tested LANG on 10.3 (same result in x11 terminal and Terminal.app):
> 
>     $ LANG=en_US.UTF-8 python -c 'import locale; print locale.getpreferredencoding()'
>     mac-roman
> 
> So it does not help to get the input encoding.
> 
>     $ LANG=en_US.UTF-8 python -c 'import sys; print sys.stdout.encoding'
>     UTF-8
> 
> Works for the output encoding, so the special checks for darwin can be
> eliminated.
> 
> 
> 
> Best Regards,
> 
> Nir Soffer
> 

We already have a workaround for the fact that
locale.getpreferredencoding() doesn't return the right value on darwin.
Specifically we have:

if sys.platform == 'darwin':
    # work around egregious python 2.4 bug
    sys.platform = 'posix'
    import locale
    sys.platform = 'darwin'
else:
    import locale

This basically forces python to honor the LANG setting, even on Mac. On
other platforms, python already does the right thing. On Mac the
encoding is (incorrectly?) hardcoded to Mac-Roman because of legacy issues.
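
As a rough sketch only (not what bzr ships), the same trick can be
wrapped so that sys.platform is restored even if the import raises. The
names pretend_platform and import_locale_as_posix are made up for
illustration, and the trick only has an effect if locale has not been
imported yet in the running interpreter:

```python
import contextlib
import sys


@contextlib.contextmanager
def pretend_platform(name):
    """Temporarily report `name` as sys.platform (hypothetical helper)."""
    original = sys.platform
    sys.platform = name
    try:
        yield
    finally:
        # Always put the real value back, even if the body raised.
        sys.platform = original


def import_locale_as_posix():
    """Import locale while hiding 'darwin', as in the workaround above.

    If locale is already in sys.modules, the import is a no-op and the
    platform check inside locale.py is not re-evaluated.
    """
    if sys.platform == 'darwin':
        with pretend_platform('posix'):
            import locale
    else:
        import locale
    return locale
```

The try/finally is the only real difference from the snippet above: it
keeps sys.platform from being left as 'posix' if something goes wrong
during the import.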

The code in question is in locale.py. For python2.4 we have:
if sys.platform in ('win32', 'darwin', 'mac'):
    # On Win32, this will return the ANSI code page
    # On the Mac, it should return the system encoding;
    # it might return "ascii" instead
    def getpreferredencoding(do_setlocale = True):
        """Return the charset that the user is likely using."""
        import _locale
        return _locale._getdefaultlocale()[1]
else:
    # On Unix, if CODESET is available, use that.
    try:
        CODESET
    except NameError:
        # Fall back to parsing environment variables :-(
        def getpreferredencoding(do_setlocale = True):
            """Return the charset that the user is likely using,
            by looking at environment variables."""
            return getdefaultlocale()[1]
    else:
        def getpreferredencoding(do_setlocale = True):
            """Return the charset that the user is likely using,
            according to the system configuration."""
            if do_setlocale:
                oldloc = setlocale(LC_CTYPE)
                setlocale(LC_CTYPE, "")
                result = nl_langinfo(CODESET)
                setlocale(LC_CTYPE, oldloc)
                return result
            else:
                return nl_langinfo(CODESET)

Which basically says to use the compiled-in default for win32 and
darwin, but to use LANG or LC_CTYPE on all other platforms. At least that
is how I read the above code.
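
To make the "LANG or LC_CTYPE" part concrete, here is a simplified
sketch of how a codeset can be read out of the environment. It is not
the code from locale.py; encoding_from_env is a hypothetical helper, and
it only assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then
LANG) and the language_TERRITORY.codeset@modifier naming convention:

```python
def encoding_from_env(env):
    """Return the codeset named by the locale variables in `env`, or None.

    Simplified illustration: the first non-empty variable among LC_ALL,
    LC_CTYPE and LANG wins, and only its '.codeset' part is examined.
    """
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = env.get(name)
        if not value:
            continue
        # Drop an optional '@modifier' suffix such as '@euro'.
        value = value.split('@', 1)[0]
        if '.' in value:
            return value.split('.', 1)[1]
        # The variable is set but names no codeset (e.g. 'C').
        return None
    return None


print(encoding_from_env({'LANG': 'en_US.UTF-8'}))
print(encoding_from_env({'LANG': 'en_US.UTF-8', 'LC_CTYPE': 'he_IL.ISO8859-8'}))
```

With only LANG set this reports UTF-8; a more specific LC_CTYPE takes
precedence over LANG.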

Specifically, you can test it with this:
LANG=en_US.UTF-8 python -c "import sys; sys.platform = 'posix'; import locale; print locale.getpreferredencoding()"

mac-roman is a really poor encoding, and doesn't match up to what the
terminal can do.
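
A small illustration of the gap, assuming a UTF-8 terminal: mac-roman
can represent accented Latin text but not, say, Japanese or Hebrew
filenames, while UTF-8 handles all of them. can_encode is a hypothetical
helper and the sample strings are arbitrary:

```python
def can_encode(text, encoding):
    """Return True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = [
    'caf\u00e9',             # Latin text with an accented letter
    '\u65e5\u672c\u8a9e',    # Japanese
    '\u05e9\u05dc\u05d5\u05dd',  # Hebrew
]

for text in samples:
    # Report, per sample, which of the two encodings can hold it.
    print(ascii(text),
          'mac-roman:', can_encode(text, 'mac-roman'),
          'utf-8:', can_encode(text, 'utf-8'))
```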

John
=:->
