[MERGE] setlocale (#128496)

Thu Jun 26 04:25:59 BST 2008

Thanks.

bb:comment

To recap my understanding: Some libraries, particularly Subversion,
require a locale to be set via the C setlocale function in order to
support non-ascii filenames.  Python wraps this as locale.setlocale.

I think this is a somewhat questionable requirement, considering that a
library client may use several different encodings in the same session.
 The Windows filesystem / console discrepancy is an obvious example.
Surely it would make more sense to just use unicode in some form, and
let the clients re-encode as needed.

But we don't control these libraries, and setlocale doesn't hurt us.  So
I think it's reasonable to use it.  But it has lots of knock-on effects.
- - time.strftime may emit dates in the user encoding
- - time.strftime will use localized strings

Some drive-by fixes:
- - the short log formatter was generating a date that it didn't use
- - the test suite's handling of unicode issues was improved

> === modified file 'bzrlib/osutils.py'
> --- bzrlib/osutils.py	2008-06-11 03:56:46 +0000
> +++ bzrlib/osutils.py	2008-06-17 08:27:57 +0000
> @@ -672,7 +672,13 @@
>          offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
>      else:
>          offset_str = ''
> -    return (time.strftime(date_fmt, tt) +  offset_str)
> +    date_str = time.strftime(date_fmt, tt)
> +    if not isinstance(date_str, unicode):
> +        try:
> +            date_str = date_str.decode('utf8')
> +        except UnicodeDecodeError:
> +            date_str = date_str.decode(bzrlib.user_encoding, 'replace')
> +    return date_str + offset_str

Functions usually return either unicode strings or bytestrings, not
both.  We know strftime returns bytestrings.  Is there any reason to
believe it ever returns unicode strings?  If not, we should assume
bytestrings.

It seems wrong to try 'utf8' before trying the detected user_encoding.
   I would think the user_encoding is much more likely to be right than
'utf8'.  It is also possible, though rare, for utf-8 decoding to succeed
when the value is encoded using a different encoding.

IOW, I would rather do this:

    date_str = time.strftime(date_fmt, tt)
    return date_str.decode(bzrlib.user_encoding()) + offset_str

Is there a reason we can't?

> @@ -1348,15 +1354,7 @@
>      if _cached_user_encoding is not None and use_cache:
>          return _cached_user_encoding
>  
> -    if sys.platform == 'darwin':
> -        # work around egregious python 2.4 bug
> -        sys.platform = 'posix'
> -        try:
> -            import locale
> -        finally:
> -            sys.platform = 'darwin'
> -    else:
> -        import locale
> +    import locale
>  
>      try:
>          user_encoding = locale.getpreferredencoding()

So if a bzrlib client wants to use bzrlib.osutils.get_user_locale(),
they must do the darwin workaround themselves?  That doesn't seem nice.
 Is there any harm in continuing to do it here?

> === modified file 'bzrlib/tests/__init__.py'
> --- bzrlib/tests/__init__.py	2008-06-11 07:54:54 +0000
> +++ bzrlib/tests/__init__.py	2008-06-21 16:37:
> @@ -3075,7 +3089,7 @@
>  
>      def _probe(self):
>          try:
> -            os.stat(u'\u03b1')
> +            os.stat(u'\u03b1\u283f')
>          except UnicodeEncodeError:
>              return False
>          except (IOError, OSError):
> 

^^^ Why was this change needed?

> === modified file 'bzrlib/tests/blackbox/test_info.py'
> --- bzrlib/tests/blackbox/test_info.py	2007-12-15 18:33:10 +0000
> +++ bzrlib/tests/blackbox/test_info.py	2008-06-16 18:19:47 +0000
> @@ -114,7 +114,7 @@
>          self.assertEqual('', err)
>  
>          out, err = self.run_bzr('info branch --verbose')
> -        self.assertEqualDiff(
> +        self.assertEqualDiff((
>  """Standalone tree (format: weave)
>  Location:
>    branch root: branch
> @@ -154,7 +154,7 @@
>         # having the ui test dependent on the exact overhead of a given store.
>         branch2.repository._revision_store.total_size(
>          branch2.repository.get_transaction())[1] / 1024,
> -       ), out)
> +       )).encode(bzrlib.user_encoding), out)
>          self.assertEqual('', err)
>  
>          # Branch and bind to standalone, needs upgrade to metadir

If you're fixing assertEqualDiff to handle encoding issues, is this
still necessary?

> === modified file 'bzrlib/tests/test_bundle.py'
> --- bzrlib/tests/test_bundle.py	2008-05-08 04:33:38 +0000
> +++ bzrlib/tests/test_bundle.py	2008-06-08 14:24:56 +0000
> @@ -897,6 +897,21 @@
>          builder.build(['newfile'])
>          builder.finish_tree()
>  
> +        if sys.platform == 'darwin':
> +            # work around egregious python 2.4 bug
> +            sys.platform = 'posix'
> +            try:
> +                import locale
> +            finally:
> +                sys.platform = 'darwin'
> +        else:
> +            import locale

Haven't we handled the darwin workaround earlier in the codepath?

Further, this looks like a duplicate of the bzr script version.  Can we
put it in osutils and call it from both places?

> +        try:
> +            # We rely on the formatting string, so we need a fixed locale.
> +            old_lc_time = locale.setlocale(locale.LC_TIME, 'C')
> +        except locale.Error, e:
> +            old_lc_time = None
> +
>          # Asia/Colombo offset = 5 hours 30 minutes
>          self.tree1.commit('non-hour offset timezone', rev_id='tz-1',
>                            timezone=19800, timestamp=1152544886.0)
> @@ -908,6 +923,8 @@
>          self.assertEqual(19800, rev.timezone)
>          self.assertEqual(1152544886.0, rev.timestamp)
>          self.tree1.unlock()
> +        if not old_lc_time is None:
> +            locale.setlocale(locale.LC_TIME, old_lc_time)

unsetting the locale should always be done.  The best ways are with a
TestCase.addCleanup, or using try/finally.

> === modified file 'bzrlib/timestamp.py'
> --- bzrlib/timestamp.py	2008-04-24 07:22:53 +0000
> +++ bzrlib/timestamp.py	2008-06-16 18:19:47 +0000
> @@ -16,6 +16,7 @@
>  
>  import calendar
>  import time
> +import locale
>  
>  from bzrlib import osutils
>  
> @@ -56,10 +57,19 @@
>          offset = 0
>      tt = time.gmtime(t + offset)
>  
> -    return (time.strftime("%a %Y-%m-%d %H:%M:%S", tt)
> -            # Get the high-res seconds, but ignore the 0
> -            + ('%.9f' % (t - int(t)))[1:]
> -            + ' %+03d%02d' % (offset / 3600, (offset / 60) % 60))
> +    try:
> +        old_locale = locale.setlocale(locale.LC_TIME, 'C')
> +    except locale.Error:
> +        old_locale = None
> +
> +    try:
> +        return (time.strftime("%a %Y-%m-%d %H:%M:%S", tt)
> +                # Get the high-res seconds, but ignore the 0
> +                + ('%.9f' % (t - int(t)))[1:]
> +                + ' %+03d%02d' % (offset / 3600, (offset / 60) % 60))
> +    finally:
> +        if old_locale is not None:
> +            locale.setlocale(locale.LC_TIME, old_locale)

Since setlocale isn't threadsafe, we really shouldn't use it in library
code.  Is there no other way?

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIYwxH0F+nu1YWqI0RAoQ3AJ4j8lcbCoSHcAOvGxqJ79tOfDoZ/ACcDnMo
A0q9Z5fH4rqVPS2NZs7LBrM=
=j9OE
-----END PGP SIGNATURE-----