[MERGE] setlocale (#128496)

Thu Jun 26 18:58:24 BST 2008

Hi Aaron!

Thanks for your review!

A question of general interest up front:
do we want the day of week displayed to local users (e.g. for log or 
info display) in English or in their local language?

I'd think local language, as a first step towards localization and 
internationalization of the application. Right now, however, you might 
think it funny to see local date elements in an otherwise English user 
interface.

If you don't want localized day of week names, neither now nor in the 
far future, or if you want to keep changesets small, you might wish to 
instead apply setlocale.mini, a bare-bones version of this patch which I 
will post independently in a moment.

Aaron Bentley wrote:
> To recap my understanding: Some libraries, particularly Subversion,
> require a locale to be set via the C setlocale function in order to
> support non-ascii filenames.  Python wraps this as locale.setlocale.

Correct. More precisely, the LC_CTYPE facet of the current locale is 
used on Linux (and probably most POSIX systems) to control the encoding 
for file names, as the file system stores only byte sequences.

> I think this is a somewhat questionable requirement, considering that a
> library client may use several different encodings in the same session.
>  The Windows filesystem / console discrepancy is an obvious example.
> Surely it would make more sense to just use unicode in some form, and
> let the clients re-encode as needed.
> 
> But we don't control these libraries, and setlocale doesn't hurt us.  So
> I think it's reasonable to use it.  But it has lots of knock-on effects.
> - time.strftime may emit dates in the user encoding
> - time.strftime will use localized strings

Those effects could be avoided by setting only LC_CTYPE, not LC_ALL. 
However, in my opinion bzr should not rely on things like LC_TIME being 
at its un-localized posix default. As I understand it, the bzr library 
is intended to be embeddable. A localized application that does embed 
bzrlib is likely to have called setlocale for LC_ALL. So bzrlib should 
be able to cope with that situation. And the best way to ensure this is 
by setting all facets in the bzr application itself as well.

If you want to play it safe and avoid any effects that setlocale has on 
time formats, you'd have to avoid strftime as well, at least for the day 
of week formatting. Should be easy enough for those formats used by bzr. 
Quoting from the python docu: http://docs.python.org/lib/node745.html

"If, when coding a module for general use, you need a locale independent 
version of an operation that is affected by the locale (such as 
string.lower(), or certain formats used with time.strftime()), you will 
have to find a way to do it without using the standard library routine. 
Even better is convincing yourself that using locale settings is okay. 
Only as a last resort should you document that your module is not 
compatible with non-"C" locale settings."

The attached bundle implements this locale-independent day of week 
handling by default, with an option for local names in format_date. My 
setlocale.mini branch omits that option, always using English names.

>> === modified file 'bzrlib/osutils.py'
>> --- bzrlib/osutils.py	2008-06-11 03:56:46 +0000
>> +++ bzrlib/osutils.py	2008-06-17 08:27:57 +0000
>> @@ -672,7 +672,13 @@
>>          offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
>>      else:
>>          offset_str = ''
>> -    return (time.strftime(date_fmt, tt) +  offset_str)
>> +    date_str = time.strftime(date_fmt, tt)
>> +    if not isinstance(date_str, unicode):
>> +        try:
>> +            date_str = date_str.decode('utf8')
>> +        except UnicodeDecodeError:
>> +            date_str = date_str.decode(bzrlib.user_encoding, 'replace')
>> +    return date_str + offset_str
> 
> Functions usually return either unicode strings or bytestrings, not
> both.  We know strftime returns bytestrings.  Is there any reason to
> believe it ever returns unicode strings?  If not, we should assume
> bytestrings.

The main problem with strftime on the C level is that afaik it returns a 
byte string encoded according to LC_TIME, while the user's default 
encoding is controlled by LC_CTYPE. Those two should agree, but may not. 
I guess the correct way to handle things might be to save the current 
value of LC_CTYPE, set LC_CTYPE to LC_TIME, read the encoding using 
nl_langinfo(CODESET), and restore the value of LC_CTYPE. But I wouldn't 
want to do this.

As this is obviously a real pain, I could imagine a future Python 
version to handle this and always provide a unicode string to the user. 
I just wanted to be prepared for that case. A testcase might handle that 
for the time being, though, so we could remove that functionality for 
now and re-add it later if needed.

> It seems wrong to try 'utf8' before trying the detected user_encoding.
>    I would think the user_encoding is much more likely to be right than
> 'utf8'.  It is also possible, though rare, for utf-8 decoding to succeed
> when the value is encoded using a different encoding.

When coding this I had the case in mind of LC_TIME indicating a 
different encoding from LC_CTYPE. This might e.g. happen if an 
application only sets LC_TIME before embedding bzrlib. Then decoding 
with the user_encoding would fail, but you have a good chance of the 
locale being utf8, so I wanted to give that a try.

I guess you can do it the other way round. I had considered accidential 
successful utf8-decoding of a bytestring not intended as utf8 to be 
unlikely enough to use successful utf8 decoding as an indication that 
something really is utf8-encoded. That's the reason for the order. At 
least for the scenario of LC_TIME set to some utf8 locale and LC_CTYPE 
at its default, the order doesn't really matter, so we could do it in 
that order as well.

> IOW, I would rather do this:
> 
>     date_str = time.strftime(date_fmt, tt)
>     return date_str.decode(bzrlib.user_encoding()) + offset_str
> 
> Is there a reason we can't?

Should work in 99.9% of the cases, and for the rest it's really the 
user's fault, so yes, we could do that.

In the atached patch, I left the check for unicode in place, but dropped 
the utf-8 handling completely.

>> @@ -1348,15 +1354,7 @@
>>      if _cached_user_encoding is not None and use_cache:
>>          return _cached_user_encoding
> 
>> -    if sys.platform == 'darwin':
>> -        # work around egregious python 2.4 bug
>> -        sys.platform = 'posix'
>> -        try:
>> -            import locale
>> -        finally:
>> -            sys.platform = 'darwin'
>> -    else:
>> -        import locale
>> +    import locale
> 
>>      try:
>>          user_encoding = locale.getpreferredencoding()
> 
> So if a bzrlib client wants to use bzrlib.osutils.get_user_locale(),
> they must do the darwin workaround themselves?  That doesn't seem nice.
>  Is there any harm in continuing to do it here?

Good point! I had added this in response to a comment from Martin Pool, 
but hadn't thought of the embedded case.

>> === modified file 'bzrlib/tests/__init__.py'
>> --- bzrlib/tests/__init__.py	2008-06-11 07:54:54 +0000
>> +++ bzrlib/tests/__init__.py	2008-06-21 16:37:
>> @@ -3075,7 +3089,7 @@
> 
>>      def _probe(self):
>>          try:
>> -            os.stat(u'\u03b1')
>> +            os.stat(u'\u03b1\u283f')
>>          except UnicodeEncodeError:
>>              return False
>>          except (IOError, OSError):
> 
> 
> ^^^ Why was this change needed?

Because with my tests in the ja_JP.EUC-JP locale, tests requiring a 
unicode filesystem were not skipped, because the alpha character used 
before can be expressed in EUC-JP as well. Just because one unicode 
character can be used in a file name, that doesn't mean that all can. So 
I added a braille pattern to that greek alpha, as I think it unlikely 
for any non-unicode encoding to contain them both.

This fix is still contained in the patch attached here, but not in my 
setlocale.mini patch. As this fails 
test_msgeditor.MsgEditorTest.test__create_temp_file_with_commit_template_in_unicode_dir 
from the current bzr.dev as well in ja_JP.EUC-JP, I'd resubmit this 
again if you prefer smaller changesets or reject this approach here 
altogether.

>> === modified file 'bzrlib/tests/blackbox/test_info.py'
>> --- bzrlib/tests/blackbox/test_info.py	2007-12-15 18:33:10 +0000
>> +++ bzrlib/tests/blackbox/test_info.py	2008-06-16 18:19:47 +0000
>> @@ -114,7 +114,7 @@
>>          self.assertEqual('', err)
> 
>>          out, err = self.run_bzr('info branch --verbose')
>> -        self.assertEqualDiff(
>> +        self.assertEqualDiff((
>>  """Standalone tree (format: weave)
>>  Location:
>>    branch root: branch
>> @@ -154,7 +154,7 @@
>>         # having the ui test dependent on the exact overhead of a given store.
>>         branch2.repository._revision_store.total_size(
>>          branch2.repository.get_transaction())[1] / 1024,
>> -       ), out)
>> +       )).encode(bzrlib.user_encoding), out)
>>          self.assertEqual('', err)
> 
>>          # Branch and bind to standalone, needs upgrade to metadir
> 
> If you're fixing assertEqualDiff to handle encoding issues, is this
> still necessary?

Yes. The fix to assertEqualDiff only handles the output. What we have 
here is a comparison between a unicode string for the expected value and 
a byte string for the actual value. Those two will compare different 
unless the unicode string gets encoded as well.

I found out I had to do similar adjustments for test_log. I guess I 
should have run the complete test suite one final time after fixing all 
the issues from previous runs, as It seems I had introduced new issues 
along the way. This time I did run it.

>> === modified file 'bzrlib/tests/test_bundle.py'

I realized that bundles should not be localized. I guess I had these 
modifications introduced in a stage where bundles had localized dates, 
and fixed those later on. Therefore I now dropped all this locale code 
in bundle tests and reverted to the previous state.

>> === modified file 'bzrlib/timestamp.py'
>> --- bzrlib/timestamp.py	2008-04-24 07:22:53 +0000
>> +++ bzrlib/timestamp.py	2008-06-16 18:19:47 +0000
>> @@ -16,6 +16,7 @@
> 
>>  import calendar
>>  import time
>> +import locale
> 
>>  from bzrlib import osutils
> 
>> @@ -56,10 +57,19 @@
>>          offset = 0
>>      tt = time.gmtime(t + offset)
> 
>> -    return (time.strftime("%a %Y-%m-%d %H:%M:%S", tt)
>> -            # Get the high-res seconds, but ignore the 0
>> -            + ('%.9f' % (t - int(t)))[1:]
>> -            + ' %+03d%02d' % (offset / 3600, (offset / 60) % 60))
>> +    try:
>> +        old_locale = locale.setlocale(locale.LC_TIME, 'C')
>> +    except locale.Error:
>> +        old_locale = None
>> +
>> +    try:
>> +        return (time.strftime("%a %Y-%m-%d %H:%M:%S", tt)
>> +                # Get the high-res seconds, but ignore the 0
>> +                + ('%.9f' % (t - int(t)))[1:]
>> +                + ' %+03d%02d' % (offset / 3600, (offset / 60) % 60))
>> +    finally:
>> +        if old_locale is not None:
>> +            locale.setlocale(locale.LC_TIME, old_locale)
> 
> Since setlocale isn't threadsafe, we really shouldn't use it in library
> code.  Is there no other way?

According to my quotation from up there, the only feasible way is to not 
use %a for day of week formatting, but do that ourself instead. It's 
easy enough, and done in the attached patch.

By the way, at least my linux glibc here seems to handle setlocale in a 
threadsafe way, from what I can tell from the sources. As there are no 
guarantees, though, I agree that this is ugly.

> Aaron

To sum it up, these are the changes since r3496 of this branch:
* English weekdays by default in format_date, independent of setlocale
* Explicitely requested local weekday names in log and info output
* Python 2.4 workaround for darwin in osutils restored for embedded case
* Dropped unneccessary locale handling code in bundle tests
* Fixed tests for log formats

I also merged current bzr.dev and ran the complete test suite in the 
ja_JP.EUC-JP locale. Got only expected failures and the known bug #225020.

Greetings,
  Martin von Gagern
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: setlocale-3506.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20080626/fe98f22a/attachment-0001.diff 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 260 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080626/fe98f22a/attachment-0001.pgp