l10n approach for bzr

Mon Mar 17 20:30:04 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Hammond wrote:
> Hi all,
>   I notice from the London Sprint notes
> (http://bazaar-vcs.org/SprintLondonMarch08/Brainstorms) that gettext() is
> likely to be used for i18n support.  My question relates to the localization
> efforts that will use this internationalization framework.
> 
> My question can be summarised as: when using gettext(), do we use "English
> strings", or "message ids"?  The long version of this question follows...
> 
> Many projects introduce i18n after the project is already working.  As a
> result, they have a large code base with many string literals sprinkled
> throughout the code.  For example, you may see:
> 
>     ask("Do you want to replace the file on the server?")
> 
> When the time comes to support i18n, the code is often changed to:
> 
>     # _() is the l10n lookup function
>     ask(_("Do you want to replace the file on the server?"))
> 
> As a result, the literal string becomes the message ID.  When someone makes
> a translation into another language, you might see the following in a .po
> file (in the Brazilian translation in this case):
> 
>     msgid "Do you want to replace the file on the server?"
>     msgstr "Você deseja substituir o arquivo no servidor?"
> 
> This works fine, but in my opinion is fragile and may lead to confusion in
> the future.  For example, consider that 6 months after the translation is
> made, a usability expert recommends a change to the wording.  For example,
> the English message should be prefixed with "Are you sure...".
> 
> Once this change request is made, we have 2 options:
> 
> 1) Change the text in the code to the new version.  This will/might work
> fine for English, but has the effect of breaking all other existing
> translations - the "message ID" has now changed.  It is unlikely that all
> translators will be available at the appropriate time so that all
> translations always stay in synch.
> 
> 2) Leave the source code as it is, but change the file containing the
> English "translations".  To my mind, this is pretty horrible, as the source
> code is misleading.  Having the source code contain a long, apparently
> literal string, but having the runtime substitute it for a similar but
> different string is confusing to developers and translators alike.
> 
> This isn't a hypothetical problem.  I'm looking at tortoisebzr, and I see
> the following literal code:
> 
>             result.append((_("BZR Branch..."), _("Branch a bazzar branch"),
>                            self._branch))
> 
> I can see that the literal text 'Branch a bazzar branch' may end up changing
> in the future - 'branch' appearing twice might be considered clumsy, and the
> 'bazzar' may want to be corrected wrt spelling and capitalized ;).  This
> realization may end up being made well after a number of translations have
> already been made.
> 
> My primary question is: does anyone else see this as a problem?  If there is
> a consensus that this can be managed, I'm happy to go-with-the-flow.
> However, my preference would be for a more formalized model - e.g., move to
> message IDs that are obviously message IDs.  For example, I would change the
> above to something like:
> 
>             result.append((_("M_BZR_BRANCH_CMD"), _("M_BZR_BRANCH_DESC"),
>                            self._branch))
> 
> and rely on the English translation to supply the English text.  The
> gettext() framework is likely to need a change so that the English
> translation is explicitly used when no translation is available (as the
> msg_id itself is no longer a suitable alternative), but that should be
> simple (although obviously performance would suffer in that case - 2
> translation lookups instead of 1).  The readability of the code also suffers
> somewhat - the developer can't immediately see what the English string is -
> but something has to give somewhere...
> 
> Any thoughts on this matter?
> 
> Cheers,
> 
> Mark

1) I fully agree that we don't want to leave the strings in the source as
they are and quietly correct the wording in the English translation when
the time comes. So "Branch a bazzar branch" should indeed be fixed
directly in the code.

2) Indirecting through IDs does have an appeal. It certainly makes it
clearer that you are using ids and where the "correct" place to fix them is.

3) However, IDs make the code a bit harder to read.  You need to use
them everywhere, and then you can't tell from the code what a message
actually says. One of the key places we will be doing this sort of thing
is bzrlib/errors.py.

In errors.py, though, I think we might actually want to wait to
translate until just before the exception is displayed. So rather than
doing:

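    class MyError(BzrError):

        # translated once, when the class is defined
        _fmt = i18n("My Error says %(foo)s")

We would instead do:

    class BzrError(...):

        def __str__(self):
            # translate just before display, then fill in the fields
            l10nfmt = i18n(self._fmt)
            return l10nfmt % self.__dict__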
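
A self-contained sketch of that deferred translation, to show the effect.
The i18n() function and the _catalog dict here are hypothetical stand-ins
for the real lookup, the Exception base class and keyword-argument
constructor are assumptions made only so the example runs, and "[xx]"
marks a placeholder translation.

```python
_catalog = {}  # hypothetical: filled in once the user's locale is known


def i18n(message):
    """Stand-in lookup: return the translation, or the message itself."""
    return _catalog.get(message, message)


class BzrError(Exception):

    def __init__(self, **kwds):
        super().__init__()
        # Keep the keyword arguments as attributes so _fmt can refer to them.
        self.__dict__.update(kwds)

    def __str__(self):
        # Translate only when the error is about to be displayed.
        l10nfmt = i18n(self._fmt)
        return l10nfmt % self.__dict__


class MyError(BzrError):

    _fmt = "My Error says %(foo)s"  # plain English, not translated here


err = MyError(foo="bar")
before = str(err)  # no catalog entry yet, so the English format is used
_catalog["My Error says %(foo)s"] = "[xx] %(foo)s"
after = str(err)   # same exception object, now rendered via the catalog
print(before)
print(after)
```

Because the lookup happens in __str__, a catalog installed after the
classes are defined still takes effect. One cost is that the string
extraction tools need some other way to find the _fmt strings, for
example a no-op marker function.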

And the alternative, using IDs:

    class MyError(BzrError):

        _fmt = i18n("M_BZR_MY_ERROR_FMT")

This gets exceptionally tricky if you are changing the number of %s
fields you want to put in the string. You now have to go find the format
in the English catalog, figure out that it has 3, add the 4th, etc.
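
To make the bookkeeping problem concrete, here is a small sketch. The
ENGLISH catalog, its format string, and the helper placeholders() are all
hypothetical and exist only for illustration.

```python
import re

# Hypothetical English catalog for the ID-based scheme.
ENGLISH = {
    "M_BZR_MY_ERROR_FMT": "My Error says %(foo)s, %(bar)s and %(baz)s",
}


def placeholders(fmt):
    """Return the set of %(name)s field names used in a format string."""
    return set(re.findall(r"%\((\w+)\)s", fmt))


fmt_id = "M_BZR_MY_ERROR_FMT"

# The ID in the source says nothing about which fields the message uses.
fields_in_id = placeholders(fmt_id)

# The fields are only visible after going to the catalog.
fields_in_catalog = placeholders(ENGLISH[fmt_id])

# The code now passes a 4th value, but nobody has updated the catalog.
values = {"foo": 1, "bar": 2, "baz": 3, "qux": 4}
never_shown = set(values) - fields_in_catalog

print(fields_in_id)
print(sorted(fields_in_catalog))
print(never_shown)
print(ENGLISH[fmt_id] % values)  # formats fine; "qux" is silently dropped
```

Nothing fails here, which is the problem: the extra value is simply never
displayed until someone edits the format in a different file.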

4) I'm assuming we will generally be using Launchpad to do the
translations, since it is supposed to have a nice web interface, etc.
(And we should be dogfooding the feature. If it is bad, then we can get
it fixed.)

And in that case, I think the .po file ends up looking like:

M_BZR_MY_ERROR_FMT
Yo no entiendo espanol

Versus:

My error is %(foo)s
error mio esta %(foo)s

So, fairly long-winded, but my view is that we should use English strings
for the message ids, and that when the locale is English we shouldn't
need any translation at all. This means that when strings are cleaned
up, they will get new message ids.
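
A short sketch of what that means in practice, using the example from
Mark's mail. The translate() helper and the dict catalog are hypothetical
stand-ins for a real catalog lookup, and the reworded English string is
only an illustration of the "Are you sure..." change.

```python
import gettext

old = "Do you want to replace the file on the server?"
new = "Are you sure you want to replace the file on the server?"

# English locale: no catalog at all, the msgid is the message.
english = gettext.NullTranslations()
print(english.gettext(new))

# A translation made before the wording was cleaned up.
stale_pt_br = {
    old: "Você deseja substituir o arquivo no servidor?",
}


def translate(msgid, catalog):
    """Stand-in lookup: fall back to the msgid when it is not translated."""
    return catalog.get(msgid, msgid)


print(translate(old, stale_pt_br))  # the old msgid still has a translation
print(translate(new, stale_pt_br))  # the new msgid falls back to English
```

So a reworded string shows up in English for other locales until its
translation is redone, rather than showing a translation of the old text.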

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH3tTMJdeBCYSNAAMRApxqAKDRvmLNMC0hOPY9ZOVswGykJduuVACg14nh
MKX5wyvLVyyf8fCWMlhC3hc=
=0iEU
-----END PGP SIGNATURE-----