l10n approach for bzr

Mon Mar 17 20:05:18 GMT 2008

Hi all,
  I notice from the London Sprint notes
(http://bazaar-vcs.org/SprintLondonMarch08/Brainstorms) that gettext() is
likely to be used for i18n support.  My question relates to the localization
efforts that will use this internationalization framework.

My question can be summarised as: when using gettext(), do we use "English
strings" or "message IDs" as the lookup keys?  The long version of this
question follows...

Many projects introduce i18n after the project is already working.  As a
result, they have a large code base with many string literals sprinkled
throughout the code.  For example, you may see:

    ask("Do you want to replace the file on the server?")

When the time comes to support i18n, the code is often changed to:

    # _() is the l10n lookup function
    ask(_("Do you want to replace the file on the server?"))

As a result, the literal string becomes the message ID.  When someone makes
a translation into another language, you might see the following in a .po
file (in the Brazilian translation in this case):

    msgid "Do you want to replace the file on the server?"
    msgstr "Você deseja substituir o arquivo no servidor?"
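
To make the "literal string is the key" point concrete, here is a minimal
sketch of the lookup.  It does not use the real gettext machinery: PT_BR and
make_lookup() are hypothetical stand-ins for a compiled catalog and for _(),
and the only behaviour borrowed from gettext is that an unknown msgid comes
back unchanged.

```python
# Hypothetical in-memory stand-in for a compiled pt_BR catalog.
PT_BR = {
    "Do you want to replace the file on the server?":
        "Você deseja substituir o arquivo no servidor?",
}


def make_lookup(catalog):
    """Return a _()-style function; an unknown msgid is returned unchanged."""
    def lookup(msgid):
        return catalog.get(msgid, msgid)
    return lookup


_ = make_lookup(PT_BR)

# The English literal in the source code is the key into the catalog.
translated = _("Do you want to replace the file on the server?")
# A string with no catalog entry silently falls back to the literal itself.
untranslated = _("Some string with no entry")

print(translated)
print(untranslated)
```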

This works fine, but in my opinion it is fragile and may lead to confusion
in the future.  Suppose that 6 months after the translation is made, a
usability expert recommends a change to the wording: say, that the English
message should be prefixed with "Are you sure...".

Once this change request is made, we have 2 options:

1) Change the text in the code to the new version.  This works fine for
English, but has the effect of breaking all other existing translations -
the "message ID" has now changed, so the old entries no longer match.  It is
unlikely that all translators will be available at the appropriate time, so
the translations will not all stay in sync.

2) Leave the source code as it is, but change the file containing the
English "translations".  To my mind, this is pretty horrible, as the source
code is misleading.  Having the source code contain a long, apparently
literal string, but having the runtime replace it with a similar but
different string is confusing to developers and translators alike.
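
The breakage in option 1 can be sketched as a comparison between the msgids
the code now uses and the msgids a catalog still contains.  The helper
catalog_drift() and the exact reworded string are my own inventions for
illustration, not anything from bzr or gettext.

```python
def catalog_drift(source_msgids, catalog):
    """Compare the msgids used in the code with those present in a catalog."""
    source = set(source_msgids)
    translated = set(catalog)
    return {
        # Used in the code, but the catalog has no entry for it.
        "untranslated": sorted(source - translated),
        # Still in the catalog, but no longer used anywhere in the code.
        "orphaned": sorted(translated - source),
    }


OLD = "Do you want to replace the file on the server?"
# Hypothetical rewording following the "Are you sure..." suggestion.
NEW = "Are you sure you want to replace the file on the server?"

# The pt_BR catalog was written against the old wording.
PT_BR = {OLD: "Você deseja substituir o arquivo no servidor?"}

drift = catalog_drift([NEW], PT_BR)
print(drift)

# At runtime the reworded string finds no entry, so the user sees English.
shown_to_user = PT_BR.get(NEW, NEW)
print(shown_to_user)
```

The translator's work is still sitting in the catalog, but nothing points at
it any more.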

This isn't a hypothetical problem.  I'm looking at tortoisebzr, and I see
the following literal code:

            result.append((_("BZR Branch..."), _("Branch a bazzar branch"),
                           self._branch))

I can see that the literal text 'Branch a bazzar branch' may end up changing
in the future - 'branch' appearing twice might be considered clumsy, and the
'bazzar' may want to be corrected wrt spelling and capitalized ;).  This
realization may end up being made well after a number of translations have
already been made.

My primary question is: does anyone else see this as a problem?  If there is
a consensus that this can be managed, I'm happy to go with the flow.
However, my preference would be for a more formalized model, e.g. a move to
message IDs that are obviously message IDs.  For example, I would change the
above to something like:

            result.append((_("M_BZR_BRANCH_CMD"), _("M_BZR_BRANCH_DESC"),
                           self._branch))

and rely on the English translation to supply the English text.  The
gettext() framework is likely to need a change so that the English
translation is explicitly used when no translation is available (as the
msgid itself is no longer a suitable alternative), but that should be
simple (although obviously performance would suffer in that case - 2
translation lookups instead of 1).  The readability of the code also suffers
somewhat - the developer can't immediately see what the English string is -
but something has to give somewhere...
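
For what it's worth, Python's standard gettext module already has a hook
that looks usable for the English fallback: NullTranslations.add_fallback().
Here is a rough sketch, assuming in-memory catalogs instead of real .mo
files.  DictTranslations, the "[pt_BR]" placeholder text and M_NOT_DEFINED
are made up for the example; the English strings are the tortoisebzr ones
quoted above.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog standing in for a compiled .mo file."""

    def __init__(self, messages):
        super().__init__()
        self._messages = dict(messages)

    def gettext(self, message):
        if message in self._messages:
            return self._messages[message]
        # NullTranslations.gettext() asks the fallback (if one was added),
        # otherwise it returns the msgid unchanged.
        return super().gettext(message)


# English catalog: the wording can be fixed here without touching the IDs.
EN = DictTranslations({
    "M_BZR_BRANCH_CMD": "BZR Branch...",
    "M_BZR_BRANCH_DESC": "Branch a bazzar branch",
})

# Incomplete pt_BR catalog; the value is a placeholder, not a real translation.
PT_BR = DictTranslations({
    "M_BZR_BRANCH_CMD": "[pt_BR] BZR Branch...",
})

# Second lookup: anything missing from pt_BR is resolved against English.
PT_BR.add_fallback(EN)
_ = PT_BR.gettext

cmd = _("M_BZR_BRANCH_CMD")     # found in pt_BR
desc = _("M_BZR_BRANCH_DESC")   # missing in pt_BR, supplied by English
unknown = _("M_NOT_DEFINED")    # missing everywhere, raw ID leaks through

print(cmd)
print(desc)
print(unknown)
```

The last case is the one to watch: if an ID is missing from the English
catalog as well, the user sees the raw ID, so the English file effectively
becomes mandatory.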

Any thoughts on this matter?

Cheers,

Mark