i18n: infrastructure [long]

Fri Apr 15 17:14:28 UTC 2005

Hello,

This is for all authors and translators. You may know that in the last release 
we did not have much support for the i18n process. To address this I have 
restructured the repos. While the restructure seems to have solved the 
management and packaging issues introduced by i18n, it does not solve a number 
of semantic and processing issues.

One of the main problems was that, in order to manage files better and 
simplify the packaging system, we needed a directory for each translation 
(prose and images). Organizing things this way changed the file paths and, 
with them, the location of each XML-instance relative to the images it 
references. As a result, the fileref values of all imagedata elements were 
broken, since the documents no longer reside in the same location as when we 
wrote them. We needed to update all these values, but the introduction of 
i18n support meant it was not enough to just change each value to reflect the 
new path. Instead of specifying the fileref values in the English document 
and propagating them to the i18n versions, we now have to consider that 
translated documents have screen captures taken in their own locale.

While, at present, it is entirely possible for us to simply update the fileref 
values for each document and its translations to reflect the new paths, this 
would not be a good long-term solution. The first problem is that the 
translated versions of an English document are generated through a process 
built on POT and PO files. Each time we update a POT file the changes are 
merged into the respective PO files, which are finally reconstructed into 
XML-instances based on the original XML-instance used to create the POT. Since 
the POT and PO files do not contain all element data, the fileref values of 
the English original would be propagated to all translated documents during 
this process, leaving us to continually maintain fileref values by hand.

Being a lazy person, I thought this was just too much overhead and could 
easily lead to errors as things get forgotten or time runs out near release. 
What we needed was a way to abstract the path, as far as possible, so that the 
propagated fileref values would work with little or no modification throughout 
the workflow. My first thought was to script it in a makefile or shell script, 
but then I realized that this would not be of great benefit, as it would not 
result in a solution that is easily supported by Document and Content 
Management Systems. We needed a solution that was inline and maintainable 
within a pure XML environment. After some investigation I arrived at the 
following solution.

First we have modularized our entity structure to accommodate a number of 
layers. I will not go into all of them now. To the internal subset of our 
document prolog we have added an internal entity called 'language'. The value 
of language is a reference to an entity defined in the external entity called 
'globalent', which is itself declared and referenced in the internal subset. 
The two declarations are shown below.

<!ENTITY % globalent SYSTEM "../../../libs/global.ent">
%globalent;
<!ENTITY % language "&EnglishAmerican;">

Within 'globalent', all the two-letter ISO language codes used in the i18n 
process and in the structure of our file system are defined as entities. For 
example:

<!ENTITY German 'de' >
<!ENTITY Bhutani 'dz' >
<!ENTITY Greek 'el' >
<!-- <!ENTITY EnglishAmerican 'en'> -->
<!ENTITY EnglishAmerican 'C'>
<!ENTITY Esperanto 'eo' >
<!ENTITY Spanish 'es' >
<!ENTITY Estonian 'et' >

Hope you are still with me.

Looking at the language entity you will see that the value is 
"&EnglishAmerican;", which refers to the entity declared as 
<!ENTITY EnglishAmerican 'C'>, which in turn expands to "C". The current value 
of language is therefore "C", meaning it is an English document. In the sample 
from 'globalent' you will notice that this entity occurs twice, with one 
instance commented out. Look closer and you will see that the value of the 
commented instance is not 'C' but 'en'. This is the actual two-letter ISO 
code, but we do not use it, since within the directory structure English 
documents and images are maintained in a directory named 'C'. This is done to 
maintain compatibility with the GNOME convention that places all English 
resources in C.

Back to our entities. By setting the value of language to an entity reference 
we have a single parameter by which to control the value of every reference 
to 'language' throughout an XML-instance. Since the 'language' entity is now 
declared, we can use a reference to it at any point in the body of a document, 
and it will be replaced by the value of the entity reference defined as its 
value.

For example:

<article id="art-about-ubuntu" status="complete" lang="%language;">
<title>....</title>
<para> ...... The language is %language;</para>
<mediaobject>
 <imageobject>
 <imagedata fileref="../../images/%language;/IconUbuntu.png" format="PNG"/>
 </imageobject>
</mediaobject>
 ....
</article>

In the lang attribute of the article node, %language; is used to denote the 
language of the document. When the lang attribute is matched by the DocBook 
XSL stylesheets, its value is used to select the files containing translated 
strings, called generated texts (gentexts). These are part of the DocBook XSL 
package installed on your system and include texts for labels, captions, etc.

Note: If a lang attribute is not defined the stylesheets default to en. There 
is a small problem here: if the value of 'language' is &EnglishAmerican; the 
value of lang will be C. The stylesheets do not match lang="C", and a warning 
is therefore generated:

"No localization exists for "c" or "". Using default "en". null	"

This is only a warning, and for our purposes it has the desired effect of 
selecting the en gentexts. It is a desired error.

In the para node the result is "The language is C" if the value of 'language' 
is &EnglishAmerican;.

In the imagedata node the result is
<imagedata fileref="../../images/C/IconUbuntu.png" format="PNG"/>

Substitute the value &EnglishAmerican; with &German; and the results will be:
<article id="art-about-ubuntu" status="complete" lang="de">
<para> ...... The language is de</para>
<imagedata fileref="../../images/de/IconUbuntu.png" format="PNG"/>
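
One caveat worth checking against your toolchain: the XML specification only 
recognizes parameter-entity references (%name;) inside the DTD, so a standard 
parser treats %language; in an attribute value or in element content as 
literal text. Below is a minimal sketch of the same scheme using a general 
entity (&language;) instead. As an assumption made purely to keep the sketch 
self-contained, the language codes are declared inline rather than pulled in 
from global.ent. The title text is elided as in the example above.

<?xml version="1.0"?>
<!DOCTYPE article [
<!-- Stand-in for the declarations normally pulled in from global.ent -->
<!ENTITY EnglishAmerican 'C'>
<!ENTITY German 'de'>
<!-- General entity (no %), so it can be referenced in the document body -->
<!ENTITY language "&EnglishAmerican;">
]>
<article id="art-about-ubuntu" status="complete" lang="&language;">
<title>....</title>
<para>The language is &language;</para>
<mediaobject>
 <imageobject>
 <imagedata fileref="../../images/&language;/IconUbuntu.png" format="PNG"/>
 </imageobject>
</mediaobject>
</article>

With these declarations a parser resolves lang and the image path to C. 
Changing the value of language to "&German;" resolves them to de.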

Hence we have:
1. a method to ensure that we can use the appropriate gentexts when 
transforming to HTML, PDF, etc.
2. a method to ensure compatibility with yelp and GNOME folder conventions.
3. a method to ensure compatibility under Document and Content Management 
Systems.

All of the above leaves one problem. We must somehow change the value of 
'language' in each XML-instance to the entity reference relevant to the 
language of the document. I still have to test whether this can be done under 
a Document or Content Management System, but I am reasonably confident that it 
can be done using pipes and XSLT.
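
To make the required change concrete: for a German translation, the only line 
of the prolog shown earlier that would differ is the declaration of 
'language'. This is a sketch that reuses the declarations from this message 
unchanged and assumes the translated file sits at the same directory depth as 
the English one:

<!ENTITY % globalent SYSTEM "../../../libs/global.ent">
%globalent;
<!ENTITY % language "&German;">

Whatever tool performs the substitution only needs to rewrite that one 
declaration. The body of the XML-instance stays the same.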

Hope this helps.

-- 
Sean Wheller
Technical Author
sean at inwords.co.za
http://www.inwords.co.za
Registered Linux User #375355