[i18n] Input Method and Fonts improvements for Gutsy

Arne Goetje arne at ubuntu.com
Fri Aug 10 11:35:34 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Zimmerman wrote:
> On Tue, Aug 07, 2007 at 01:15:23AM +0800, Arne Goetje wrote:
>> 1. Input Method (SCIM):
>> Both Live CD and default installation come with the SCIM package
>> installed, however it is not properly set up, so that the user actually
>> cannot use it.
> 
> This was working at one point; Michael Vogt was involved with it.  CCing
> him.
> 
>> SCIM depends on some environment variables and on the SCIM daemon being
>> started in the background. There is a nice tool called im-switch which
>> takes care of this.
> 
> im-switch is installed by language-support packages corresponding to
> languages which require it.  The trouble, of course, is that none of these
> are installed on the live CD due to space constraints.
> 
> So we may need to find a way to get scim installed, but selectively enabled
> depending on the language, or perhaps rethink the way we handle
> language-support.

OK, I did some more tests...
SCIM does work if the user right-clicks in the application window and
selects "Input Method -> SCIM". This works in all UTF-8 locales.
But as this step is not obvious to a novice user, I recommend setting
the environment variable GTK_IM_MODULE=scim (and QT_IM_MODULE=scim for
Qt applications). That way SCIM works as expected.
For the Live CD this approach is enough, but for the default
installation the scim-bridge-* packages should also be installed. I'll
have to dig a bit further into how they need to be configured, but they
are supposed to solve a few problems with 3rd party applications
(Acrobat Reader, Skype, etc.).
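
To make the environment variable approach concrete, here is a minimal
sketch in Python. It only covers the two variables named above; the
helper names scim_environment and run_with_scim are made up for this
illustration, and a real setup would more likely export the variables
from a session startup script (which is what im-switch is meant to
handle).

```python
import os
import subprocess


def scim_environment(base=None):
    """Return a copy of an environment with the SCIM variables set.

    `base` defaults to the current process environment. Only the two
    variables mentioned in this mail are touched.
    """
    env = dict(os.environ if base is None else base)
    env["GTK_IM_MODULE"] = "scim"  # GTK applications use SCIM
    env["QT_IM_MODULE"] = "scim"   # Qt applications use SCIM
    return env


def run_with_scim(command):
    """Run `command` (a list of arguments) with the SCIM variables set."""
    return subprocess.run(command, env=scim_environment(), check=False)


# Example with an explicit base environment instead of os.environ:
example = scim_environment({"LANG": "en_US.UTF-8"})
print(example)
```

Something like run_with_scim(["gedit"]) would then start a single
application with SCIM preselected, without changing the rest of the
session.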

>> I highly recommend that we put the following packages and their
>> dependencies into the Live CD and the default installation to make it
>> more useful:
>>  * scim-anthy or scim-prime: Japanese input methods. scim-prime is a
>> dictionary-based IM, which is a great advantage over anthy, although
>> both are widely used in Japan.
>>  * scim-chewing: Traditional Chinese phonetic IM, widely used in Taiwan
>>  * scim-pinyin: Simplified and Traditional Chinese Pinyin IM, widely
>> used in China and by foreigners in Taiwan. ;)
>>  * scim-hangul: As the name says, Korean.
>>  * scim-tables-zh: additional table based IMs for Simplified and
>> Traditional Chinese, many of them are popular in China, Hong Kong and
>> Taiwan.
>>  * scim-thai: well, Thai. :)
>>  * scim-m17n: bridge to the m17n library, which adds a lot of additional
>>  IMs, including Latin based ones for the European languages with
>> diacritics. (not everyone likes to fiddle with XKB settings. ;) )
> 
> As with im-switch, these modules are installed by the relevant
> language-support packages.  It would be useful for you to review their
> dependencies and establish whether they are correct.  We can then make
> decisions on language support simply by selecting the relevant
> language-support package, which will conveniently keep track of which
> packages are relevant for which languages.

Well, if I need to input Chinese and Japanese on an English system, I
don't want to install a few dozen files from the language packs,
especially if the translations are useless for me anyways. ;)

Installing all the above-mentioned modules with their dependencies on
the Live CD needs about 48 MB of additional space (I selected scim-anthy
over scim-prime here).

If we remove some font packages and create a core-fonts package instead,
we can save about 30 MB or more (see below).

>> The following packages must NOT be installed:
>>  * scim-uim: BROKEN, will trash the SCIM setup tool. Don't install it.
>>  * scim-chinese: old version of scim-pinyin, not compatible with the
>> current scim package; breaks dependency handling.
> 
> scim-uim seems to be installed with Edubuntu only.  What is the trouble with
> it?  Can it be fixed?  If not, should it be removed entirely?

scim-uim is not actively maintained. When this package is installed, the
SCIM setup tool (GUI) always crashes with a segfault. Removing the
package solves the issue.

> Likewise for scim-chinese.  We don't seem to be using it, so if it isn't
> needed, it should probably be removed to reduce confusion.

scim-chinese is the old version of scim-pinyin. The package was renamed
with the SCIM API change between 1.2.x and 1.4.0. scim-chinese does not
work with the current scim version and actually conflicts with it.
Therefore it should be removed.

>> 3. Fonts:

>>  b) Font packages:

>> Option 1: We craft a separate package, just for the Live CD, and put
>> selected fonts from the other font packages together, just for this
>> single purpose.
>> Caveat: might conflict with the other font packages (duplicate font
>> files), should probably not be used on the default installation on the
>> users' hard disks.
> 
> This is an interesting idea, as it would allow us to continue to provide
> legible fonts for many languages without creating so much confusion with a
> huge number of default fonts.

I have spent some time comparing the fonts installed by default on the
Live CD with additional fonts available in the repositories.
Currently the /usr/share/fonts/truetype/ directory uses about 94 MB of
space.
Below is a list of fonts I consider necessary as core fonts to display
all kinds of scripts. I made the selection with screen readability and
complex font requirements in mind.

----------------------------------------------------------------------------
Font Name                Package               Scripts              Filesize
----------------------------------------------------------------------------
DejaVu Sans              ttf-dejavu            Multiple               519412
DejaVu Sans Bold         ttf-dejavu            Multiple               493320
DejaVu Sans Mono         ttf-dejavu            Multiple               289712
DejaVu Sans Mono Bold    ttf-dejavu            Multiple               278376
DejaVu Serif             ttf-dejavu            Multiple               213360
DejaVu Serif Bold        ttf-dejavu            Multiple               204988
MgOpenCanonica           ttf-mgopen            Greek                  281580
MgOpenCanonica Bold      ttf-mgopen            Greek                  284968
MgOpenModerna            ttf-mgopen            Greek                   60404
MgOpenModerna Bold       ttf-mgopen            Greek                   57592
Abyssinica SIL           ttf-sil-abyssinica    Ethiopian (Amharic)    619012
Ezra SIL                 ttf-sil-ezra          Hebrew                 153392
PakType Tehreer          ttf-paktype           Arabic, Farsi, Urdu    308756
Scheherazade             ttf-scheherazade      Arabic, Farsi, Urdu    260392
Lohit Bengali            ttf-bengali-fonts     Bengali                138536
Chandas                  ttf-devanagari-fonts  Devanagari            2584956
Lohit Gujarati           ttf-gujarati-fonts    Gujarati                79168
Lohit Kannada            ttf-kannada-fonts     Kannada                186364
AnjaliOldLipi            ttf-malayalam-fonts   Malayalam              433556
Lohit Oriya              ttf-oriya-fonts       Oriya                   93140
Saab                     ttf-punjabi-fonts     Punjabi                114092
Lohit Tamil              ttf-tamil-fonts       Tamil                   64760
Pothana2000              ttf-telugu-fonts      Telugu                 194268
Padauk                   ttf-sil-padauk        Myanmar                146104
Padauk Bold              ttf-sil-padauk        Myanmar                148632
Khmer OS System          ttf-khmeros           Khmer                  265624
PhetsarathOT             ttf-lao               Lao                     92828
Loma                     ttf-thai-tlwg         Thai                    37140
Loma-Bold                ttf-thai-tlwg         Thai                    37964
AR PL ShanHeiSun Uni     ttf-arphic-uming      Han                  20890468
UnBatang                 ttf-unfonts           Hangul                3678974
UnBatangBold             ttf-unfonts           Hangul                4070868
UnDotum                  ttf-unfonts           Hangul                2209390
UnDotumBold              ttf-unfonts           Hangul                2808360
Sazanami Mincho          ttf-sazanami-mincho   Japanese             10554196
Sazanami Gothic          ttf-sazanami-gothic   Japanese              7690324
SIL Yi                   ttf-sil-yi            Yi                     463336
TibetianMachineUniAlpha  ttf-tmuni             Tibetan, Dzongkha     1355768
----------------------------------------------------------------------------
Total                                                               62364080
----------------------------------------------------------------------------

 * The filesizes for the DejaVu and AR PL ShanHeiSun Uni fonts are those
from the current packages; newer versions will differ.
 * DejaVu should be upgraded to 2.18 to include Georgian script.
 * PakType Tehreer and Scheherazade contain almost the same glyphs and
typeface, so I think only one of them is needed. They are supposed to
replace the ttf-arabeyes fonts, because those lack Farsi and Urdu support.
 * The question is whether we need to keep the Bold versions... dropping
them could save some additional space.
 * The Unfonts fonts are supposed to replace the Baekmuk fonts.
 * The Sazanami fonts are supposed to replace the Kochi fonts.
 * All these fonts are supposed to be used instead of the DejaVu fonts
for their individual script coverage, because their complex script
support and/or shapes are better than DejaVu's.
 * These fonts are supposed to be taken out of their packages and put
together into a new core-fonts package. Installing their original
packages would waste a lot of space.
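
For anyone who wants to check the space arithmetic, here is a rough
sketch in Python. The byte sizes are copied from the table above; the
94 MB (current truetype directory) and 48 MB (SCIM modules) figures are
the ones quoted earlier in this mail. Treating "MB" as 1024*1024 bytes
is my assumption, so the results are estimates only.

```python
# File sizes in bytes, copied from the core fonts table.
CORE_FONTS = {
    "DejaVu Sans": 519412,
    "DejaVu Sans Bold": 493320,
    "DejaVu Sans Mono": 289712,
    "DejaVu Sans Mono Bold": 278376,
    "DejaVu Serif": 213360,
    "DejaVu Serif Bold": 204988,
    "MgOpenCanonica": 281580,
    "MgOpenCanonica Bold": 284968,
    "MgOpenModerna": 60404,
    "MgOpenModerna Bold": 57592,
    "Abyssinica SIL": 619012,
    "Ezra SIL": 153392,
    "PakType Tehreer": 308756,
    "Scheherazade": 260392,
    "Lohit Bengali": 138536,
    "Chandas": 2584956,
    "Lohit Gujarati": 79168,
    "Lohit Kannada": 186364,
    "AnjaliOldLipi": 433556,
    "Lohit Oriya": 93140,
    "Saab": 114092,
    "Lohit Tamil": 64760,
    "Pothana2000": 194268,
    "Padauk": 146104,
    "Padauk Bold": 148632,
    "Khmer OS System": 265624,
    "PhetsarathOT": 92828,
    "Loma": 37140,
    "Loma-Bold": 37964,
    "AR PL ShanHeiSun Uni": 20890468,
    "UnBatang": 3678974,
    "UnBatangBold": 4070868,
    "UnDotum": 2209390,
    "UnDotumBold": 2808360,
    "Sazanami Mincho": 10554196,
    "Sazanami Gothic": 7690324,
    "SIL Yi": 463336,
    "TibetianMachineUniAlpha": 1355768,
}

MIB = 1024 * 1024            # assumption: "MB" in this mail means 2**20 bytes
CURRENT_TRUETYPE_MB = 94     # current /usr/share/fonts/truetype/ usage
SCIM_MODULES_MB = 48         # extra space for the SCIM modules

total_bytes = sum(CORE_FONTS.values())
# Bold variants are identified by name only, which works for this table.
bold_bytes = sum(size for name, size in CORE_FONTS.items() if "Bold" in name)

saved_mb = CURRENT_TRUETYPE_MB - total_bytes / MIB
net_growth_mb = SCIM_MODULES_MB - saved_mb

print(f"core fonts total:      {total_bytes} bytes ({total_bytes / MIB:.1f} MB)")
print(f"bold variants:         {bold_bytes} bytes ({bold_bytes / MIB:.1f} MB)")
print(f"saved vs. current set: {saved_mb:.1f} MB")
print(f"net growth with SCIM:  {net_growth_mb:.1f} MB")
```

The sum reproduces the table total, and the bold figure gives an idea
of what dropping the Bold versions would buy.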


>>  b) CJK fonts:
>> This topic really is... erm... difficult.
>> For the Arphic fonts (and probably also a Heiti (sans-serif, like DejaVu
>> Sans) and Yuanti (rounded, like Kochi Gothic) font) I have the following
>> in mind:
>> The problem is that many characters share the same codepoint in
>> Unicode, but have a different shape (number of strokes and stroke order)
>> in the different CJK regions (China, Hong Kong / Macao, Taiwan, Japan,
>> Korea). This is one of the main reasons why users in these regions
>> prefer different fonts.
>> My approach would be to put all character shape variants into a single
>> TTC (TrueType Collection) and use a different glyph ID to Unicode
>> codepoint mapping for each "virtual font".
>> Instead of having 5 separate TTF files, each about 25MB in size, we
>> would end up with only one TTC file (about 30 MB in size), which
>> produces 5 "virtual fonts". Saves a lot of space. ;)
>>
>> (If you need more details about this technology, I can elaborate on
>> it in a follow-up mail)
> 
> This is a key problem, and an interesting proposed solution.  Would this
> require any changes outside of the fonts themselves?

No. TTC already works with GTK2 and Qt4 >= 4.3. OpenOffice.org is
supposed to work, at least it does on SuSE Linux... The Debian package
seems to have a bug: it cannot use TTC correctly.
However, Qt3, GTK1 and other legacy software cannot use TTC.
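
To illustrate the "virtual fonts" idea from the quoted proposal, here
is a toy model in Python. It is not the TTC file format, just the
principle: one shared glyph pool plus one codepoint-to-glyph-ID mapping
per region. The glyph IDs, the outline labels and the choice of which
region gets which variant are invented for the example; the size
figures are the rough ones from the proposal.

```python
REGIONS = ["China", "Hong Kong / Macao", "Taiwan", "Japan", "Korea"]

# Shared glyph pool: glyph ID -> outline. A text label stands in for
# the real outline data, which would be stored only once in the file.
GLYPH_POOL = {
    1: "outline for U+4E00 (same shape everywhere)",
    2: "outline for U+76F4, variant A",
    3: "outline for U+76F4, variant B",
}

# One codepoint -> glyph ID mapping per virtual font.
CMAPS = {region: {0x4E00: 1, 0x76F4: 2} for region in REGIONS}
CMAPS["Japan"][0x76F4] = 3  # hypothetical: this region uses variant B


def lookup(region, codepoint):
    """Return the outline a given virtual font shows for a codepoint."""
    return GLYPH_POOL[CMAPS[region][codepoint]]


def size_estimate(n_fonts=5, ttf_mb=25, ttc_mb=30):
    """Compare n separate TTF files with one combined TTC (sizes in MB)."""
    separate_mb = n_fonts * ttf_mb
    return separate_mb, ttc_mb, separate_mb - ttc_mb


for region in REGIONS:
    print(f"{region}: U+76F4 -> {lookup(region, 0x76F4)}")

separate, combined, saved = size_estimate()
print(f"{separate} MB as separate files, {combined} MB as TTC, {saved} MB saved")
```

Every region maps the same codepoints, so applications see five
complete fonts, while glyphs with a common shape are stored only once.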

Cheers
Arne


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGvD92bp/QbmhdHowRAqhgAJ0UIdahEzeJOjAOwfAb9k0WJWOYRwCZAba9
pmmVMvkKeh50ftDUrWmzA8Q=
=5wtI
-----END PGP SIGNATURE-----


