Unicode through filesystem tricks

John A Meinel john at arbash-meinel.com
Fri Jan 13 15:34:17 GMT 2006


Denys Duchier wrote:
> John A Meinel <john at arbash-meinel.com> writes:
> 
>> Basically, it seems we need some sort of unicode normalization.
> 
> maybe something like this:
> 
>>>> unicodedata.normalize('NFKC',u"ra\u0308ksmo\u0308rga\u030as")
> u'r\xe4ksm\xf6rg\xe5s'
> 
> --Denys
> 

Thanks for the pointer.
The question becomes: when do we normalize, and when do we actually
need to?

I don't think we really need to support the Linux use case, where two
distinct files can have byte names that decode to equivalent unicode
strings (u'\xe4' ~= u'a\u0308').
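
Just to illustrate why this matters, the two spellings compare unequal
as raw unicode, but match once normalized:

>>> import unicodedata
>>> u'\xe4' == u'a\u0308'
False
>>> u'\xe4' == unicodedata.normalize('NFC', u'a\u0308')
True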

But does that mean that any time we read from the user, or read from
the filesystem, we need to do:

s = unicodedata.normalize('????', s.decode(bzrlib.user_encoding))

That may be the sanest way. Or maybe we would only have to do it on
certain platforms...
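
Something like this rough sketch, perhaps (assuming we settle on NFC;
the helper name is made up):

import unicodedata
import bzrlib

def normalized_unicode(s, encoding=None):
    # Hypothetical helper: decode raw bytes from the user or the
    # filesystem, then normalize so later comparisons all agree.
    if not isinstance(s, unicode):
        s = s.decode(encoding or bzrlib.user_encoding)
    return unicodedata.normalize('NFC', s)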

I found this page:
http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table

Which defines the four normalization forms:
NFD - Canonical Decomposition
NFC - Canonical Decomposition, followed by Canonical Composition
NFKD - Compatibility Decomposition
NFKC - Compatibility Decomposition, followed by Canonical Composition

I'm still trying to understand it. So far, it seems 'canonical' means
the two forms represent exactly the same character. Katakana, for
instance, has characters that come in half-width and full-width
variants: both are the katakana letter 'a', but one is wider than the
other. So they are not canonically equivalent, but they are
compatibility equivalent.
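
For example, half-width katakana 'a' (U+FF71) is only compatibility
equivalent to the full-width one (U+30A2), so NFC leaves it alone
while NFKC folds it:

>>> unicodedata.normalize('NFC', u'\uff71')
u'\uff71'
>>> unicodedata.normalize('NFKC', u'\uff71')
u'\u30a2'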

Also, it seems like the Mac is using one of the decomposition forms,
while we normally think in terms of the composition forms:
>>> unicodedata.normalize('NFD', u'r\xe4ksm\xf6rg\xe5s')
u'ra\u0308ksmo\u0308rga\u030as'
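
And applying NFC to that decomposed form gets the composed string
back:

>>> unicodedata.normalize('NFC', u'ra\u0308ksmo\u0308rga\u030as')
u'r\xe4ksm\xf6rg\xe5s'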

So how do we want to represent unicode strings inside bzr? It seems
they should be normalized, but in which form? Is there a speed penalty
for certain forms? NFC and NFKC sound like they first do the full
NFD/NFKD decomposition and then recompose the string, so they sound
less efficient in CPU cycles, though the results end up shorter in
physical bytes.
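
Rather than guess, we could measure; a quick and unscientific timeit
sketch (numbers will vary by input and Python build):

import timeit

setup = r"import unicodedata; s = u'ra\u0308ksmo\u0308rga\u030as' * 100"
for form in ('NFD', 'NFC', 'NFKD', 'NFKC'):
    t = timeit.Timer("unicodedata.normalize('%s', s)" % form, setup)
    # best of 3 repeats of 1000 normalizations each
    print form, min(t.repeat(3, 1000))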

My first preference would be to use NFKC, since the results end up
more compact, and by using compatibility equivalence we are more
likely to get matches, rather than missing a match because one
platform represents characters as wide katakana and another as narrow
katakana.
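
As a rough data point on compactness:

>>> s = u'ra\u0308ksmo\u0308rga\u030as'
>>> len(unicodedata.normalize('NFKD', s)), len(unicodedata.normalize('NFKC', s))
(13, 10)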

Any thoughts?

John
=:->
