Unicode through filesystem tricks

Fri Jan 13 16:57:55 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John A Meinel wrote:
> But does that mean that now anytime we read from the user, or read from
> the filesystem we need to do:
> 
> s = unicodedata.normalize('????', s.decode(bzrlib.user_encoding))

I think we can do this on a case-by-case basis.  Technically, though,
any data generated on a different system may have a different encoding.
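
A minimal sketch of the decode-then-normalize step John describes, in
Python 3 syntax.  The helper name to_internal, the NFC default, and
locale.getpreferredencoding() as a stand-in for bzrlib.user_encoding
are all assumptions for illustration, not what bzr actually does.

```python
import locale
import unicodedata

def to_internal(raw, encoding=None, form="NFC"):
    """Decode bytes from the user or filesystem, then normalize.

    Hypothetical helper: the encoding fallback and the NFC default
    are assumptions, chosen only to make the sketch runnable.
    """
    if encoding is None:
        # Assumed stand-in for bzrlib.user_encoding.
        encoding = locale.getpreferredencoding(False)
    return unicodedata.normalize(form, raw.decode(encoding))
```

Data from another system would need its own encoding passed in
explicitly, which is why a case-by-case approach may be enough.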

> That may be the sanest way. Or maybe we would only have to do it on
> I'm still trying to understand it. So far, it seems like 'canonical'
> means that they are exactly the same character.

Canonical means the sole sanctioned representation.  The canonical
composed representation for a-with-acute (U+00E1) has the same numerical
value as the iso-8859-1 character.  The canonical decomposed
representation, I assume, has 'a' followed by the 'acute' combining
character.
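
The two canonical forms can be checked with Python's unicodedata
module.  This is only an illustrative sketch in Python 3 syntax; the
variable names are mine.

```python
import unicodedata

composed = "\u00e1"  # a-with-acute as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# The decomposed form is 'a' followed by COMBINING ACUTE ACCENT (U+0301).
assert decomposed == "a\u0301"
assert unicodedata.name(decomposed[1]) == "COMBINING ACUTE ACCENT"

# Recomposing gives back the single code point.
assert unicodedata.normalize("NFC", decomposed) == composed

# The composed code point has the same value as the iso-8859-1 byte.
assert composed.encode("iso-8859-1") == bytes([ord(composed)])
```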

> So how do we want to represent unicode strings inside bzr? It seems they
> should be normalized, but which form?

NFC is the most compatible.

> So they sound less efficient in CPU cycles,
> though they end up being shorter in physical bytes.

I doubt the spec requires them to actually do the decomposition, as long
as the effect is as though they had.

> My first preference would be to use NFKC, since those would end up being
> more compact, 

Also, because of its similarity to iso-8859, more compatible.
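
One thing worth checking before settling on a form: NFKC also folds
compatibility characters, which NFC leaves alone.  A small sketch in
Python 3 syntax, using the "fi" ligature (U+FB01) as an arbitrary
example of my own choosing:

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# NFC keeps the compatibility character as it is.
nfc = unicodedata.normalize("NFC", ligature)
assert nfc == ligature

# NFKC replaces it with the plain letters, so the original is not recoverable.
nfkc = unicodedata.normalize("NFKC", ligature)
assert nfkc == "fi"

# For a-with-acute the two forms agree.
assert unicodedata.normalize("NFC", "a\u0301") == unicodedata.normalize("NFKC", "a\u0301")
```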

> Any thoughts?

If some filesystems are doing normalization, we must ensure that
normalization is always performed, because normalizing filesystems have
fewer possible names: names that differ only in normalization form
collapse into one.
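
To make the "fewer possible names" point concrete, here is a rough
sketch in Python 3 syntax.  The function normalized_names and the
sample file names are hypothetical, and NFC is assumed as the target
form.

```python
import unicodedata

def normalized_names(names, form="NFC"):
    """Group original names by their normalized form (hypothetical helper)."""
    groups = {}
    for name in names:
        key = unicodedata.normalize(form, name)
        groups.setdefault(key, []).append(name)
    return groups

# Two spellings of the same visible name, plus an unrelated one.
names = ["caf\u00e9.txt", "cafe\u0301.txt", "readme.txt"]
groups = normalized_names(names)

# Names that would clash on a normalizing filesystem.
collisions = {key: found for key, found in groups.items() if len(found) > 1}
```

On a filesystem that does not normalize, the first two names are
distinct files; on one that does, they are the same file, so a tree
containing both cannot be checked out there unless we normalize first.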

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDx9wT0F+nu1YWqI0RAvFeAJ0QR7jFEKBGJLoF3yzCEifGGf+oyACeMbHD
z15z0EUcMy9fhNRffaG1jKg=
=DJzv
-----END PGP SIGNATURE-----