Unicode through filesystem tricks
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jan 13 16:57:55 GMT 2006
John A Meinel wrote:
> But does that mean that now anytime we read from the user, or read from
> the filesystem we need to do:
>
> s = unicodedata.normalize('????', s.decode(bzrlib.user_encoding))
I think we can do this on a case-by-case basis. Technically, though,
any data generated on a different system may have a different encoding.
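A minimal sketch of the decode-then-normalize step, written for current Python rather than the Python of this thread. The helper name `to_internal`, the default encoding, and the choice of NFC are assumptions for illustration; the encoding argument stands in for `bzrlib.user_encoding`, and nothing here is actual bzrlib code.

```python
import unicodedata


def to_internal(raw, encoding="utf-8", form="NFC"):
    """Decode bytes from the user or filesystem, then normalize.

    Hypothetical helper: the name, the default encoding, and the
    default form are assumptions, not part of bzrlib.
    """
    return unicodedata.normalize(form, raw.decode(encoding))


# The same visible name, as two different byte sequences.
composed_bytes = "\u00e1".encode("utf-8")      # a-with-acute, one code point
decomposed_bytes = "a\u0301".encode("utf-8")   # 'a' + combining acute

print(composed_bytes == decomposed_bytes)                              # False
print(to_internal(composed_bytes) == to_internal(decomposed_bytes))    # True
```

Doing this per call site is the "case-by-case" version; the encoding would have to be supplied separately for data that came from another system.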
> That may be the sanest way. Or maybe we would only have to do it on
> I'm still trying to understand it. So far, it seems like 'canonical'
> means that they are exactly the same character.
Canonical means the sole sanctioned representation. The canonical
composed representation for a-with-acute has the same numerical value as
the iso-8859-1 character. The canonical decomposed representation, I
assume, has 'a' followed by the 'acute' combining character.
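This can be checked with Python's standard `unicodedata` module. The sketch below only inspects code points for a-with-acute; the variable names are illustrative.

```python
import unicodedata

# Start from the opposite form each time so normalize() has work to do.
composed = unicodedata.normalize("NFC", "a\u0301")
decomposed = unicodedata.normalize("NFD", "\u00e1")

print([hex(ord(c)) for c in composed])     # ['0xe1']
print([hex(ord(c)) for c in decomposed])   # ['0x61', '0x301']

# The composed code point has the same numeric value as the iso-8859-1 byte.
print(composed.encode("iso-8859-1"))       # b'\xe1'

# The second code point of the decomposed form is the combining accent.
print(unicodedata.name(decomposed[1]))     # COMBINING ACUTE ACCENT
```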
> So how do we want to represent unicode strings inside bzr? It seems they
> should be normalized, but which form?
NFC is the most compatible.
> So they sound less efficient in CPU cycles,
> though they end up being shorter in physical bytes.
I doubt the spec requires them to actually do the decomposition, as long
as the effect is as though they had.
> My first preference would be to use NFKC, since those would end up being
> more compact,
Also, because of its similarity to iso-8859, more compatible.
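One difference worth checking before choosing between the two composed forms: NFKC also applies compatibility mappings, so it can change characters that NFC leaves alone. A small sketch with the standard `unicodedata` module; the sample string is just an illustration.

```python
import unicodedata

# U+FB01 is the single-character 'fi' ligature.
name = "\ufb01le"

nfc = unicodedata.normalize("NFC", name)
nfkc = unicodedata.normalize("NFKC", name)

print(nfc == name)     # True: NFC keeps the ligature
print(nfkc == "file")  # True: NFKC replaces it with 'f' + 'i'
```

So under NFKC, two names that differ only by such a character would become the same name.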
> Any thoughts?
If some filesystems are doing normalization, we must ensure that
normalization is always performed, because normalized filesystems have
fewer possible names.
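To illustrate the "fewer possible names" point: two names that are distinct on a filesystem that does no normalization collapse into one on a filesystem that does. A rough sketch, assuming NFC as the form; `find_collisions` is a hypothetical helper, not anything in bzr.

```python
import unicodedata


def find_collisions(names, form="NFC"):
    """Group names that become identical once normalized.

    Hypothetical helper for illustration; `form` is an assumption.
    """
    groups = {}
    for name in names:
        key = unicodedata.normalize(form, name)
        groups.setdefault(key, []).append(name)
    # Keep only the keys that more than one original name maps to.
    return {key: members for key, members in groups.items() if len(members) > 1}


names = ["caf\u00e9.txt", "cafe\u0301.txt", "README"]
collisions = find_collisions(names)
print(len(collisions))                   # 1
print(len(collisions["caf\u00e9.txt"]))  # 2
```

A tree containing both spellings could not be checked out faithfully on a normalizing filesystem, which is the argument for always normalizing.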
Aaron