Filename normalisation handling
John Arbash Meinel
john at arbash-meinel.com
Tue Aug 21 17:51:30 BST 2007
Robert Collins wrote:
> On Tue, 2007-08-21 at 16:51 +1000, Ian Clatworthy wrote:
>> I'm hoping the short answer is that the test is broken - it doesn't need
>> the "rename to something that cannot be normalised" scenario to test
>> id2path anyway. I don't pretend to have the long answer and suspect I'm
>> opening a Pandora's box? If our position hasn't changed, should I take
>> something like https://bugs.launchpad.net/bzr/+bug/77657/comments/3 and
>> put it into the FAQ? Well, clearer in the FAQ than it is currently?
>
> I suspect that the scenario the test is testing is:
> On Linux, add a Unicode filename, commit.
> Go to Mac OS X, make a copy of the branch.
> bzr commit.
>
> -Rob
No, it is just testing a rename to a Unicode filename.
The specific bug is that u'\xb5' (µ, the micro sign) is not considered a
normalized character. Even though it is not a combining character, the
compatibility normalization forms (NFKC and NFKD) change it to u'\u03bc'
(Greek small letter mu):
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\xb5')
u'\xb5'
>>> unicodedata.normalize('NFD', u'\xb5')
u'\xb5'
>>> unicodedata.normalize('NFKD', u'\xb5')
u'\u03bc'
>>> unicodedata.normalize('NFKC', u'\xb5')
u'\u03bc'
The first two forms (NFC and NFD) leave a character alone unless it has a
canonical equivalent; the K forms are stricter and also replace compatibility
characters such as the micro sign. (I don't remember all the details; I just
use K because it seemed correct when I was doing it.)
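As a rough sketch of why only the K forms touch this character: the Unicode
database records a compatibility (not canonical) decomposition for it. The
helper name normalization_report below is made up for illustration and is not
part of bzr; the u'' literals are chosen so the snippet runs on Python 2 and on
Python 3.3 or later.

```python
import unicodedata

MICRO_SIGN = u'\xb5'   # U+00B5 MICRO SIGN
GREEK_MU = u'\u03bc'   # U+03BC GREEK SMALL LETTER MU


def normalization_report(char):
    """Hypothetical helper: map each normalization form to its result."""
    forms = ('NFC', 'NFD', 'NFKC', 'NFKD')
    return dict((form, unicodedata.normalize(form, char)) for form in forms)


report = normalization_report(MICRO_SIGN)

# The decomposition entry is tagged '<compat>', so the canonical forms
# (NFC/NFD) ignore it and only the compatibility forms (NFKC/NFKD) apply it.
decomposition = unicodedata.decomposition(MICRO_SIGN)
```

Here report['NFC'] and report['NFD'] stay equal to MICRO_SIGN, while
report['NFKC'] and report['NFKD'] come back as GREEK_MU.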
The big problems with this are:
a) u'\xb5' can be encoded in iso-8859-1 (as '\xb5'); u'\u03bc' cannot.
b) On Mac, the filename u'\xb5' is translated to '\xc2\xb5', which is UTF-8
for u'\xb5'. So the Mac still uses u'\xb5', even though in general it uses
NFD/NFKD while we would like to use NFC/NFKC.
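Both points can be checked with a few lines of Python. This is only a sketch:
can_encode is a hypothetical helper, not a bzr function, and it says nothing
about what a particular filesystem does on disk.

```python
MICRO_SIGN = u'\xb5'   # U+00B5 MICRO SIGN
GREEK_MU = u'\u03bc'   # U+03BC GREEK SMALL LETTER MU


def can_encode(text, encoding):
    """Hypothetical helper: True if text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Point (a): only the micro sign exists in iso-8859-1.
micro_in_latin1 = can_encode(MICRO_SIGN, 'iso-8859-1')
mu_in_latin1 = can_encode(GREEK_MU, 'iso-8859-1')

# Point (b): the two characters have different UTF-8 byte sequences, so a
# name that comes back as '\xc2\xb5' is still the micro sign.
micro_utf8 = MICRO_SIGN.encode('utf-8')
mu_utf8 = GREEK_MU.encode('utf-8')
```

So a user with an iso-8859-1 terminal can type and display the original name
but not the NFKC-normalized one.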
Some of this *might* be solved by using NFC instead of NFKC. But I thought
there was a reason I preferred NFKC. Also, Mac's encoding isn't strictly NFD;
apparently they use a form which is close to it, but not quite the same.
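To illustrate what switching the check from NFKC to NFC would change, here is a
minimal sketch. is_normalized is a hypothetical stand-in for whatever check bzr
performs, and the file names are invented examples.

```python
import unicodedata


def is_normalized(name, form):
    """Hypothetical check: True if name is unchanged by the given form."""
    return unicodedata.normalize(form, name) == name


micro_name = u'\xb5.txt'   # file name containing the micro sign

# Under NFC the micro sign is accepted as-is; under NFKC it is rejected.
micro_ok_nfc = is_normalized(micro_name, 'NFC')
micro_ok_nfkc = is_normalized(micro_name, 'NFKC')

# A name with an accented letter still differs between composed and
# decomposed forms, so NFC alone does not remove the Mac mismatch.
composed = u'caf\xe9'                                 # e-acute, one code point
decomposed = unicodedata.normalize('NFD', composed)   # 'e' + combining acute
decomposed_ok_nfc = is_normalized(decomposed, 'NFC')
```

With NFC the u'\xb5' rename would pass, but a decomposed name of the kind a Mac
typically hands back would still be flagged.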
John
=:->