Filename normalisation handling

John Arbash Meinel john at
Tue Aug 21 17:51:30 BST 2007

Robert Collins wrote:
> On Tue, 2007-08-21 at 16:51 +1000, Ian Clatworthy wrote:
>> I'm hoping the short answer is that the test is broken - it doesn't need
>> the "rename to something that cannot be normalised" scenario to test
>> id2path anyway. I don't pretend to have the long answer and suspect I'm
>> opening a Pandora's box? If our position hasn't changed, should I take
>> something like
>> and
>> put it into the FAQ? Well, clearer in the FAQ than it is currently?
> I suspect that the scenario the test is testing is:
> On Linux, add a Unicode filename, commit.
> Go to MacOS X, make a copy of the branch.
> bzr commit.
> -Rob

No, it is just testing a rename to a Unicode filename.

The specific bug is that u'\xb5' (µ, the micro sign) is not considered a
normalized character. Even though it is not a combining character, the stricter
(K) normalization forms change it to u'\u03bc', the Greek small letter mu:

>>> unicodedata.normalize('NFC', u'\xb5')
u'\xb5'
>>> unicodedata.normalize('NFD', u'\xb5')
u'\xb5'
>>> unicodedata.normalize('NFKD', u'\xb5')
u'\u03bc'
>>> unicodedata.normalize('NFKC', u'\xb5')
u'\u03bc'

The first two forms (NFC, NFD) only change a string where they have to, while
the K forms (NFKC, NFKD) are stricter and also fold compatibility characters
such as u'\xb5'. (I don't remember all the details; I just used K because it
seemed correct when I was doing it.)
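
To make "not considered a normalized character" concrete, here is a minimal
sketch of such a check. It is written in Python 3 syntax (where plain strings
are Unicode), and the helper name is_normalized is made up for illustration; it
is not taken from bzr's code.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # the same character as u'\xb5' in the message
GREEK_MU = "\u03bc"    # what the K forms turn it into


def is_normalized(name, form):
    """Return True if `name` is left unchanged by the given normalization form.

    Hypothetical helper: a filename counts as "normalized" under a form
    when normalizing it gives back exactly the same string.
    """
    return unicodedata.normalize(form, name) == name


# Check the micro sign against all four standard forms.
results = {
    form: is_normalized(MICRO_SIGN, form)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, ok in results.items():
    print(form, "accepts the micro sign" if ok else "rewrites the micro sign")
```

Under this kind of check the micro sign passes for NFC and NFD but fails for
NFKC and NFKD, so a rename to a name containing it is rejected only when the
check uses a K form.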

The big problems with this are:

a) u'\xb5' has a code point in iso-8859-1 (\xb5), u'\u03bc' does not.
b) On Mac, the filename u'\xb5' is translated to '\xc2\xb5', which is the UTF-8
encoding of u'\xb5'. That means Mac still uses u'\xb5', even though in general
they use NFD/NFKD while we would like to use NFC/NFKC.
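
Both points can be checked at the byte level. The following is only a sketch in
Python 3 syntax; can_encode is a hypothetical helper introduced for this
example.

```python
MICRO_SIGN = "\u00b5"  # u'\xb5'
GREEK_MU = "\u03bc"    # u'\u03bc'


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# a) The micro sign exists in iso-8859-1; the Greek mu does not.
micro_latin1 = MICRO_SIGN.encode("iso-8859-1")
mu_fits_latin1 = can_encode(GREEK_MU, "iso-8859-1")

# b) The UTF-8 bytes for the micro sign are the ones quoted for the Mac filename.
micro_utf8 = MICRO_SIGN.encode("utf-8")
mu_utf8 = GREEK_MU.encode("utf-8")

print(micro_latin1, mu_fits_latin1, micro_utf8, mu_utf8)
```

So rewriting the micro sign to the Greek mu produces a name that can no longer
be written out on an iso-8859-1 system, and it also changes the bytes that
appear on disk under UTF-8.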

Some of this *might* be solved by using NFC instead of NFKC, but I thought
there was a reason I preferred NFKC. Also, Mac's encoding isn't strictly NFD;
apparently they use a form which is close to it, but not quite the same.
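
For reference, a small sketch (Python 3 syntax) of what switching from NFKC to
NFC would and would not change. The sample strings are chosen only for
illustration and do not come from the test in question.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: the decomposed spelling that an
# NFD-style filesystem would typically hand back for an accented "e".
decomposed = "e\u0301"

# NFC still recombines decomposed names into the single precomposed character.
composed = unicodedata.normalize("NFC", decomposed)

# NFC leaves the micro sign alone, while NFKC rewrites it to Greek mu.
micro_nfc = unicodedata.normalize("NFC", "\u00b5")
micro_nfkc = unicodedata.normalize("NFKC", "\u00b5")

# The same split shows up for other compatibility characters, for example
# the "fi" ligature (U+FB01), which only the K form expands.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")

print(ascii(composed), ascii(micro_nfc), ascii(micro_nfkc))
print(ascii(ligature_nfc), ascii(ligature_nfkc))
```

In other words, NFC would still undo the decomposition of accented names, but
it would stop folding compatibility characters such as the micro sign.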
