Unicode Normalization

Mon Jun 26 15:32:21 BST 2006

My recent fixes to bzr's handling of Unicode filenames revealed a
Unicode issue on Mac OS X. Specifically, Mac doesn't preserve the
bytestream that you send it. It interprets the bytes as Unicode and
changes the normalization.

I think people who have followed this list have seen my earlier posts on
the subject, but the root of the problem is that Unicode allows more
than one way to represent certain characters. While win32 and Linux just
return what you gave them, Mac decides that there is 'one true way', and
it is different from what everyone else considers the best way.
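To make the two representations concrete, here is a small sketch using
Python's standard unicodedata module. It assumes the character in
question is U+00E5 (a with ring above), the same one used in the example
further down. HFS+ stores a variant of the decomposed form (NFD), while
the composed form (NFC) is what most other tools produce.

```python
import unicodedata

composed = u'\xe5'        # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
decomposed = u'a\u030a'   # 'a' followed by U+030A COMBINING RING ABOVE

# The two strings display the same, but they are different sequences
# of code points, so a plain comparison says they differ.
assert composed != decomposed

# NFC composes, NFD decomposes; each form converts cleanly to the other.
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```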

Just like with case-insensitivity, Mac will let you access the filename
by either form. Unlike a case-preserving filesystem, though, it actually
changes the form that is stored on disk. So you can do:

>>> import os
>>> open(u'\xe5', 'wb').write('foo')
>>> os.listdir(u'.')
[u'a\u030a']

Anyway, my original plan for handling this was to have a function that
filenames go through. On Mac, it would just change the normalization to
the one we prefer.

On other platforms, it wouldn't change the normalization, but higher
levels could detect that they don't support that kind of filename.
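A rough sketch of what such a function could look like follows. The name
normalize_filename, the platform parameter, and the choice of NFC as the
preferred form are all assumptions made for illustration; this is not
existing bzr code.

```python
import sys
import unicodedata

# Assumption: the form "we prefer" is NFC (composed).
PREFERRED_FORM = 'NFC'


def normalize_filename(name, platform=sys.platform):
    """Return (name_to_use, is_supported) for a unicode filename.

    Hypothetical helper. On Mac the name is converted to the preferred
    form, which is safe because Mac accepts either form when opening a
    file. Elsewhere the name is returned untouched, and the flag tells
    the caller whether it is already in the preferred form.
    """
    preferred = unicodedata.normalize(PREFERRED_FORM, name)
    if platform == 'darwin':
        return preferred, True
    return name, preferred == name
```

A caller on Linux or win32 that gets False back can then report that the
file cannot be versioned, rather than raising.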

We could make it an exception, but we don't really want to barf just
because a file exists in the working directory that we don't handle. But
we should let the user know that they can't version this file (and why).

The problem with this approach is that it creates an overhead for every
file that we are versioning. It is one more function that every filename
needs to go through. So I'm a little worried that it would hurt
performance on kernel-sized trees (especially since all of the filenames
in such a tree are ASCII anyway).

Right now, I think the best way to go would be to do something in
list_files, similar to how WorkingTree does it now for ignored files.

Basically, you go through, and if you know a file is versioned, you just
return it. If it doesn't match the inventory, you check if it needs to
be normalized. And if the name changes, you then check again if it is
versioned, and then go on to check if it is ignored, etc.
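The loop described above might look roughly like this. It is a
simplified sketch, not the real list_files: classify_files, the
versioned set, and the is_ignored callback are hypothetical stand-ins
for the inventory lookup and the ignore rules, and NFC is assumed to be
the preferred form.

```python
import unicodedata


def classify_files(disk_names, versioned, is_ignored):
    """Yield (name, status) for each filename found on disk.

    Names that match the inventory exactly take the fast path and are
    never normalized. Only the remaining names pay for normalization.
    """
    for name in disk_names:
        if name in versioned:
            # Common case: known file, no extra work.
            yield name, 'versioned'
            continue
        normalized = unicodedata.normalize('NFC', name)
        if normalized != name and normalized in versioned:
            # The name changed under normalization and now matches.
            yield normalized, 'versioned'
        elif is_ignored(normalized):
            yield normalized, 'ignored'
        else:
            yield normalized, 'unknown'


versioned = {u'README', u'\xe5.txt'}
disk = [u'README', u'a\u030a.txt', u'build.pyc', u'notes']
result = list(classify_files(disk, versioned,
                             lambda n: n.endswith(u'.pyc')))
```

Here the decomposed name on disk is matched to the composed name in the
inventory, while README never reaches the normalization step.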

Does this seem reasonable? It adds an extra function call and an if
statement to the list_files loop, which I'm not super keen on (since it
affects initial 'add' performance). But I think it has the least impact
in the case where most of the files are versioned and most of them are
not fancy Unicode, while still correctly handling filenames on all
platforms.

John
=:->
