Any plans to fix Unicode normalization issues on Mac OS X before bzr 2?

Wed Jul 22 04:41:19 BST 2009

2009/7/22 Jean-Francois Roy <bahamut at macstorm.org>:
> The bugs for unicode normalization "awareness" have been open for almost 2
> years now. Is anyone even considering fixing them? With bzr 2 coming and the
> 2a format practically finalized, it would be a shame to miss out on addressing
> this very real issue because of required format changes (possibly) being
> rejected because they came in too late. I'm somewhat interested in the issue
> because I speak French and do have files with non-ASCII characters and have
> come across this issue with other version control systems (namely svn); I
> was kind of hoping bzr could improve on that.
>
> The bugs https://bugs.launchpad.net/bzr/+bug/172383 and
> https://bugs.launchpad.net/bzr/+bug/102935 seem to track the issue, and
> although they have some good information, they don't have much discussion
> of possible practical solutions. I'm not too familiar with Unicode, so I am
> not sure what the correct approach is, beyond that it seems bzr should
> assume precomposed form, and on Mac OS X have an additional layer to
> decompose characters when writing their name out.
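For readers unfamiliar with the problem, here is a minimal sketch of it in
Python 3 using the standard unicodedata module. It assumes the usual
description of the issue: the same visible name can be spelled precomposed
(NFC) or decomposed (roughly NFD, which is what HFS+ on Mac OS X stores), and
the two spellings compare unequal as strings. The file name and the helper
same_name are made up for illustration and are not part of bzr.

```python
import unicodedata

# The same visible name in two spellings: "é" as one precomposed code point
# (NFC), and as "e" followed by a combining acute accent (NFD).
precomposed = "caf\u00e9.txt"
decomposed = unicodedata.normalize("NFD", precomposed)


def same_name(a, b):
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A plain string comparison treats the two spellings as different files, while
comparing after normalization treats them as the same one.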

To start with the bad news: no, I don't think they will be fixed for
2.0.  Of course I would like this fixed, but I wouldn't slip 2.0 for
the sake of them, and I wouldn't ask Canonical people to spend work
time on them ahead of other bugs targeted to 2.0.

That said, if someone did come up with a fix that passed review, I
wouldn't specifically block it.  (Even if it bumped the format it's
possible we could land it, but let's not spend too much time
discussing hypotheticals.)

There has been a fair amount of discussion of this topic, particularly
by John and Vincent.

> Perhaps some sort of content filter mechanism could be used to shield bzr
> from the idiosyncrasies of Unicode composition. One possible idea would be
> to use extended attributes to store the name of the file as it appears in
> the branch index and use that instead of the file system name for all
> operations. This would completely shield bzr from any transformation Mac OS
> X might do to the file name, while ensuring the information follows the file
> diligently. This would also (most likely) work without any modifications to
> existing branch formats, and may only require (this is a guess) a checkout
> format change.

I don't think this would be 'content filtering' as we currently use
the term, which is for the contents of files, but yes, some kind of
translation seems to make sense.  I don't think it would necessarily
need a format bump, as the dirstate and committed inventories are
already defined to be unicode.
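
As a rough illustration of what such a translation layer might look like,
here is a sketch in Python 3. It assumes the recorded names are kept in NFC
and that the file system may hand back a decomposed spelling. The functions
to_inventory_name and match_on_disk are hypothetical and are not existing
bzr APIs.

```python
import unicodedata


def to_inventory_name(fs_name):
    """Map a name as reported by the file system to its NFC spelling.

    Assumption: recorded (inventory) names are stored in NFC.
    """
    return unicodedata.normalize("NFC", fs_name)


def match_on_disk(inventory_names, listed_names):
    """Return {inventory name: on-disk name} for entries present on disk.

    listed_names is a directory listing exactly as the file system returned
    it, so the values keep whatever spelling the file system uses.
    """
    by_nfc = {to_inventory_name(name): name for name in listed_names}
    return {name: by_nfc[name] for name in inventory_names if name in by_nfc}
```

The idea is that the tool would only ever compare NFC names internally, and
would use the on-disk spelling from the mapping when it actually opens or
renames a file.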

You can probably find some discussion of this in the list history so I
won't try to reinvent it here myself.  Probably a good next step would
be to have a specific proposal in one of those bugs.

-- 
Martin <http://launchpad.net/~mbp/>