File names normalization
v.ladeuil+lp at free.fr
Fri Sep 12 07:18:56 BST 2008
>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:
>> So, I'll summarize as follows:
>> - internally, bzr represents file names as NFC normalized, utf-8
>> encoded strings,
john> - at one point, bzr internally represented file names as NFC
john> normalized, Unicode strings.
john> I've somewhat given up on that because of people who
john> want mixed strings, and it fights with people who
john> want automatic normalization between platforms.
If people wanted un-normalized strings... well, there is little
we can do...
john> If we don't force normalization on Win32 and Linux, but
john> *do* enforce it on Mac (because we know Mac "renames"
john> stuff), then *sometimes* people will commit
john> non-normalized filenames on Windows, and then when they
john> check it out on Mac it will appear renamed.
Indeed. I can't see how to address that since there is an
Either we have a canonical representation (NFC) and we may need
to rename files (but people should not be able to visually
notice) or we don't. In the latter case, we can't guarantee
that the files can be checked out on Mac.
I view the latter case as somewhat similar to the case sensitivity
issues because it allows two different files with the same NFC
form to coexist.
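
To make that collision concrete, here is a minimal Python sketch. It is
not bzr code: the helper name find_nfc_collisions is hypothetical and the
file names are made up; only the standard unicodedata module is assumed.

```python
import unicodedata
from collections import defaultdict

def find_nfc_collisions(names):
    """Group file names that share the same NFC form (hypothetical helper)."""
    by_nfc = defaultdict(list)
    for name in names:
        by_nfc[unicodedata.normalize("NFC", name)].append(name)
    # Keep only the NFC forms reached by more than one distinct name.
    return {nfc: group for nfc, group in by_nfc.items() if len(set(group)) > 1}

composed = "caf\u00e9.txt"      # 'e with acute' as a single code point (NFC)
decomposed = "cafe\u0301.txt"   # 'e' followed by a combining acute accent (NFD)
collisions = find_nfc_collisions([composed, decomposed, "readme.txt"])
```

The two names compare unequal as strings, yet both map to the same NFC
key, which is the situation a normalizing file system cannot represent.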
john> Compare that to never normalizing, where people check
john> in NFC names on Linux, and then we re-normalize for
john> Mac. Probably we would get *more* cases correct, but we
john> still lose edge cases.
Yes, that's why I would prefer to always normalize (whatever
canonical form we choose).
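
As a rough sketch of what "always normalize" could look like when a name
enters the tree, assuming NFC is the form chosen (canonical_name is a
hypothetical helper, not an existing bzr function):

```python
import unicodedata

def canonical_name(name, form="NFC"):
    """Return (normalized_name, was_already_canonical) for a Unicode file name.

    Hypothetical helper: 'form' is whichever canonical form gets chosen.
    """
    normalized = unicodedata.normalize(form, name)
    return normalized, normalized == name

# A decomposed name, as a Mac file system might report it, is mapped
# to its composed form and flagged as not canonical.
name, ok = canonical_name("cafe\u0301.txt")
```

Whether a non-canonical name should then be rejected or silently
renamed is exactly the policy question that still has to be settled.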
>> - it checks, during add and rename, that files are in a canonical
>> form,
>> - using non-NFC encoding file systems requires transcoding at
>> various points.
>> The last point being the one that needs an unknown amount of work
>> to be 100% correct.
john> And hasn't been proven to be a win for the "majority"
john> of users. (Why force NFC encoding on Windows/Linux for
john> people who may never check the code out on Mac? etc)
Can you remind me why it was a loss (except for ) ?
john> My current idea was to just reduce the internal
john> checking, let people version any Unicode name they
john> want. And let people on Mac suffer the consequences of
john> a filesystem that doesn't play by the same rules everyone
john> else does.
Hmm, it seems you went to the dark side ! :D
john> I just didn't get far enough through to ripping out all
john> of the normalization checks.
john> We should do a proper fix and audit the code. It just
john> hasn't been a high priority. Also, I haven't found
john> anyone else particularly interested in the topic.
We first need to clearly define the rules.
john> And *I* don't version non-ascii filenames at this time,
john> nor do I have a habit of checking them out on other
john> platforms. (Mac used to be my primary laptop, but is
john> not anymore.)
Indeed, you went to the dark side :)