File names normalization
Vincent Ladeuil
v.ladeuil+lp at free.fr
Fri Sep 12 07:18:56 BST 2008
>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:
john> ...
>> So, I'll summarize as follows:
>>
>> - internally, bzr represents file names as NFC normalized, utf-8
>> encoded strings,
john> - at one point, bzr internally represented file names as NFC
john> normalized, Unicode strings.
Too bad.
john> I've somewhat given up on that because of people who
john> want mixed strings, and that fights with people who
john> want automatic normalization between platforms.
If people wanted un-normalized strings... well, there is little
we can do...
john> If we don't force normalization on Win32 and Linux, but
john> *do* enforce it on Mac (because we know Mac "renames"
john> stuff), then *sometimes* people will commit
john> non-normalized filenames on Windows, and then when they
john> check it out on Mac it will appear renamed.
Indeed. I can't see how to address that since there is an
inherent contradiction.
Either we have a canonical representation (NFC) and we may need
to rename files (but people should not be able to visually
notice[1]) or we don't. In the latter case, we can't guarantee
that the files can be checked out on Mac.
I view the latter case as somewhat similar to the case sensitive
issues because it allows two different files with the same NFC
form to coexist.
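To make the coexistence problem concrete, here is a minimal sketch using
Python's standard unicodedata module. The helper name find_nfc_collisions
and the sample file names are hypothetical, made up for illustration; this
is not bzr code.

```python
import unicodedata

def find_nfc_collisions(names):
    """Group distinct names that share the same NFC form.

    Hypothetical helper, not part of bzr. Returns a dict mapping each
    NFC form to the list of distinct input names that normalize to it,
    keeping only the forms claimed by more than one name.
    """
    groups = {}
    # dict.fromkeys() drops exact duplicates while preserving order.
    for name in dict.fromkeys(names):
        key = unicodedata.normalize("NFC", name)
        groups.setdefault(key, []).append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}

# Two names that look the same on screen but are different strings:
composed = "caf\u00e9.txt"     # 'e with acute' as one code point (NFC)
decomposed = "cafe\u0301.txt"  # 'e' followed by a combining acute accent

collisions = find_nfc_collisions([composed, decomposed, "plain.txt"])
```

A file system that compares names code point by code point can hold both
files at once, while one that normalizes names sees a single file, which
is the same shape of problem as two names differing only by case.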
john> Compare that to never normalizing, where people check
john> in NFC names on Linux, and then we re-normalize for
john> Mac. Probably we would get *more* cases correct, but we
john> still lose edge cases.
Yes, that's why I would prefer to always normalize (whatever
canonical form we choose).
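As a rough sketch of what "always normalize" could mean at add time,
assuming NFC as the canonical form and utf-8 as the internal encoding:
the function name to_internal_name below is hypothetical and this is not
how bzr actually implements it.

```python
import unicodedata

def to_internal_name(name):
    """Return (internal_bytes, was_renamed) for a user-supplied name.

    Hypothetical helper, not bzr's API. The name is normalized to NFC
    and then encoded as utf-8; was_renamed is True when the input was
    not already in NFC, i.e. when the stored name differs from the one
    the user typed.
    """
    nfc = unicodedata.normalize("NFC", name)
    return nfc.encode("utf-8"), nfc != name

# A decomposed name is stored in composed form and flagged as changed.
internal, renamed = to_internal_name("cafe\u0301.txt")
```

The was_renamed flag is the point where a front end could warn the user,
or refuse the add, instead of silently storing a different name.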
>>
>> - it checks, during add and rename that files are in a canonical
>> form,
>>
>> - using non-NFC encoding file systems requires transcoding at
>> various points.
>>
>> The last point being the one that needs an unknown amount of work
>> to be 100% correct.
>>
john> And hasn't been proven to be a win for the "majority"
john> of users. (Why force NFC encoding on Windows/Linux for
john> people who may never check the code out on Mac? etc)
Can you remind me why it was a loss (except for [1]) ?
john> My current idea was to just reduce the internal
john> checking, let people version any Unicode name they
john> want. And let people on Mac suffer the consequences of
john> a filesystem that doesn't play by the same rules everyone
john> else does.
Hmm, it seems you went to the dark side ! :D
john> I just didn't get far enough through to ripping out all
john> of the normalization checks.
Don't :)
john> We should do a proper fix and audit the code. It just
john> hasn't been a high priority. Also, I haven't found
john> anyone else particularly interested in the topic.
We first need to clearly define the rules.
john> And *I* don't version non-ascii filenames at this time,
john> nor do I have a habit of checking them out on other
john> platforms. (Mac used to be my primary laptop, but is
john> not anymore.)
Indeed, you went to the dark side :)
Vincent