Support for Unicode files
John Arbash Meinel
john at arbash-meinel.com
Wed May 23 10:19:25 BST 2007
Robert Collins wrote:
> On Mon, 2007-05-21 at 10:30 -0700, Paul Schauble wrote:
>> That's too bad. Is there any chance of adding support?
>> I develop internationalized software for Windows. For example, my
>> current project contains files in Simplified Chinese, Greek, German,
>> English, and Czechoslovakian. All these are coded in UTF-16LE.
> I'm sure we can add support at some point. Specifically we would need to
> ensure that we serialise merged files in the original text encoding
> rather than as utf-8, which is what would happen by default once we
> teach bzr to obtain the text from a utf-16LE encoded file.
> Probably this is sufficiently likely to cause unexpected consequences
> like this that it will need some thought - a wiki page gathering details
> would be a good idea.
I'm curious what sort of Newlines are used in UTF-16LE. I thought there
were some specific extra codes for newlines other than the standard \n.
If we knew what those were, we could certainly change the merge code to
detect when the files aren't strictly binary, but instead one of
UTF-16LE/BE. And then use the same merge algorithm, just working on
strings that have some null characters.
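
As a rough sketch of what "same merge algorithm, strings with null
characters" could look like: split the file into lines but keep each line
as UTF-16 bytes, so a line-based merge never sees a half code unit. The
function name split_utf16_lines is made up for illustration and is not
part of bzr; it also assumes the input is valid UTF-16 and only treats \n
as a line terminator.

```python
def split_utf16_lines(data, encoding="utf-16-le"):
    """Split UTF-16 bytes into per-line byte strings, keeping line endings.

    Hypothetical helper, not bzr code. Only "\n" ends a line here, so a
    "\r\n" pair stays attached to the line it terminates.
    """
    text = data.decode(encoding)
    parts = text.split("\n")
    # Every part except the last was followed by a "\n" in the input.
    lines = [part + "\n" for part in parts[:-1]]
    if parts[-1]:
        # Trailing text without a final newline.
        lines.append(parts[-1])
    # Re-encode so the merge still works on the original byte form.
    return [line.encode(encoding) for line in lines]
```

Joining the returned pieces gives back the original bytes, so nothing is
transcoded on disk; the decode is only used to find line boundaries.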
Also, do all of your files have a BOM at the beginning? That is
another good way to detect UTF-16 files.
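
Something along these lines would do the BOM check. This is only a
sketch: detect_utf16_bom is a name I am making up here, not an existing
bzr function, and files without a BOM simply come back as None.

```python
import codecs


def detect_utf16_bom(data):
    """Return a codec name if data starts with a UTF-16 BOM, else None.

    Hypothetical helper. Note that the UTF-32-LE BOM also begins with
    the UTF-16-LE BOM bytes, so this check alone cannot tell them apart.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None
```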
Another possibility is to look closer at per-file properties. Having
merge code that can handle UTF-16 seems reasonable.
A bigger question, though: what do we do if you are merging a file which
claims it is UTF-16 against a file which claims it is UTF-8? If we are
opening this can of worms, it seems like people are going to start
asking us to decode into full Unicode, do the merge, and then encode
back into one of them. (either one is possible).
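
To make the decode/merge/encode idea concrete, here is a minimal sketch.
Everything in it is an assumption for illustration: the function name,
the idea of passing the encodings in explicitly, and merge_text, which
stands in for whatever text merge would actually be used.

```python
def merge_across_encodings(this_bytes, this_enc, other_bytes, other_enc,
                           merge_text, out_enc=None):
    """Decode both sides to Unicode, merge, and encode the result.

    Hypothetical sketch, not bzr code. merge_text is a caller-supplied
    callable taking two Unicode strings and returning the merged string.
    The result is written in out_enc, defaulting to this side's encoding.
    """
    this_text = this_bytes.decode(this_enc)
    other_text = other_bytes.decode(other_enc)
    merged = merge_text(this_text, other_text)
    return merged.encode(out_enc or this_enc)
```

The out_enc default is exactly the policy question: picking this side's
encoding is one choice, but the other side's would be just as defensible.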
I still feel like we shouldn't do transcoding on the fly (including
line-endings). But at least I'm starting to entertain the thought.
Oh, and having something that could be switched on via a plugin would
probably satisfy me.