Support for Unicode files

John Arbash Meinel john at arbash-meinel.com
Wed May 23 10:19:25 BST 2007


Robert Collins wrote:
> On Mon, 2007-05-21 at 10:30 -0700, Paul Schauble wrote:
> 
>> That's too bad. Is there any chance of adding support?
>>
>> I develop internationalized software for Windows. For example, my 
>> current project contains files in Simplified Chinese, Greek, German, 
>> English, and Czechoslovakian. All these are coded in UTF-16LE. 
> 
> I'm sure we can add support at some point. Specifically we would need to
> ensure that we serialise merged files in the original text encoding
> rather than as utf-8, which is what would happen by default once we
> teach bzr to obtain the text from a utf-16LE encoded file.
> 
> Probably this is sufficiently likely to cause unexpected consequences
> like this that it will need some thought - a wiki page gathering details
> would be a good idea.
> 
> -Rob

I'm curious what sort of newlines are used in UTF-16LE. I thought there 
were some specific extra code points for newlines beyond the standard \n 
and \r.

If we knew what those were, we could certainly change the merge code to 
detect when a file isn't strictly binary, but is instead UTF-16LE or 
UTF-16BE, and then use the same merge algorithm, just working on 
strings that contain some null bytes.
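
As a rough sketch of what "the same merge algorithm on strings with
null bytes" might need first: a line splitter that only matches \n on
2-byte code-unit boundaries. The name split_utf16le_lines is
hypothetical (not an existing bzr function), this assumes Python 3
bytes, and it only treats U+000A as a line terminator.

```python
def split_utf16le_lines(data):
    """Split UTF-16LE bytes into lines, keeping each line's terminator.

    Hypothetical helper, not part of bzr. Only U+000A ends a line.
    """
    newline = b'\n\x00'  # U+000A encoded as one UTF-16LE code unit
    lines = []
    start = 0
    # Step two bytes at a time so we never match across code units,
    # e.g. the bytes 41 0A 00 01 (U+0A41 U+0100) contain no newline.
    for pos in range(0, len(data) - 1, 2):
        if data[pos:pos + 2] == newline:
            lines.append(data[start:pos + 2])
            start = pos + 2
    if start < len(data):
        lines.append(data[start:])  # trailing text with no newline
    return lines
```

The lines stay as raw bytes, so joining them back together gives the
original file content unchanged.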

Also, do all of your files have a BOM at the beginning? That is 
another good way to detect UTF-16 files.
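
A minimal sketch of that BOM check, assuming Python 3 and the standard
codecs module. The function name sniff_utf16_bom is made up for
illustration; it says nothing about files that lack a BOM.

```python
import codecs

def sniff_utf16_bom(data):
    """Return 'utf-16-le', 'utf-16-be', or None based on a leading BOM.

    Hypothetical helper. Caveat: a UTF-32LE BOM (FF FE 00 00) also
    starts with the UTF-16LE BOM, so this check alone cannot tell
    the two apart.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    return None
```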

Another possibility is to look closer at per-file properties. Having 
merge code that can handle UTF-16 seems reasonable.

A bigger question, though: what do we do if you are merging a file which 
claims it is UTF-16 against a file which claims it is UTF-8? If we are 
opening this can of worms, it seems like people are going to start 
asking us to decode into full Unicode, do the merge, and then encode 
back into one of the two encodings (either one is possible).
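
To make the decode/merge/encode idea concrete, here is a small sketch
assuming Python 3. decode_lines and encode_lines are hypothetical
names, the encodings are taken as given rather than detected, and the
merge step itself is left out.

```python
def decode_lines(data, encoding):
    """Decode bytes and split into Unicode lines, keeping terminators.

    Note: str.splitlines also breaks on separators such as U+0085,
    U+2028 and U+2029, not only on \\n and \\r.
    """
    return data.decode(encoding).splitlines(True)

def encode_lines(lines, encoding):
    """Join Unicode lines and encode them into the target encoding."""
    return ''.join(lines).encode(encoding)

# The same text stored two ways compares equal once decoded.
this_side = 'hello\nworld\n'.encode('utf-16-le')
other_side = 'hello\nworld\n'.encode('utf-8')
same = decode_lines(this_side, 'utf-16-le') == decode_lines(other_side, 'utf-8')
```

Which encoding to write the result back in is exactly the open
question; the sketch just shows that either choice is mechanically
possible.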

I still feel like we shouldn't do transcoding on the fly (including 
line-endings). But at least I'm starting to entertain the thought.

Oh, and having something that could be switched on via a plugin would 
probably satisfy me.

John
=:->


