Support for Unicode files
Dennis.Benzinger at gmx.net
Wed May 23 12:29:58 BST 2007
On Wed, 23 May 2007 11:19:25 +0200,
John Arbash Meinel <john at arbash-meinel.com> wrote:
> I'm curious what sort of Newlines are used in UTF-16LE. I thought
> there were some specific extra codes for newlines other than the
> standard \n and \r.
For example there are \u0085 NEXT LINE (NEL) and \u2028 LINE SEPARATOR
(LS). For a complete list you need to read the Unicode Standard Annex
#14 Line Breaking Properties <http://www.unicode.org/reports/tr14/>
(especially Table 1. Line Breaking Classes) and
http://www.unicode.org/Public/UNIDATA/LineBreak.txt where the line
breaking properties of each Unicode character are defined.
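
To illustrate why the extra line terminators matter, here is a minimal
Python sketch. The sample string is made up for illustration. It compares
splitting on "\n" alone with str.splitlines(), which also recognises NEL
and LS (among a few other terminators). It does not implement the full
UAX #14 algorithm.

```python
# Sample text with four different line terminators: LF, CR LF, NEL and LS.
# The content is invented purely for illustration.
text = "one\ntwo\r\nthree\u0085four\u2028five"

# Round-trip through UTF-16LE to mimic reading such a file from disk.
raw = text.encode("utf-16-le")
decoded = raw.decode("utf-16-le")

# Splitting on "\n" alone leaves the CR in place and misses NEL and LS.
naive_lines = decoded.split("\n")

# str.splitlines() also breaks on \r, \r\n, \u0085 and \u2028.
unicode_lines = decoded.splitlines()

print(len(naive_lines))   # 3
print(unicode_lines)      # ['one', 'two', 'three', 'four', 'five']
```

Note that this works on decoded text. Splitting the raw UTF-16LE bytes on
b"\n" would cut through two-byte code units, so a line-based tool has to
know the encoding before it looks for line ends.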
> Another possibility is to look closer at per-file properties.
I think that's a good way to handle this problem. There should be a
mime-type property for each file (as in Subversion) and a merge plugin
registry where plugins for different MIME types can register. The
plugin that matches the MIME type of the file being merged is then
used. MIME types for which no plugin is registered are treated as
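
A registry like the one described could be sketched roughly as follows.
This is only an illustration of the idea, not Bazaar's API: the names
MergePluginRegistry, register, lookup and merge_whole_file are all
hypothetical. The lookup returns None for an unregistered MIME type and
leaves the fallback behaviour to the caller.

```python
from typing import Callable, Dict, Optional

# Hypothetical plugin signature: (base, this, other) contents -> merged result.
Merger = Callable[[bytes, bytes, bytes], bytes]


class MergePluginRegistry:
    """Hypothetical registry mapping MIME types to merge plugins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Merger] = {}

    def register(self, mime_type: str, merger: Merger) -> None:
        # MIME type names are case-insensitive, so store them lower-cased.
        self._plugins[mime_type.lower()] = merger

    def lookup(self, mime_type: str) -> Optional[Merger]:
        # None means "no plugin registered"; the caller picks the fallback.
        return self._plugins.get(mime_type.lower())


def merge_whole_file(base: bytes, this: bytes, other: bytes) -> bytes:
    """Trivial placeholder plugin: whole-file three-way merge."""
    if this == other or other == base:
        return this
    if this == base:
        return other
    raise ValueError("both sides changed the file differently")


registry = MergePluginRegistry()
registry.register("text/plain", merge_whole_file)

merger = registry.lookup("text/plain")
if merger is not None:
    print(merger(b"old", b"old", b"new"))   # b'new'
```

Keying on the MIME type keeps the core merge code unaware of file
formats: supporting a new format only means registering another plugin.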
> A bigger question, though. What to do if you are merging a file which
> claims it is UTF-16 against a file which claims it is UTF-8?
Refuse to merge.
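
As a rough sketch of such a check, assuming each file carries a declared
encoding name (for example from a per-file property): the names
EncodingMismatch and check_same_encoding are hypothetical, and only the
standard codecs module is used to normalise aliases such as "UTF8" and
"utf-8".

```python
import codecs


class EncodingMismatch(Exception):
    """Hypothetical error raised when two files declare different encodings."""


def check_same_encoding(this_encoding: str, other_encoding: str) -> None:
    # codecs.lookup() normalises aliases, so "UTF8" and "utf-8" compare equal.
    this_name = codecs.lookup(this_encoding).name
    other_name = codecs.lookup(other_encoding).name
    if this_name != other_name:
        raise EncodingMismatch(
            "refusing to merge %s against %s" % (this_name, other_name)
        )


check_same_encoding("UTF-8", "utf8")        # same encoding, no error

try:
    check_same_encoding("UTF-16", "UTF-8")
except EncodingMismatch as exc:
    print(exc)   # refusing to merge utf-16 against utf-8
```

This sketch treats "utf-16" and "utf-16-le" as different encodings;
whether those should count as compatible is a separate decision.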
> I still feel like we shouldn't do transcoding on the fly (including
Me too. The users should have to decide which line endings to use.
Bazaar shouldn't automatically convert any file. That's too much magic.