Unicode (UTF-16) files on Windows

Philippe Lhoste PhiLho at GMX.net
Thu Aug 20 09:38:39 BST 2009


I was puzzled because I had a simple .reg file (exported by regedit), which I hacked to add
support for a new source code extension (icon, editor/compiler, etc.), and Bazaar was
seeing it as binary although my editor showed only CR and LF control characters...

The Bazaar User Reference mentions (casually) that binary status is guessed from content (I
suppose by looking for control characters in the first bytes, as usual).

When I opened the file with a hex editor, I saw the reason: it is a UTF-16 file with a BOM
(0xFF 0xFE, i.e. little-endian).
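
A minimal Python sketch of why such a file trips a content-based check. It assumes
the heuristic is "a NUL byte in the first 1024 bytes means binary", which is a common
approach but is not confirmed here as Bazaar's actual rule. The function name
looks_binary, the probe size and the sample text are made up for the illustration.

```python
import codecs

def looks_binary(data, probe_size=1024):
    """Hypothetical heuristic: a NUL byte near the start means binary."""
    return b"\x00" in data[:probe_size]

# Sample text only; a real .reg export would be much longer.
text = "Windows Registry Editor Version 5.00\r\n"

# UTF-16 little-endian with BOM, the byte layout seen in the hex editor.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
ascii_bytes = text.encode("ascii")

print(utf16_bytes[:8])            # the BOM, then each letter followed by a NUL
print(looks_binary(utf16_bytes))  # flagged: every ASCII letter carries a NUL byte
print(looks_binary(ascii_bytes))  # not flagged: no NUL bytes at all
```

The file contains nothing but ordinary text, yet every second byte is zero, which is
enough for a check of this kind to call it binary.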

It is annoying because I cannot do diffs (it just says "Binary files ... differ" and qdiff
shows nothing -- at least I can do an external diff), cat output is strange (letters are
double-spaced -- qcat shows a hex view), etc.
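
As a stopgap, two versions of such a file can be decoded and compared outside Bazaar.
The sketch below uses only the Python standard library. The helper names utf16_diff
and as_utf16 and the sample registry lines are hypothetical, and it assumes both
inputs are UTF-16 with a BOM.

```python
import codecs
import difflib

def utf16_diff(old_bytes, new_bytes, old_name="old", new_name="new"):
    """Return a unified diff of two UTF-16 (with BOM) byte strings."""
    # The "utf-16" codec consumes the BOM and takes the byte order from it.
    old_lines = old_bytes.decode("utf-16").splitlines(keepends=True)
    new_lines = new_bytes.decode("utf-16").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, old_name, new_name)
    )

def as_utf16(text):
    """Encode sample text the way the .reg file is stored (LE with BOM)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Made-up sample content standing in for two revisions of a .reg file.
old = as_utf16('[HKEY_CLASSES_ROOT\\.foo]\r\n@="foofile"\r\n')
new = as_utf16('[HKEY_CLASSES_ROOT\\.foo]\r\n@="barfile"\r\n')

diff_text = utf16_diff(old, new)
print(diff_text)
```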

How come Bazaar doesn't properly handle UTF-16 with a BOM? Maybe you could add BOM
detection to the binary file detection heuristic? Of course, it means other commands
(like cat) should understand UTF-16 as well, so it might imply more work than it seems.
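
A rough sketch of what a BOM-aware check could look like, in Python. This is not
Bazaar code: the names detect_utf16_bom and classify are hypothetical, the NUL-byte
fallback is an assumed heuristic, and falling back to UTF-8 for files without a BOM
is a simplification.

```python
import codecs

# BOMs mapped to the codec that decodes the bytes following them.
UTF16_BOMS = {
    codecs.BOM_UTF16_LE: "utf-16-le",
    codecs.BOM_UTF16_BE: "utf-16-be",
}

def detect_utf16_bom(data):
    """Return the codec name if data starts with a UTF-16 BOM, else None."""
    for bom, codec in UTF16_BOMS.items():
        if data.startswith(bom):
            return codec
    return None

def classify(data, probe_size=1024):
    """Return ("text", decoded_string) or ("binary", None)."""
    codec = detect_utf16_bom(data)
    if codec is not None:
        try:
            text = data[2:].decode(codec)
        except UnicodeDecodeError:
            return "binary", None  # BOM-like prefix, but not valid UTF-16
        # Decoded NULs suggest something else, e.g. UTF-32 LE, whose BOM
        # (FF FE 00 00) begins with the UTF-16 LE BOM.
        if "\x00" in text:
            return "binary", None
        return "text", text
    if b"\x00" in data[:probe_size]:
        return "binary", None  # assumed NUL-byte heuristic for everything else
    return "text", data.decode("utf-8", errors="replace")

sample = codecs.BOM_UTF16_LE + "h\u00e9llo\r\n".encode("utf-16-le")
print(classify(sample))
```

Commands such as cat and diff would then work on the decoded string rather than on
the raw bytes, which is where the extra work mentioned above comes in.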

I found an earlier, similar case: https://lists.ubuntu.com/archives/bazaar/2006q2/010794.html
"Have you ever seen a UTF-16/UCS-2 source file in a tree? I know they might occur on 
Windows but it seems unlikely even there."
Well, we have here a "typical" case. That, and text files (documents) written with Notepad 
(an error, I know...) which defaults (?) to UTF-16, for example.

I suppose such support is low priority (after all, I can use WinDiff, my editor and other
external tools), but that's the kind of glitch that makes some users say that Bazaar's
support for Windows is lacking (I saw that on StackOverflow:
http://stackoverflow.com/questions/995636 ).

-- 
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --



