Binary file handling discussion

Jan Hudec bulb at ucw.cz
Sun Nov 12 15:43:06 GMT 2006


On Wed, Nov 08, 2006 at 11:23:20AM +1100, Lachlan Patrick wrote:
> Jari Aalto wrote:
> > Nicholas Allen <allen at ableton.com> writes:
> > 
> >> The way CVS does it is really bad and it often
> >> makes mistakes by assuming that all files are text files unless the
> >> user specifies that they are binary (and often users forget this). So
> >> CVS's policy is one of data destruction by default and I do not think
> >> this would be a good idea for bzr!
> > 
> > I understood that a VCS is primarily for text files and only
> > secondarily used for binary files
> 
> For this reason, my two cents says all files should be treated as
> _binary_ while only specified file types should be treated as text.
> There are just too many binary formats out there, if you need to name
> them all, it'd be a pain. By contrast, I think the text files used in
> source code can be enumerated in a set of patterns, like *.c *.h *.cpp
> *.cc *.py *.pl *.java *.cs *.txt *.xml *.svg etc. So, have a sensible
> information-preserving format for the [binary] data files which are
> occasionally included in a repository (and therefore are likely to be
> mistakenly mangled if text is the default), and explicitly specify the
> types of [source] files you *know* are textual.

There are two completely distinct (and in fact unrelated) places where
a version control system can consider the text vs. binary distinction:
 - Converting the format on store/restore. The default should be binary
   here, i.e. always store byte-for-byte equal copies. This is where CVS
   screwed up horribly by assuming text, and we don't want to repeat
   that.
 - When merging. Autodetection is fully appropriate here, with the
   possibility for the user to override it.
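The merge-time autodetection mentioned above can be sketched as a simple
heuristic, similar in spirit to what several tools do: treat a file as
binary if its first few kilobytes contain a NUL byte. This is a sketch,
not any particular VCS's actual implementation; the function name and
sample size are illustrative:

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Heuristic: treat data as binary if its first chunk contains
    a NUL byte. Text files essentially never contain NUL, while most
    binary formats do, so this catches the common cases cheaply."""
    return b"\x00" in data[:sample_size]
```

A user override (per-file or per-pattern) would then simply bypass this
check, as suggested above.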

> We need to be careful when we talk about 'binary' and 'text' too... one
> of the annoying things about certain VCS implementations is the way they
> get confused on UTF-8 or UTF-16. Put a single UTF-8 character into
> otherwise ordinary ASCII and suddenly, oh no, it's binary. To me the

I think you meant UTF-16, but any format in which that can happen is
really binary (in the sense that text merge won't work on it). And UTF-8
is fully ASCII-compatible, so there is no problem there at all.

> only important thing here is whether \r\n is left alone or converted to
> \n, so if you want to go down the 'auto-detection' route you'd need to
> agree on a good algorithm for that. (Auto-detection can be hard to get
> right. Try typing "this app can break" into a text file in Windows, save
> it, then open it again in Notepad and be surprised by some Windows
> auto-detection wackiness.)

Notepad has no autodetection whatsoever, as far as I know.

> On the other hand, maybe you could save and restore all files as binary,
> always, and instead make the diff tools, compiler, etc treat both \r\n
> and \n as equivalent line-end markers. In other words, fix the problem
> in a different spot, by making all the text-handling tools robust.
> Personally I'd prefer a solution like that, because the question of
> whether a file is textual or binary gets very murky when UTF-16 is
> involved (and particularly with Shift-JIS), and I don't like the idea of
> a VCS performing data conversions on top of its real job. But there are
> probably too many tools out there to fix, so this may not be practical.

As for UTF-16, that is on the one hand easy to detect: because of the
ambiguity in its definition, all UTF-16 files really need to start with
a byte-order mark, and anything else using the UTF-16 byte-order mark as
a magic number is extremely unlikely.
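Checking for the byte-order mark is straightforward; a minimal sketch
(the function name is mine, and it deliberately ignores the rarer
UTF-32 case whose BOM overlaps with UTF-16 LE):

```python
def utf16_bom(data: bytes):
    """Return the byte order indicated by a leading UTF-16 BOM,
    or None if the file does not start with one."""
    if data.startswith(b"\xff\xfe"):   # U+FEFF encoded little-endian
        return "little-endian"
    if data.startswith(b"\xfe\xff"):   # U+FEFF encoded big-endian
        return "big-endian"
    return None
```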

> Perhaps part of the problem is you don't want massive diffs when a Unix
> file in the repository gets checked in as a DOS file, but couldn't you
> perform a line-ending check which says "can I store this fact as a
> single bit without losing data" and reduce the size of the diff that
> way? I.e. if you find \r\n in the source, check if there is a reversible
> mapping to \n, so no information would be lost. If there are no
> problems, then converting the initial text-file-format will reduce the
> size of the diffs, so do that. If the mapping isn't reversible (as is
> the case with JPGs, Word docs, etc) don't do it.

90% of the time, the Windows tool will not convert the file to CRLF, but
rather create a horrible mix of CRLFs and LFs, in which case this won't
work. It's probably not worth it.
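The reversibility check the quoted message proposes, written so that it
rejects exactly the mixed-line-ending case described above, might look
like this sketch (the function name is illustrative):

```python
def crlf_is_reversible(data: bytes) -> bool:
    """True only if every line break in data is CRLF (no bare LF
    and no stray CR), so that converting CRLF -> LF for storage and
    LF -> CRLF on checkout loses no information."""
    # Remove all well-formed CRLF pairs; any CR or LF left over
    # is a bare line ending that would make the mapping lossy.
    leftover = data.replace(b"\r\n", b"")
    return b"\r" not in leftover and b"\n" not in leftover
```

A file that a Windows tool has turned into a CRLF/LF mix fails this
check, which is precisely why the scheme breaks down in practice.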

Besides, it's the diff tool that should have an 'ignore whitespace'
option to help you out in this case.
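For the record, GNU diff already offers such options: `-w` ignores all
whitespace changes, and `--strip-trailing-cr` specifically ignores the
CRLF vs. LF difference. A quick shell illustration:

```shell
# Two files identical except for their line endings:
printf 'hello\r\n' > a.txt
printf 'hello\n'   > b.txt

# Plain diff reports a difference; with --strip-trailing-cr
# the CR is ignored and the files compare equal (exit status 0).
diff --strip-trailing-cr a.txt b.txt && echo "files match"
```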

> The other problem I think you're trying to solve is a shared repository
> between programmers working on different platforms, e.g. Linux and
> Windows. You want the files to come out of the repository as Unix-text
> for the Linux user, and DOS-text for the Windows user, and for them both
> to be happily oblivious of the other's aberrant text file formats. I
> can't see any guaranteed solution to that problem except implementing a
> text-file pattern matching or explicit naming scheme, as discussed
> elsewhere in this thread.

This indeed has to be explicit.

--------------------------------------------------------------------------------
                  				- Jan Hudec `Bulb' <bulb at ucw.cz>



