Binary file handling discussion

Lachlan Patrick loki at research.canon.com.au
Wed Nov 8 00:23:20 GMT 2006


Jari Aalto wrote:
> Nicholas Allen <allen at ableton.com> writes:
> 
>> The way CVS does it is really bad and it often
>> makes mistakes by assuming that all files are text files unless the
>> user specifies that they are binary (and often users forget this). So
>> CVSs policy is one of data destruction by default and I do not think
>> this would be a good idea for bzr!
> 
> I understood that a VCS is primary for text files and only secondary
> used for binary files

For this reason, my two cents says all files should be treated as
_binary_, while only specified file types should be treated as text.
There are just too many binary formats out there; naming them all
would be a pain. By contrast, I think the text files used in source
code can be enumerated in a small set of patterns, like *.c *.h *.cpp
*.cc *.py *.pl *.java *.cs *.txt *.xml *.svg etc. So, have a sensible
information-preserving format for the [binary] data files which are
occasionally included in a repository (and are therefore likely to be
mistakenly mangled if text is the default), and explicitly specify the
types of [source] files you *know* are textual.
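To sketch what I mean in Python (the pattern list here is just
illustrative, not a proposed default):

```python
# Everything is binary unless its name matches a known-text pattern.
from fnmatch import fnmatch

TEXT_PATTERNS = ["*.c", "*.h", "*.cpp", "*.cc", "*.py", "*.pl",
                 "*.java", "*.cs", "*.txt", "*.xml", "*.svg"]

def is_text_file(name):
    """Treat a file as text only if its name matches a text pattern."""
    return any(fnmatch(name, pat) for pat in TEXT_PATTERNS)
```

Anything the patterns miss is simply stored byte-for-byte, which is
the safe failure mode.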

We need to be careful when we talk about 'binary' and 'text', too...
one of the annoying things about certain VCS implementations is the
way they get confused by UTF-8 or UTF-16. Put a single multi-byte
UTF-8 character into otherwise ordinary ASCII and suddenly, oh no,
it's binary. To me the only important thing here is whether \r\n is
left alone or converted to \n, so if you want to go down the
'auto-detection' route you'd need to agree on a good algorithm for
that. (Auto-detection can be hard to get right. Try typing "this app
can break" into a text file on Windows, save it, then open it again in
Notepad and be surprised by some Windows auto-detection wackiness.)
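Here is a sketch of a common (and flawed) detection heuristic: call a
file binary if its first block contains a NUL byte. Plain ASCII and
UTF-8 pass, but perfectly good UTF-16 text fails, which is exactly the
confusion I'm complaining about:

```python
# Naive heuristic: a NUL byte in the first block means "binary".
def looks_binary(data, blocksize=8192):
    return b"\x00" in data[:blocksize]

ascii_text = b"hello world\n"
utf16_text = "hello world\n".encode("utf-16-le")
# looks_binary(ascii_text) is False, but looks_binary(utf16_text) is
# True, even though both are text, because UTF-16 encodes ASCII
# characters with a zero high byte.
```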

On the other hand, maybe you could save and restore all files as
binary, always, and instead make the diff tools, compiler, etc. treat
both \r\n and \n as equivalent line-end markers. In other words, fix
the problem in a different spot, by making all the text-handling tools
robust. Personally I'd prefer a solution like that, because the
question of whether a file is textual or binary gets very murky when
UTF-16 is involved (and particularly with Shift-JIS), and I don't like
the idea of a VCS performing data conversions on top of its real job.
But there are probably too many tools out there to fix, so this may
not be practical.
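For a diff tool, "robust" just means comparing lines rather than raw
bytes. A minimal sketch of that idea (the function name is mine):

```python
# Compare two files line by line, treating \r\n and \n as the same
# end-of-line marker, so DOS and Unix copies of a file come out equal.
def lines_equal(a, b):
    def normalise(data):
        return data.replace(b"\r\n", b"\n").split(b"\n")
    return normalise(a) == normalise(b)
```

The store-as-binary half is trivial; it's retrofitting this kind of
normalisation into every existing tool that isn't.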

Perhaps part of the problem is you don't want massive diffs when a
Unix file in the repository gets checked in as a DOS file, but
couldn't you perform a line-ending check which says "can I store this
fact as a single bit without losing data" and reduce the size of the
diff that way? That is, if you find \r\n in the source, check whether
there is a reversible mapping to \n, so no information would be lost.
If there are no problems, then converting the line endings will reduce
the size of the diffs, so do that. If the mapping isn't reversible (as
is the case with JPGs, Word docs, etc.) don't do it.
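The check itself is cheap: just perform the round trip and see if you
get the original bytes back. Something like:

```python
# Converting \r\n to \n on check-in is safe only if converting \n back
# to \r\n on check-out reproduces the file exactly. A stray lone \n
# (mixed endings, or binary data that merely happens to contain \r\n)
# breaks the round trip, so such files stay untouched.
def crlf_round_trip_is_lossless(data):
    stored = data.replace(b"\r\n", b"\n")      # normalise on check-in
    restored = stored.replace(b"\n", b"\r\n")  # restore on check-out
    return restored == data
```

If that returns True you can store the "this was a DOS file" fact as a
single bit; if it returns False, treat the file as opaque binary.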

The other problem I think you're trying to solve is a shared repository
between programmers working on different platforms, e.g. Linux and
Windows. You want the files to come out of the repository as Unix-text
for the Linux user, and DOS-text for the Windows user, and for them both
to be happily oblivious of the other's aberrant text file formats. I
can't see any guaranteed solution to that problem except implementing a
text-file pattern matching or explicit naming scheme, as discussed
elsewhere in this thread.

Loki




More information about the bazaar mailing list