Binary file handling discussion
allen at ableton.com
Fri Nov 3 15:15:46 GMT 2006
John and I were discussing this issue yesterday and I thought it might
be useful to bring up the problem so it can be discussed here.
Handling end-of-line conversions is something that an RCS needs to do
*really* well so that the same files can be edited on multiple OSs that
have different line ending styles and file encodings. It would also be
nice if bzr could tell a text file from a binary one because it may wish
to do other conversions such as keyword expansion as well as line ending
and file encoding translations (none of which should ever be done for
binary files).
So to me at least, it is clear we need some way to tell bzr what is text
and what is binary (or at least what conversions should be done on a file).
I think, for text files, it would make sense to store them in the
repository using one line ending style. This will prevent massive diffs
happening in the repo when the same file is edited on different OSs and
the line endings are converted back and forth. I think it would make
sense to store the text files with the \n character rather than the
Windows \r\n as it is shorter and makes more sense anyway.
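
To make the idea concrete, here is a minimal sketch in Python of storing
text with \n only and converting back on checkout. The function names
are hypothetical and this is not how bzr works today; it only illustrates
the proposal, and it assumes the file has already been identified as text.

```python
import os


def to_repository_form(data: bytes) -> bytes:
    """Convert Windows \\r\\n line endings to \\n before storing (text files only)."""
    return data.replace(b"\r\n", b"\n")


def to_working_tree_form(data: bytes, newline: bytes = os.linesep.encode("ascii")) -> bytes:
    """Convert stored \\n line endings to the given (by default, native) style."""
    if newline == b"\n":
        return data
    return data.replace(b"\n", newline)


# The same text edited on Windows and on Unix ends up identical in the
# repository, so the line endings alone never produce a diff.
windows_copy = b"first line\r\nsecond line\r\n"
unix_copy = b"first line\nsecond line\n"
same_in_repo = to_repository_form(windows_copy) == to_repository_form(unix_copy)
```

Applying to_repository_form to a binary file would silently corrupt it,
which is exactly the data loss risk discussed next.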
As I understand it, Bazaar does not do any conversion yet as there is
a risk of data loss (for example, a file that is binary is mistakenly
assumed to be a text file and the \r character codes get stripped from
it). The way CVS does it is really bad: it often makes mistakes because
it assumes that all files are text files unless the user specifies that
they are binary (and users often forget to do this). So CVS's policy is
one of data destruction by default and I do not think this would be a
good idea.
Subversion is much better with its auto property setting feature and,
in my experience with it at least, it works pretty well. In fact, I
have not had a problem with it at all, so the assumption that everything
is binary unless otherwise stated seems to be a good solution in my opinion.
I think what Bazaar should do is have some sensible defaults and always
assume binary unless told otherwise. If it thinks a file is text (e.g.
because it has an extension that the user has explicitly configured
as text) but it determines that the file has binary bytes in it (for
example, text files don't contain a \0 character but binary files very
often do) then it could warn the user when they add the file and turn
off all conversions. This would be a rare case but a valuable extra
check just to make sure it doesn't do line ending conversion on what
is really a binary file. It would also be possible, but not nearly as
useful, to check for the opposite situation - so it could warn about a
file that it thinks is binary but that contains no binary characters,
in case the user wishes to enable conversions for it.
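
The check at add time could look roughly like the following Python
sketch. Everything here is an assumption made for illustration: the
TEXT_PATTERNS list stands in for whatever the user has configured, the
function names are made up, and the \0 test is only the simple heuristic
mentioned above, not a complete binary detector.

```python
import fnmatch

# Hypothetical user configuration: patterns explicitly declared as text.
# Anything that matches no pattern is treated as binary by default.
TEXT_PATTERNS = ["*.txt", "*.py", "*.c", "*.h"]


def configured_as_text(path, patterns=TEXT_PATTERNS):
    """True if the user has explicitly declared this kind of file as text."""
    name = path.lower()
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


def looks_binary(data):
    """Simple heuristic: text files do not contain a \\0 byte."""
    return b"\0" in data


def check_on_add(path, data, patterns=TEXT_PATTERNS):
    """Return (apply_conversions, warning) for a file being added.

    warning is None when the configuration and the content agree.
    """
    is_text = configured_as_text(path, patterns)
    binary = looks_binary(data)
    if is_text and binary:
        # The important case: never convert something that looks binary.
        return False, "%s is configured as text but contains binary bytes; conversions turned off" % path
    if not is_text and not binary:
        # The less useful opposite case: only a hint, nothing is converted.
        return False, "%s is treated as binary but contains no binary bytes; you may wish to enable conversions" % path
    return is_text, None
```

For example, check_on_add("notes.txt", b"hello\r\n") returns
(True, None), while a "notes.txt" that contains a \0 byte comes back
with conversions turned off and a warning for the user.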
No data would be converted until commit time, and these warnings would
come before commit time, so they would deal with the majority of data
loss concerns. I think this kind of solution would do the right thing
the majority of the time and be very unlikely to lose data. It's similar
to svn but would also double check that conversions really should be
applied when the user has set up auto conversions and the file appears
to contain some binary characters. So it would be safer than svn, and I
have not even had a problem with svn yet.