Binary file handling discussion

Nicholas Allen allen at
Fri Nov 3 15:15:46 GMT 2006

John and I were discussing this issue yesterday and I thought it might 
be useful to bring up the problem so it can be discussed here.

Handling end of line conversions is something that an RCS needs to do
*really* well so that the same files can be edited in multiple OSs that
have different line ending styles and file encodings. It would also be
nice if bzr could tell a text file from a binary file, because it may
wish to do other conversions such as keyword expansion as well as line
ending and file encoding translations (which would never be done for
binary files). So to me at least, it is clear we need some way to tell
bzr what is text and what is binary (or at least what conversions should
be done on a file).

I think, for text files, it would make sense to store them in the
repository using one line ending style. This will prevent massive diffs
happening in the repo when the same file is edited on different OSs and
the line endings are converted back and forth. I think it would make
sense to store the text files with the \n character rather than the
Windows \r\n as it is shorter and makes more sense anyway.
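
To make the round trip concrete, here is a minimal sketch in Python of
storing text with \n and converting back on checkout. The function names
are hypothetical and not part of bzr; the sketch handles only \r\n and
\n, and assumes the file has already been identified as text.

```python
import os


def to_repository_form(data: bytes) -> bytes:
    """Normalize Windows CRLF line endings to LF before storing.

    Hypothetical helper: must only be called on files known to be text.
    """
    return data.replace(b"\r\n", b"\n")


def to_working_form(data: bytes, newline: bytes = os.linesep.encode("ascii")) -> bytes:
    """Convert stored LF endings to the requested (default: native) ending."""
    if newline == b"\n":
        return data
    return data.replace(b"\n", newline)
```

Because the stored form is always \n, editing the same file on Windows
and on Unix produces the same repository content, so a diff only shows
lines that really changed. A file with mixed line endings would not
survive the round trip byte for byte, which is one more reason never to
apply this to a file that has not been marked as text.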

As I understand it, Bazaar does not do any conversion yet as there is
the risk of data loss (for example, a file that is binary is mistakenly
assumed to be a text file and the \r character codes get stripped from
it). The way CVS does it is really bad and it often makes mistakes by
assuming that all files are text files unless the user specifies that
they are binary (and often users forget this). So CVS's policy is one of
data destruction by default and I do not think this would be a good idea
for bzr!

Subversion is much better with its auto property setting feature and,
in my experience with it at least, it works pretty well. In fact, I
have not had a problem with it at all, so the assumption that everything
is binary unless otherwise stated seems to be a good solution in my
opinion.
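
A rough sketch of the binary-unless-told-otherwise policy, in Python.
The pattern list and the function name are made up for illustration and
are not bzr or svn configuration syntax; the point is only that a file
counts as text when the user has explicitly listed a matching pattern,
and as binary in every other case.

```python
from fnmatch import fnmatchcase

# Hypothetical user configuration: patterns explicitly declared as text.
TEXT_PATTERNS = ["*.txt", "*.py", "*.c", "*.h"]


def configured_as_text(filename: str, patterns=TEXT_PATTERNS) -> bool:
    """Return True only if the user listed a pattern matching this file.

    Anything that matches no pattern is treated as binary, so a
    forgotten entry means "no conversion" rather than data loss.
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)
```

With this default, forgetting to configure a file type costs the user a
missing conversion, which is easy to notice and fix, instead of a
corrupted binary.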

I think what Bazaar should do is have some sensible defaults and always
assume binary unless told otherwise. If it thinks a file is text (e.g.
because it ends in an extension that the user has explicitly configured
as a text file) but it determines that the file has binary bytes in it
(text files don't have a \0 character in them but binary files very
often do, for example) then it could warn the user when they add the
file and turn off all conversions. This would be a rare case but a
valuable extra check just to make sure it doesn't do line end conversion
on what is really a binary file. It would also be possible, but not
nearly as useful, to check for the opposite situation: it could warn
about a file that it thinks is binary but that contains no binary
characters, in case the user wishes to add conversions to it.
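
The add-time check described above could look something like the
following Python sketch. Everything here is hypothetical (the function
names, the return shape, the wording of the messages), and the \0 test
is the only heuristic used; it is meant to illustrate the decision, not
to describe how bzr behaves.

```python
from typing import Optional, Tuple


def contains_binary_bytes(data: bytes) -> bool:
    """Heuristic from the discussion: a NUL byte suggests binary content."""
    return b"\0" in data


def check_on_add(filename: str, data: bytes,
                 configured_as_text: bool) -> Tuple[bool, Optional[str]]:
    """Decide at add time whether conversions should apply to a file.

    Returns (apply_conversions, warning_message_or_None). Nothing is
    converted here; the result only records the decision and any
    warning to show the user before commit.
    """
    binary_looking = contains_binary_bytes(data)
    if configured_as_text and binary_looking:
        # Safety check: configuration says text, content says binary.
        return False, (f"{filename}: configured as text but contains "
                       "NUL bytes; conversions turned off")
    if configured_as_text:
        return True, None
    if not binary_looking:
        # Opposite, less important case: purely informational.
        return False, (f"{filename}: treated as binary but contains no "
                       "NUL bytes; you may wish to enable conversions")
    return False, None
```

One consequence of the \0 heuristic is that UTF-16 encoded text, which
normally contains NUL bytes, would be flagged as binary. That errs on
the safe side, since the worst outcome is a skipped conversion.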

It would not be until commit time that any data is converted, and these
warnings would come before commit time and would therefore deal with the
majority of data loss concerns. I think this kind of solution would do
the right thing the majority of the time and be very unlikely to lose
data. It's similar to svn but would also double check that conversions
really should be applied when the user sets up auto conversions and the
file appears to contain some binary chars. So it would be safer than svn
and I have not even had a problem with svn yet.


