Binary file handling discussion

Fri Nov 3 18:53:19 GMT 2006

What about this:

Use a file like '.bzrignore', let's say '.bzrtypes'.
The file is a list of wildcard expressions (regex and/or glob), each with a
set of flags.
Each line starts with a wildcard expression, which is matched against
absolute file names. After the expression comes a set of flags that tell bzr
what to do with the file. Possible flags are (in brackets: the default
behaviour when the flag is absent):

  - store-delta  (store complete)
  - diff (do not diff because it is a binary or it does not make sense)
  - merge (do not attempt to merge file when doing bzr merge, because it is
    binary or it does not make sense)
  - replaceEOF (do not touch the file)
  - substituteKeywords (do not touch the file)
  - .....

The expressions are evaluated top down. The first hit determines what flags
are applied to the file.
When no expression matches a filename, then bzr uses the most conservative
settings.

Here is an example:

============ start ===============

.*?\/config\/.*?\.xml          delta, diff, merge, replaceEOF
.*?\.xml                            delta, diff, merge, replaceEOF, substituteKeywords
.*?\.ps                             delta

============ end  ===============

First line: all XML files in any config directory are treated like source
files, but without keyword substitution.
Second line: all other XML files are treated like source files.
Third line: all PostScript files can be stored like ASCII files, but it makes
no sense to diff or merge them.

All other files are stored in the repository as they are or using a binary
diff; the user will never see diff output, bzr will never try to merge
those files, ....
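
To make the top-down, first-hit rule concrete, here is a rough Python sketch.
It is only an illustration under some assumptions: the names parse_types and
flags_for are made up, the expressions are treated as Python regexes that must
match the whole absolute path, an expression may not contain whitespace, and
"most conservative settings" is taken to mean "no flags at all".

```python
import re

# Flag names as used in the example above; treating anything else as an
# error is an assumption of this sketch, not part of the proposal.
KNOWN_FLAGS = {"delta", "diff", "merge", "replaceEOF", "substituteKeywords"}

EXAMPLE = r"""
.*?\/config\/.*?\.xml          delta, diff, merge, replaceEOF
.*?\.xml                            delta, diff, merge, replaceEOF, substituteKeywords
.*?\.ps                             delta
"""


def parse_types(text):
    """Parse '.bzrtypes'-style text into an ordered list of (regex, flags)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The expression is the first whitespace-delimited token;
        # the rest of the line is a comma-separated flag list.
        parts = line.split(None, 1)
        pattern = parts[0]
        flag_text = parts[1] if len(parts) > 1 else ""
        flags = frozenset(f.strip() for f in flag_text.split(",") if f.strip())
        unknown = flags - KNOWN_FLAGS
        if unknown:
            raise ValueError("unknown flags: %s" % ", ".join(sorted(unknown)))
        rules.append((re.compile(pattern), flags))
    return rules


def flags_for(path, rules):
    """Return the flags of the first rule matching path, top down."""
    for regex, flags in rules:
        if regex.fullmatch(path):
            return flags
    # No expression matched: fall back to the most conservative settings,
    # modelled here as an empty flag set (store complete, no diff, no merge).
    return frozenset()


if __name__ == "__main__":
    rules = parse_types(EXAMPLE)
    for path in ("/proj/config/app.xml", "/proj/doc/a.xml",
                 "/proj/fig.ps", "/proj/logo.png"):
        print(path, sorted(flags_for(path, rules)))
```

Because the first hit wins, the more specific config rule has to come before
the general XML rule; with the two lines swapped it would never be reached.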

I think this will give you a fair amount of flexibility, and the schema can
be expanded later by introducing new flags.

Ciao,
  Steffen

On 11/3/06, Nicholas Allen <allen at ableton.com> wrote:
>
> John and I were discussing this issue yesterday and I thought it might
> be useful to bring up the problem so it can be discussed here.
>
> Handling end of line conversions is something that an RCS needs to do
> *really* well so that the same files can be edited in multiple OSs that
> have different line ending styles and file encodings. It would also be
> nice if bzr could tell a text file from a binary one because it may wish to do
> other conversions such as keyword expansion as well as line ending and
> file encoding translations (that would never be done for binary files).
> So to me at least, it is clear we need some way to tell bzr what is text
> and what is binary (or at least what conversions should be done on a
> file).
>
> I think, for text files, it would make sense to store them in the
> repository using one line ending style. This will prevent massive diffs
> happening in the repo when the same file is edited on different OSs and
> the line endings are converted back and forth. I think it would make
> sense to store the text files with the \n character rather than windows
> \r\n as it is shorter and makes more sense anyway.
>
> As I understand it, Bazaar does not do any conversion yet as there is
> the risk for data loss (for example, a file that is binary is mistakenly
> assumed to be a text file and the \r character codes get stripped from
> it). The way CVS does it is really bad and it often makes mistakes by
> assuming that all files are text files unless the user specifies that
> they are binary (and often users forget this). So CVS's policy is one of
> data destruction by default and I do not think this would be a good idea
> for bzr!
>
> Subversion is much better with its auto property setting feature
> and, in my experience with it at least, it works pretty well. In fact, I
> have not had a problem with it at all so the assumption that everything
> is binary unless otherwise stated seems to be a good solution in my
> opinion.
>
> I think what Bazaar should do is have some sensible defaults and always
> assume binary unless told otherwise. If it thinks a file is text (eg
> because it ends in an extension that the user has explicitly configured
> as a text file) but it determines that the file has binary bytes in it
> (text files don't have a \0 character in them but binary files very
> often do for example) then it could warn the user when they add the file
> and turn off all conversions. This would be a rare case but a valuable
> extra check just to make sure it doesn't do line end conversion on what
> is really a binary file. It would also be possible, but not nearly as
> useful, to do a check for the opposite situation - so it could give a
> warning to a file that it thinks is binary but contains no binary
> characters and the user may wish to add conversions to it.
>
> It would not be until commit time that any data is converted and these
> warnings would be before commit time and would therefore deal with the
> majority of data loss concerns. I think this kind of solution would do
> the right thing the majority of the time and be very unlikely to lose
> data. It's similar to svn but would also double check that conversions
> really should be applied when the user sets up auto conversions and the
> file appears to contain some binary chars. So it would be safer than svn
> and I have not even had a problem with svn yet.
>
> Cheers,
>
> Nick
>
>
>
>
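
The safety check Nick describes (a file configured as text that turns out to
contain binary bytes gets a warning and no conversion) could look roughly like
the sketch below. The function names are made up for illustration, a \0 byte
is the only binary test, and only \r\n is converted to \n; none of this is
existing bzr code.

```python
def looks_binary(data):
    """Heuristic from the mail: text files do not contain a \\0 byte."""
    return b"\0" in data


def normalize_eol(data):
    """Convert Windows \\r\\n line endings to the \\n form kept in the repository."""
    return data.replace(b"\r\n", b"\n")


def prepare_for_commit(data, configured_as_text):
    """Return (bytes_to_store, warning_or_None).

    Files are assumed binary unless configured as text. A file configured
    as text that still contains a \\0 byte is stored untouched and a
    warning is returned, so no conversion can destroy binary data.
    """
    if not configured_as_text:
        return data, None
    if looks_binary(data):
        return data, "configured as text but contains \\0 bytes; conversions turned off"
    return normalize_eol(data), None


if __name__ == "__main__":
    print(prepare_for_commit(b"line one\r\nline two\r\n", True))
    print(prepare_for_commit(b"\x89PNG\r\n\x1a\n\x00\x00", True))
```

Doing the check before any conversion means the worst case is a missed
line-ending conversion, never a corrupted binary file.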