[MERGE REVIEW] Binary file handling

Martin Pool mbp at sourcefrog.net
Tue Apr 18 07:14:31 BST 2006


On 16 Apr 2006, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> Matthieu Moy wrote:
> | Aaron Bentley <aaron.bentley at utoronto.ca> writes:

> |>Binary files are defined as files containing the NUL character (\x00) in
> |>their first 1024 bytes.  Reportedly, this is the heuristic used by diff.
> |>This does, unfortunately, mean that UTF-16 files will be treated as
> |>binary.

Have you ever seen a UTF-16/UCS-2 source file in a tree?  I know they
might occur on Windows but it seems unlikely even there.  I suppose the
current diff code will (unknowingly) probably do the right thing with
them by seeing the end of lines.
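
For reference, the heuristic Aaron describes can be sketched in a few
lines of Python.  This is only an illustration of the rule as quoted
above, not the code under review; the name looks_binary and the
probe_size parameter are made up for the example.

```python
def looks_binary(data: bytes, probe_size: int = 1024) -> bool:
    """Return True if a NUL byte occurs in the first probe_size bytes.

    Hypothetical helper: it mirrors the rule quoted above, nothing more.
    """
    return b"\x00" in data[:probe_size]


# UTF-8 text never contains NUL unless the text itself has U+0000.
utf8_sample = "h\u00e9llo w\u00f6rld\n".encode("utf-8")

# UTF-16 encodes each ASCII character with a zero byte alongside it,
# so it trips the heuristic even though it is text.
utf16_sample = "hello world\n".encode("utf-16")

print(looks_binary(utf8_sample))
print(looks_binary(utf16_sample))
```

The UTF-16 sample is classified as binary, which is the drawback noted
in the quoted text.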

Incidentally, Perl does this:

perl> The -T  and -B  switches work as follows. The first block or so of the
perl> file is examined for odd characters such as strange control codes or
perl> characters with the high bit set. If too many strange characters (>30%)
perl> are found, it's a -B  file; otherwise it's a -T  file. Also, any file
perl> containing null in the first block is considered a binary file.

I don't think checking for the high bit is a very good idea anymore,
considering the increasing tendency towards utf-8 files.
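
To make the concern concrete, here is a rough Python approximation of
the rule as the Perl documentation describes it.  It is a sketch under
assumptions, not Perl's actual implementation: the function name, the
30% threshold parameter, the handling of empty input, and the set of
control codes treated as ordinary are all guesses made for the example.

```python
# Assumption: tab, newline, carriage return, backspace, form feed and
# escape are treated as ordinary; every other byte below 0x20 is "odd".
ORDINARY_CONTROLS = b"\t\n\r\x08\x0c\x1b"


def perl_style_is_binary(block: bytes, threshold: float = 0.30) -> bool:
    """Approximate the quoted -T/-B rule on one block of a file."""
    if not block:
        return False  # assumption: an empty block counts as text
    if b"\x00" in block:
        return True  # any NUL in the block means binary
    odd = sum(
        1
        for byte in block
        if byte >= 0x80 or (byte < 0x20 and byte not in ORDINARY_CONTROLS)
    )
    return odd / len(block) > threshold


ascii_text = b"plain old text\n"
# Three CJK characters: every UTF-8 byte of them has the high bit set.
utf8_text = "\u65e5\u672c\u8a9e\n".encode("utf-8")

print(perl_style_is_binary(ascii_text))
print(perl_style_is_binary(utf8_text))
```

Under these assumptions the UTF-8 sample is flagged as binary purely
because of its high-bit bytes, while the ASCII sample is not.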

Possibly we would eventually want bzr to know about both the line
endings and the character encoding to handle this properly, much as a
text editor has "utf-8 with cr", "ucs-2 with crlf", etc.

> | Perhaps it would be worth adding a way to tell bzr "this is text/this
> | is binary" in the user-interface (and this means a meta-info in the
> | repository)?
> 
> Martin felt that approach was too baroque, and that we should do it
> this way.

Not so much "baroque" as: why make people set something if it can be
automatically detected, and what should happen if it's set wrongly?

Suppose the binary flag is not set, but the file is actually binary -
would you want to display the binary garbage to the terminal, or do a
line-wise merge?  It seems to me that you would not; diff ought to check
whether the file contents are actually safe to display, regardless of
whether the user said it was binary or not.  Conversely if you (perhaps
incorrectly) marked it as binary you might still want to display the
diff.

The operations that would be affected by binary files are

1- storage of the file (line-wise diffs might not be useful for
   binaries)
2- diff
3- other operations that display file content, such as annotate or
   cat
4- merging 

A single "binary" bit is probably not quite enough: you might have files
(e.g.  vimrc) containing weird characters that are mergeable as text, or
plain text files that should never be automatically merged.

It seems to me the first thing is to make the internal operations have
options to treat them as either binaries or text, and to connect those
to either heuristics or user preferences expressed at the time.  (For
example 'bzr diff --text' to disable detection of binaries.)  So this is
a good step.
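
A minimal sketch of that layering, in Python: the diff operation itself
takes an option, and the caller decides whether it comes from the
heuristic or from the user.  The names diff_bytes, looks_binary and
force_text are hypothetical and this is not bzr's code; force_text just
stands in for what an option like 'bzr diff --text' would pass down.

```python
import difflib


def looks_binary(data: bytes, probe_size: int = 1024) -> bool:
    """Hypothetical heuristic: NUL in the first probe_size bytes."""
    return b"\x00" in data[:probe_size]


def diff_bytes(old: bytes, new: bytes, force_text: bool = False) -> str:
    """Return a unified diff, or a one-line notice for binary content.

    force_text disables detection, so the user's stated preference wins
    over the heuristic.
    """
    if not force_text and (looks_binary(old) or looks_binary(new)):
        return "Binary files differ\n"
    # Assumption: decode as UTF-8 and replace undecodable bytes, so the
    # sketch never raises on arbitrary input.
    old_lines = old.decode("utf-8", errors="replace").splitlines(keepends=True)
    new_lines = new.decode("utf-8", errors="replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, "old", "new"))


old = b"a\x00b\n"
new = b"a\x00c\n"
print(diff_bytes(old, new))                   # heuristic decides
print(diff_bytes(old, new, force_text=True))  # user override
```

The same shape would apply to merge, annotate and cat: each takes its
own text/binary option rather than reading one stored flag.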

-- 
Martin



