[MERGE REVIEW] Binary file handling

Jan Hudec bulb at ucw.cz
Tue Apr 18 22:01:14 BST 2006


On Tue, Apr 18, 2006 at 16:14:31 +1000, Martin Pool wrote:
> On 16 Apr 2006, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> > Matthieu Moy wrote:
> > | Aaron Bentley <aaron.bentley at utoronto.ca> writes:
> 
> > |>Binary files are defined as files containing the NUL character (\x00) in
> > |>their first 1024 bytes.  Reportedly, this is the heuristic used by diff.
> > |>This does, unfortunately, mean that UTF-16 files will be treated as
> > |>binary.
> 
> Have you ever seen a UTF-16/UCS-2 source file in a tree?  I know they
> might occur on Windows but it seems unlikely even there.  I suppose the
> current diff code will (unknowingly) probably do the right thing with
> them by seeing the end of lines.

I recently did: VC++6 does not accept UTF-8, but it more or less accepted
UTF-16. That drove ClearCase nuts.
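
To make the UTF-16 problem concrete, here is a minimal sketch of the heuristic
as described in the quoted text (a NUL byte in the first 1024 bytes). The
function name and the probe_size parameter are made up for illustration; this
is not the actual bzr code.

```python
def looks_binary(data: bytes, probe_size: int = 1024) -> bool:
    """Return True if a NUL byte occurs in the first probe_size bytes.

    Hypothetical helper; only the rule itself comes from the thread.
    """
    return b"\x00" in data[:probe_size]


utf8_text = "hello, world\n".encode("utf-8")
utf16_text = "hello, world\n".encode("utf-16")

# ASCII characters encoded as UTF-16 each carry a zero byte, so the same
# text is classified differently depending on its encoding.
print(looks_binary(utf8_text))   # False
print(looks_binary(utf16_text))  # True
```

Any UTF-16 file that contains ASCII-range characters trips the check in the
same way.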

> Incidentally Perl does
> 
> perl> The -T  and -B  switches work as follows. The first block or so of the
> perl> file is examined for odd characters such as strange control codes or
> perl> characters with the high bit set. If too many strange characters (>30%)
> perl> are found, it's a -B  file; otherwise it's a -T  file. Also, any file
> perl> containing null in the first block is considered a binary file.
> 
> I don't think checking for the high bit is a very good idea anymore,
> considering the increasing tendency towards utf-8 files.
> 
> Possibly we would eventually want bzr to know about both the line
> endings and the character encoding to handle this properly, much as a
> text editor has "utf-8 with cr", "ucs-2 with crlf", etc.

Yes and no. We have to be careful that such files don't drive bzr nuts the
way they drove ClearCase nuts. The use case was:

    We had a resource-file in some legacy encoding. Copied it over to
    translate to another language and committed. But the easiest way to
    actually fill it in was to convert it to utf-16. When I tried to commit
    the converted version over the previous one, clearcase just gave up with
    some error telling me that it can't add this as a text_file version.
    So I removed the file and added the utf-16 version. It autodetected the
    files as msword (which it certainly wasn't, but which is the oldest
    format actually using utf-16), so I had to manually tell it to treat it
    as binary_delta_file to get it to work.

So the summary is that we could have properties to tell diff what to
*display*, but the storage should always cope on its own.
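
A rough sketch of what I mean, assuming a hypothetical per-file encoding hint
(the function name, the hint and the UTF-8 fallback are all made up here, none
of this is an existing bzr interface). Only the display path looks at the
hint; storage would keep handling raw bytes and never consult it.

```python
from typing import Optional


def displayable_text(data: bytes,
                     encoding_hint: Optional[str] = None) -> Optional[str]:
    """Return text that is safe to show in a diff, or None for 'binary'.

    encoding_hint is a hypothetical user-set property. It is only advice:
    if it does not fit the content, fall back to the NUL heuristic.
    """
    if encoding_hint is not None:
        try:
            return data.decode(encoding_hint)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown hint: ignore it and use the heuristic
    if b"\x00" in data[:1024]:
        return None  # caller reports the file as binary
    # Assumption for this sketch: unhinted text is shown as UTF-8.
    return data.decode("utf-8", errors="replace")


utf16_file = "hello\n".encode("utf-16")
print(displayable_text(utf16_file))            # None
print(repr(displayable_text(utf16_file, "utf-16")))  # 'hello\n'
```

A wrongly set hint cannot break anything this way: the worst case is that
diff refuses to display the file, and commit is unaffected.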

> > | Perhaps it would be worth adding a way to tell bzr "this is text/this
> > | is binary" in the user-interface (and this means a meta-info in the
> > | repository)?
> > 
> > Martin felt that approach was too baroque, and that we should do it
> > this way.
> 
> Not so much "baroque" as: why make people set something if it can be
> automatically detected, and what should happen if it's set wrongly.
> 
> Suppose the binary flag is not set, but the file is actually binary -
> would you want to display the binary garbage to the terminal, or do a
> line-wise merge?  It seems to me that you would not; diff ought to check
> whether the file contents are actually safe to display, regardless of
> whether the user said it was binary or not.  Conversely if you (perhaps
> incorrectly) marked it as binary you might still want to display the
> diff.
> 
> The operations that would be affected by binary files are
> 
> 1- storage of the file (line-wise diffs might not be useful for
>    binaries)
> 2- diff
> 3- other operations that display file content, such as annotate or
>    cat
> 4- merging 
> 
> A single "binary" bit is probably not quite enough: you might have files
> (e.g.  vimrc) containing weird characters that are mergeable as text, or
> plain text files that should never be automatically merged.
> 
> It seems to me the first thing is to make the internal operations have
> options to treat them as either binaries or text, and to connect those
> to either heuristics or user preferences expressed at the time.  (For
> example 'bzr diff --text' to disable detection of binaries.)  So this is
> a good step.
> 
> -- 
> Martin
> 
-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>