[MERGE REVIEW] Binary file handling
Jan Hudec
bulb at ucw.cz
Tue Apr 18 22:01:14 BST 2006
On Tue, Apr 18, 2006 at 16:14:31 +1000, Martin Pool wrote:
> On 16 Apr 2006, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> > Matthieu Moy wrote:
> > | Aaron Bentley <aaron.bentley at utoronto.ca> writes:
>
> > |>Binary files are defined as files containing the NUL character (\x00) in
> > |>their first 1024 bytes. Reportedly, this is the heuristic used by diff.
> > |>This does, unfortunately, mean that UTF-16 files will be treated as
> > |>binary.
>
> Have you ever seen a UTF-16/UCS-2 source file in a tree? I know they
> might occur on Windows but it seems unlikely even there. I suppose the
> current diff code will (unknowingly) probably do the right thing with
> them by seeing the end of lines.
I recently did: VC++6 does not accept UTF-8, but it more or less
accepted UTF-16. It drove ClearCase nuts.
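For reference, the NUL-byte heuristic quoted above can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions, not the actual bzr code: the function name looks_binary and
the probe_size parameter are made up for the example. It shows why a
UTF-16 file holding plain ASCII text ends up classified as binary.

```python
def looks_binary(data: bytes, probe_size: int = 1024) -> bool:
    """Return True if a NUL byte occurs in the first probe_size bytes.

    Hypothetical helper: a sketch of the heuristic described in the
    thread, not the implementation used by bzr or diff.
    """
    return b"\x00" in data[:probe_size]


# The same text in two encodings.
utf8_text = "hello, world\n".encode("utf-8")
utf16_text = "hello, world\n".encode("utf-16")

# ASCII characters encoded as UTF-16 carry a zero byte each, so the
# heuristic flags the UTF-16 copy but not the UTF-8 one.
utf8_is_binary = looks_binary(utf8_text)
utf16_is_binary = looks_binary(utf16_text)
```

Note that a NUL byte beyond the probe window is not seen at all, so the
result depends on where in the file the first NUL happens to be.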
> Incidentally Perl does
>
> perl> The -T and -B switches work as follows. The first block or so of the
> perl> file is examined for odd characters such as strange control codes or
> perl> characters with the high bit set. If too many strange characters (>30%)
> perl> are found, it's a -B file; otherwise it's a -T file. Also, any file
> perl> containing null in the first block is considered a binary file.
>
> I don't think checking for the high bit is a very good idea anymore,
> considering the increasing tendency towards utf-8 files.
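To illustrate the concern about the high bit, here is a rough Python
sketch of a Perl-style check. It follows only the description quoted
above; the 512-byte block size, the set of control bytes treated as
normal text, and treating an empty file as text are all assumptions
made for the example, and perl_style_is_binary is a hypothetical name.

```python
BLOCK_SIZE = 512  # assumption: the quoted text only says "the first block or so"

# Assumption: control bytes that commonly appear in text files
# (backspace, tab, newline, form feed, carriage return, escape).
TEXT_CONTROLS = frozenset(b"\b\t\n\f\r\x1b")


def perl_style_is_binary(data: bytes, threshold: float = 0.30) -> bool:
    """Classify data as binary using the -T/-B style rule described above."""
    block = data[:BLOCK_SIZE]
    if not block:
        return False  # assumption: treat an empty file as text
    if b"\x00" in block:
        return True  # any NUL in the first block means binary
    # Count "odd" bytes: high bit set, or an unusual control code.
    odd = sum(
        1
        for byte in block
        if byte >= 0x80 or (byte < 0x20 and byte not in TEXT_CONTROLS)
    )
    return odd / len(block) > threshold


ascii_sample = b"plain ascii text\n"
cyrillic_sample = "привет мир\n".encode("utf-8")
```

With these assumptions the Cyrillic sample, which is perfectly good
UTF-8 text, is classified as binary because nearly every byte has the
high bit set.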
>
> Possibly we would eventually want bzr to know about both the line
> endings and the character encoding to handle this properly, much as a
> text editor has "utf-8 with cr", "ucs-2 with crlf", etc.
Yes and no. We have to be careful that such files do not drive bzr nuts
the way they drove ClearCase nuts. The use case was:
We had a resource file in some legacy encoding. I copied it over to
translate it into another language and committed it. But the easiest way
to actually fill it in was to convert it to UTF-16. When I tried to
commit the converted version over the previous one, ClearCase just gave
up with an error telling me that it couldn't add this as a text_file
version.
So I removed the file and added the UTF-16 version. ClearCase
autodetected the file as msword (which it certainly wasn't, but which is
the oldest format actually using UTF-16), so I had to manually tell it to
treat it as binary_delta_file to get it to work.
So the upshot is that we could have properties to tell diff what to
*display*, but the storage should always cope on its own.
> > | Perhaps it would be worth adding a way to tell bzr "this is text/this
> > | is binary" in the user-interface (and this means a meta-info in the
> > | repository)?
> >
> > Martin felt that approach was too baroque, and that we should do it
> > this way.
>
> Not so much "baroque" as: why make people set something if it can be
> automatically detected, and what should happen if it's set wrongly.
>
> Suppose the binary flag is not set, but the file is actually binary -
> would you want to display the binary garbage to the terminal, or do a
> line-wise merge? It seems to me that you would not; diff ought to check
> whether the file contents are actually safe to display, regardless of
> whether the user said it was binary or not. Conversely if you (perhaps
> incorrectly) marked it as binary you might still want to display the
> diff.
>
> The operations that would be affected by binary files are
>
> 1- storage of the file (line-wise diffs might not be useful for
> binaries)
> 2- diff
> 3- other operations that display file content, such as annotate or
> cat
> 4- merging
>
> A single "binary" bit is probably not quite enough: you might have files
> (e.g. vimrc) containing weird characters that are mergeable as text, or
> plain text files that should never be automatically merged.
>
> It seems to me the first thing is to make the internal operations have
> options to treat them as either binaries or text, and to connect those
> to either heuristics or user preferences expressed at the time. (For
> example 'bzr diff --text' to disable detection of binaries.) So this is
> a good step.
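The idea of an operation that takes either a heuristic or an explicit
user preference could look roughly like the Python sketch below. It is
not the bzr diff implementation: show_diff and its force_text parameter
are hypothetical names standing in for something like the '--text'
option mentioned above, and the detection is just the NUL check on the
first 1024 bytes.

```python
import difflib


def show_diff(old: bytes, new: bytes, force_text: bool = False) -> str:
    """Return a unified diff, or a short notice if either side looks binary.

    force_text disables the binary detection (hypothetical option).
    """

    def has_nul(data: bytes) -> bool:
        return b"\x00" in data[:1024]

    if not force_text and (has_nul(old) or has_nul(new)):
        return "Binary files differ\n"

    # Decode leniently so that forcing text mode never raises on bad bytes.
    old_lines = old.decode("utf-8", errors="replace").splitlines(keepends=True)
    new_lines = new.decode("utf-8", errors="replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, "old", "new"))


text_diff = show_diff(b"a\n", b"b\n")
detected = show_diff(b"a\x00\n", b"b\x00\n")
forced = show_diff(b"a\x00\n", b"b\x00\n", force_text=True)
```

Keeping the override as an argument to the operation, rather than as a
stored flag, means a wrong setting affects only what is displayed for
that one invocation.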
>
> --
> Martin
>
--
Jan 'Bulb' Hudec <bulb at ucw.cz>