[MERGE REVIEW] Binary file handling

Jan Hudec bulb at ucw.cz
Wed Apr 19 06:30:19 BST 2006


On Tue, Apr 18, 2006 at 17:21:33 -0400, Aaron Bentley wrote:
> Jan Hudec wrote:
> > On Tue, Apr 18, 2006 at 16:14:31 +1000, Martin Pool wrote:
> >>Have you ever seen a UTF-16/UCS-2 source file in a tree?  I know they
> >>might occur on Windows but it seems unlikely even there.  I suppose the
> >>current diff code will (unknowingly) probably do the right thing with
> >>them by seeing the end of lines.
> 
> It kinda will.  It will treat \x00\n correctly, but not \x01\n, etc.

I think it actually still does mostly the right thing.
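
A small Python sketch of the \x00\n versus \x01\n case. This is not bzr's
diff code; split_on_newline_byte is a made-up helper that splits after
every raw 0x0A byte, which is what a byte-oriented line splitter does.

```python
def split_on_newline_byte(data):
    """Split raw bytes after every 0x0A byte, ignoring any encoding."""
    parts = data.split(b"\n")
    lines = [p + b"\n" for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # trailing data without a final newline
    return lines


# Plain ASCII text in big-endian UTF-16: every newline is \x00\n,
# so the byte-level split coincides with the real line ends.
plain = "a\nb\n".encode("utf-16-be")
plain_lines = split_on_newline_byte(plain)

# U+010A encodes as \x01\n in big-endian UTF-16, so the splitter
# cuts this single line in two, in the middle of a character.
odd = "x\u010ay\n".encode("utf-16-be")
odd_lines = split_on_newline_byte(odd)
```

The spurious cut is at least deterministic: the same bytes are always cut
at the same place and joining the pieces gives back the original data, so
a line-wise comparison still has consistent units to work with.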

> >>Possibly we would eventually want bzr to know about both the line
> >>endings and the character encoding to handle this properly, much as a
> >>text editor has "utf-8 with cr", "ucs-2 with crlf", etc.
> > 
> > 
> > Yes and no. We have to be careful that such files don't drive bzr nuts the
> > way they did ClearCase.
> ...
> > So the summary is that we could have properties to tell diff what to
> > *display*, but the storage should always cope on its own.
> 
> Our storage system will handle any kind of file you feed it.  My changes
> are just about display and merge.

Yes. That's reasonable. What I meant is that even if we add properties,
the storage must not take them too seriously.

> But the behaviour of weave merge on a UTF-16 file would be improved if
> we correctly split on newlines, which would require us to detect the
> file encoding.

Well, I think all or almost all UTF-16 files out there start with a
byte-order mark. Windows uses it to tell UTF-16 files from 8-bit ones,
and Unix tools are hopefully aware of the original reason for it:
different Unix systems may differ in byte order.
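
A minimal sketch of detection by byte-order mark, assuming the file does
start with one. The function names are hypothetical and nothing here is
taken from bzr; only the codecs constants are standard Python.

```python
import codecs


def sniff_utf16_bom(data):
    """Return the codec name implied by a leading UTF-16 BOM, or None."""
    # Caveat: the UTF-32 little-endian BOM also begins with FF FE,
    # so this simple check would misreport such a file as UTF-16.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def split_utf16_lines(data):
    """Decode BOM-marked UTF-16 data and split it on real newlines.

    Returns None when no BOM is present, so the caller can fall back
    to byte-oriented splitting.
    """
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        return None
    text = data[2:].decode(encoding)  # skip the 2-byte BOM
    parts = text.split("\n")
    lines = [p + "\n" for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines


sample = codecs.BOM_UTF16_BE + "x\u010ay\n".encode("utf-16-be")
decoded_lines = split_utf16_lines(sample)
```

Once the text is decoded, a character such as U+010A no longer looks like
a line end, so the sample above stays a single line.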

> >>>| Perhaps it would be worth adding a way to tell bzr "this is text/this
> >>>| is binary" in the user-interface (and this means a meta-info in the
> >>>| repository)?
> >>>
> >>>Martin felt that approach was too baroque, and that we should do it
> >>>this way.
> >>
> >>Not so much "baroque" as: why make people set something if it can be
> >>automatically detected, 
> 
> I always conceived it as an override for the automatic detection.
> 
> >>and what should happen if it's set wrongly.
> 
> It would probably be more useful for forcing files as binary (e.g.
> uuencoded files) rather than as a way of forcing files to be text.

... yes. And forcing something as binary only makes sense for merge (i.e.
don't try to merge this textually; the result would be garbage anyway).

> >>Suppose the binary flag is not set, but the file is actually binary -
> >>would you want to display the binary garbage to the terminal, or do a
> >>line-wise merge?  It seems to me that you would not; diff ought to check
> >>whether the file contents are actually safe to display, regardless of
> >>whether the user said it was binary or not.  
> 
> If there were a significant number of text formats that used NUL,
> perhaps we should.  We can wait and see, I think.
> 
> >>Conversely if you (perhaps
> >>incorrectly) marked it as binary you might still want to display the
> >>diff.
> 
> I think if the user makes the effort to mark it as binary, it probably
> is, even if it contains no NULs.
> 
> >>A single "binary" bit is probably not quite enough: you might have files
> >>(e.g.  vimrc) containing weird characters that are mergeable as text, or
> >>plain text files that should never be automatically merged.
> 
> I think we're in agreement here.

Yes, just as with binary, if the user /explicitly/ marks something as text,
it is probably text even if it does contain NULs.
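
The precedence discussed here could be sketched roughly as below. It
assumes the automatic check is a NUL-byte test, as the thread implies;
the function names, the override values and the probe size are invented
for illustration and are not bzr's API.

```python
def looks_binary(data, probe_size=8192):
    """Heuristic: a NUL byte near the start of the data suggests binary."""
    return b"\x00" in data[:probe_size]


def treat_as_binary(data, override=None):
    """Decide how to handle a file for display and merge.

    override is None (no user statement), "text" or "binary".
    An explicit statement from the user always wins over the heuristic.
    """
    if override is None:
        return looks_binary(data)
    if override not in ("text", "binary"):
        raise ValueError("override must be None, 'text' or 'binary'")
    return override == "binary"
```

With this shape, a file containing NULs that the user marked as text is
merged textually, and a NUL-free file marked as binary (a uuencoded file,
say) is left alone by the textual merge.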

> >>It seems to me the first thing is to make the internal operations have
> >>options to treat them as either binaries or text, and to connect those
> >>to either heuristics or user preferences expressed at the time.  (For
> >>example 'bzr diff --text' to disable detection of binaries.)  So this is
> >>a good step.
> 
> Sounds reasonable.
> 
> Aaron
-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>

More information about the bazaar mailing list