Binary file support

John Arbash Meinel john at arbash-meinel.com
Thu Oct 13 15:18:17 BST 2005


Aaron Bentley wrote:
> Martin Pool wrote:
> 
>>>On 13/10/05, John Arbash Meinel <john at arbash-meinel.com> wrote:
>>>
>>>
>>>>I know Aaron mentioned a patch in the past, to add a binary flag to
>>>>files, so that we can more properly handle diff and merge.
>>>
>>>
>>>I'd rather have bzr just notice that the file is binary and therefore
>>>shouldn't be run through a text diff or merge.
> 
> 
> Well, it depends on how and where you intend to detect binaries.
> 
> diff's heuristic for 'binary' is reported to be 'contains NUL in the
> first 1k'.  For text diffing, another useful test is 'contains VT-102
> control characters'.

This also fails for UTF-16 files, which could be good candidates for
diff & patch. I believe a lot of Java files are UTF-16. I don't know the
specifics, other than that each code unit is 16 bits, so if you are
writing in a Western language every other byte is NUL.

I'm guessing difflib wouldn't have any problems; it is just an issue of
how to detect the "newline" character.
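
To make that concrete, here is a rough sketch (not anything bzr does
today) of the "NUL in the first 1k" test with an exemption for files
that start with a UTF-16 byte-order mark, plus a difflib run over the
decoded text. The names looks_binary and diff_utf16 are made up for
illustration, and the BOM exemption is my assumption; UTF-16 files
without a BOM would still be reported as binary.

```python
import codecs
import difflib

# Byte-order marks that identify UTF-16 text (assumed exemption, see above).
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def looks_binary(data, probe_size=1024):
    """Return True if the first probe_size bytes contain a NUL byte.

    Files starting with a UTF-16 BOM are treated as text even though
    they are full of NUL bytes.
    """
    head = data[:probe_size]
    if head.startswith(UTF16_BOMS):
        return False
    return b"\0" in head


def diff_utf16(old, new):
    """Decode two UTF-16 byte strings and return a unified diff as a list."""
    # Decoding first means newlines are found as characters, not as bytes.
    old_lines = old.decode("utf-16").splitlines(keepends=True)
    new_lines = new.decode("utf-16").splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new"))
```

Once the text is decoded, the "newline" question goes away, because
splitlines works on characters rather than on the raw 16-bit encoding.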

> 
> The problems with diffing binaries are
> 1. they don't (usually) tell humans anything useful.
> 2. they mess up terminals.
> 
> I would prefer to not run diff on binary files, rather than running it,
> then having it fail halfway through.  It's especially ugly if you're
> iterating through the diff, because you may have printed some of it
> before you realize you're diffing a binary.
> 
> Similarly, I would rather not attempt a text merge on binaries, instead
> of trying and failing.  Having it as a property would make it cheap to
> do detection in advance.  It would also allow the user to force a text
> to be treated as text or binary when the heuristic was wrong.
> 
> I suppose another option would be to have a 'binary-test-cache', indexed
> by sha-1 sum, but it just seemed simpler to ship the results of the test
> around in the inventory.
> 
> 
>>>I think a fast weave-like format needs to allow storing full copies
>>>from time to time (like arch cacherevs), so that you don't need to
>>>traverse all of history.
> 
> 
> Also, it allows you to truncate history.

So this is talking about the idea of a revised weave format, right? The
question is what would need to be put in a cached revision. Because
wouldn't you still want annotations for every line that is present? Now,
you might get some compaction because not all ancestors contribute to
the current text.

Unless you really just want to truncate the ancestry and pretend that
the current text is the baseline.

> 
> 
>>> For binary files (or some binary files) we
>>>could just store a full copy every time, so avoiding calculating
>>>useless diffs but still using just a single format.
> 
> 
> While some binaries will change dramatically with every revision (e.g.
> compression formats), others like executables or tarfiles will be
> largely the same.  So I'd be inclined to allow all binaries to be weaved.
> 
> Aaron

I think for a lot of binaries you could get some compression out of a
weave, though I don't know if it is worth it or not.
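
One way to get a feel for it would be to measure how much of a new
revision a line-oriented delta could reuse from the old one. The sketch
below is only an estimate under that assumption, not the weave
algorithm itself: it splits the bytes on newlines, matches the chunks
with difflib, and reports the fraction of the new file's bytes that
fall in matching chunks. The name reused_fraction is hypothetical.

```python
import difflib


def reused_fraction(old, new):
    """Fraction of new's bytes found in chunks that also appear in old.

    Both arguments are byte strings. Chunks are "lines" split on
    newline bytes, which is a crude stand-in for a line-based delta.
    """
    if not new:
        return 1.0
    old_chunks = old.splitlines(keepends=True)
    new_chunks = new.splitlines(keepends=True)
    # autojunk=False so frequently repeated chunks are not ignored.
    matcher = difflib.SequenceMatcher(None, old_chunks, new_chunks,
                                      autojunk=False)
    reused = sum(
        len(chunk)
        for block in matcher.get_matching_blocks()
        for chunk in new_chunks[block.b:block.b + block.size]
    )
    return reused / len(new)
```

A value near 1.0 suggests the delta would be small; a value near 0.0
(what I would expect for recompressed data) suggests a full copy costs
about the same.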

John
=:->