Binary file question
John Arbash Meinel
john at arbash-meinel.com
Thu Nov 2 15:56:37 GMT 2006
Nicholas Allen wrote:
> Hi,
>
> We are also considering using bazaar for other areas than just source
> code management. I was wondering how efficiently bzr stores changes to
> large binary files. As a music software company we have many
> large files such as waveforms and so on. Bzr is especially interesting
> for us here as well because one of our sound preset developers works
> from home and checks his changes into subversion. Sometimes one set of
> changes may be hundreds of MBs in size and this takes a very long time
> to transmit over the network. Also it may not be accepted, and svn often
> crashes trying to accept these large changes.
>
> I know the FAQ says bzr can be used in this way, but it is not a
> priority. But it would be really interesting for us because he could
> commit locally and sync up when he comes to work every few days.
>
> Does bzr store a binary diff of the changes or would the repository
> simply grow to enormous proportions if it were used this way?
>
> Cheers,
>
> Nick
Short answer, sure... we store binary deltas, but I wouldn't call them
*optimal* binary deltas.
It treats all files the same, ATM. We basically treat all files as
binary, with the only caveat that our diff algorithm splits them on LF
(0x0a) bytes.
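Roughly, the delta step amounts to something like this (a toy Python
sketch of the idea, not the actual bzr storage code; lf_delta is just a
name I made up for illustration):

# Toy sketch of a "diff that splits on LF": break both versions on 0x0a
# and keep only the chunk runs that actually changed.
from difflib import SequenceMatcher

def lf_delta(old, new):
    old_chunks = old.split(b"\n")
    new_chunks = new.split(b"\n")
    matcher = SequenceMatcher(None, old_chunks, new_chunks, autojunk=False)
    # 'equal' runs could be stored as references into the old text; only the
    # 'replace'/'insert'/'delete' runs need to carry bytes in the delta.
    return [(tag, old_chunks[i1:i2], new_chunks[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

old = b"header\n" + b"\x00\x01\x02" * 100 + b"\nfooter\n"
new = b"header\n" + b"\x00\x01\x02" * 100 + b"\nEXTRA\nfooter\n"
print(lf_delta(old, new))   # only the inserted b"EXTRA" chunk shows up

The big binary payload in the middle never ends up in the delta, as long
as it is untouched and the LF boundaries around it line up.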
If you are having problems with SVN, then I don't think binary diffs
would help you much anyway, considering SVN *has* binary diffs.
With something like waveforms, I don't know if you would have hunks in
common or not. Certainly common operations like "filter out noise" or
"increase/decrease volume" are going to touch almost every bit of the
file. Doing something like "remove this section" or "insert a section
here" is going to leave a lot of chunks alone.
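Here is a quick toy experiment (mine, nothing from the bzr code base)
showing the difference between those two kinds of edits once a file is
split into LF-delimited chunks:

import os
from collections import Counter

original = os.urandom(200_000)            # stand-in for raw sample data

# "Change the volume": touches every byte.
louder = bytes((b + 1) % 256 for b in original)

# "Remove this section": splice out a span in the middle.
spliced = original[:80_000] + original[120_000:]

def shared_fraction(a, b):
    """Fraction of b's LF-delimited chunks that also occur in a."""
    ca, cb = Counter(a.split(b"\n")), Counter(b.split(b"\n"))
    common = sum(min(n, ca[chunk]) for chunk, n in cb.items())
    return common / sum(cb.values())

print("volume change:  ", shared_fraction(original, louder))    # close to 0.0
print("section removed:", shared_fraction(original, spliced))   # close to 1.0

The whole-file edit shares essentially nothing with the original, while
the splice shares almost every chunk, so only the chunk straddling the
cut costs anything to store.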
But the truth is, (most?, many?) binary files don't binary diff that
well anyway. Frequently they are compressed, which means a modification
near the beginning tends to have a chain reaction over a large distance
(possibly the whole rest of the file).
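You can watch that chain reaction happen with a few lines of Python
(again, just my illustration, nothing bzr-specific):

import os, zlib

plain = os.urandom(1000) * 50           # ~50 KB that compresses well
edited = b"X" * 10 + plain[10:]         # small change near the beginning

a, b = zlib.compress(plain), zlib.compress(edited)
same = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
            min(len(a), len(b)))
print(len(a), len(b), "compressed bytes; identical prefix:", same)

On a typical run the two compressed streams diverge within their first
few dozen bytes and have little in common afterwards, even though the
underlying data differs by just 10 bytes out of 50,000, so a delta
between the stored (compressed) files saves very little.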
So bzr will handle binary files. If they happen to contain 0x0a (LF)
bytes, it will break them up into chunks at those points.
*I* would like to see us work more on a per-hunk rather than a per-line
mode, mostly for efficiency, because per-line is frequently too fine a
granularity for storage. For binary files, on the other hand, per-line
can be too coarse.
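To make "per-hunk" a bit more concrete, I am thinking of something in
the direction of content-defined chunking: cut wherever a rolling hash
of the last few bytes hits a boundary value, instead of cutting on 0x0a.
Nothing like this exists in bzr today; the window and mask below are
made-up numbers:

def rolling_chunks(data, window=32, mask=0x3FF):
    """Cut where a hash of the last `window` bytes hits a boundary value,
    so cut points depend only on nearby content, not on absolute offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h += byte
        if i >= window:
            h -= data[i - window]   # h covers only the last `window` bytes
        if i - start >= window and (h & mask) == 0:  # roughly one cut per KiB
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

Because the cut points depend only on a small window of content,
inserting or removing a section shifts the data but the boundaries
re-synchronize shortly after the edit, and most hunks still match byte
for byte.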
If we detect a binary file, we abort merging, mark it as a conflict, and
drop .THIS and .OTHER files in the working directory so the user can
pick which one they want to use.
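The detection itself is basically the usual cheap heuristic, if I
remember right, along these lines (a sketch from memory, not copied out
of the bzr source):

def looks_binary(first_bytes, probe=1024):
    # Call the file binary if a NUL byte shows up early on; text files
    # essentially never contain 0x00, so this is cheap and rarely wrong.
    return b"\x00" in first_bytes[:probe]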
So having the user mark files as 'binary' *might* open us up to using a
different diff algorithm, and fast-path the decision not to merge their
differences. The problem is that, because 'binary' is an optional user
setting, we have to do the detection anyway.
Especially because we don't want to go down the CVS route and default to
munging users' source. (By default, checking a binary file into CVS
breaks it, which is not exactly the right thing to do.)
So feel free to play with bzr on large binary files. We won't claim to
perform optimally, but we will claim that we won't destroy your data.
John
=:->