Diff and merge of archives - proposal

Martin (gzlist) gzlist at googlemail.com
Wed Oct 13 21:41:23 BST 2010


On 13/10/2010, jbowtie at amathaine.com <jbowtie at amathaine.com> wrote:
> On Thu, Oct 14, 2010 at 9:15 AM, Martin (gzlist) <gzlist at googlemail.com>
> wrote:
>>
>> My thought was why not store the uncompressed archive contents in the
>> repo, using pristine tar style hacks to reproduce the original archive
>> as needed? Versioning the content and a small hunk of archive metadata
>> would be better for bzr and is along the lines of Martin Pool's
>> thoughts on content filtering. Would mean existing binary blobs
>> wouldn't magically grow nice diff and merge behaviour, but saves repo
>> bloat from blob changes.
>>
>
> That *could* be viable, maybe. I think I'd need to work through some
> use cases to convince myself that it's worth the potential tradeoffs.
> Imagine versioning a DVD .iso - would you really want to create that
> on the fly?  Personally I'd rather investigate fixing large file
> handling via librsync, then repo bloat is less of a problem.
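
For illustration, the pristine-tar style idea from my earlier message can
be sketched in Python. This is a hypothetical sketch, not how pristine-tar
itself works: it assumes a plain .gz blob, and it only records the gzip
header mtime plus an assumed compression level, where a real tool would
capture every header field and compressor quirk needed for byte-exact
reproduction.

```python
import gzip
import io

def split_archive(gz_bytes):
    """Split a .gz blob into (uncompressed content, metadata).

    The pristine-tar style idea: version the uncompressed content,
    and keep just enough metadata to rebuild the original archive
    byte-for-byte on demand.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as f:
        content = f.read()
    # Minimal metadata: the mtime stored in the gzip header (bytes 4-8,
    # little-endian).  The compression level here is an assumption; a
    # real tool would also record the original filename, OS byte, and
    # any deflate implementation quirks.
    mtime = int.from_bytes(gz_bytes[4:8], "little")
    return content, {"mtime": mtime, "compresslevel": 9}

def rebuild_archive(content, meta):
    """Recompress the stored content, reproducing the original blob."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb",
                       mtime=meta["mtime"],
                       compresslevel=meta["compresslevel"]) as f:
        f.write(content)
    return buf.getvalue()
```

Round-tripping through split_archive and rebuild_archive with the same
zlib gives back an identical blob, which is what would let the repo store
only the (diffable) content.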

It's certainly more work to unpack archives for versioning rather than
just for diff and merge. However, even rsync and fancier binary diff
algorithms can't prevent blob bloat. Decent compression produces
output indistinguishable* from random noise, so every time the blob
changes it's the same as adding a whole new blob. Even stalwarts like
gzip need a particular flag (--rsyncable) passed at archive creation
to produce rsync-friendly output, and it's not widely used as it
increases the compressed size.
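A quick demonstration of the blob bloat point, using Python's zlib: flip
one byte in the input, and the two compressed streams diverge almost
immediately, so a delta between them saves next to nothing. The sample
data and edit offset here are arbitrary choices for the demo.

```python
import random
import zlib

# Deterministic, moderately compressible sample data.
random.seed(42)
data = bytes(random.choice(b"abcdefgh") for _ in range(64 * 1024))

# Flip one byte near the start, as a one-character edit inside an
# archive member would.
edited = data[:100] + b"Z" + data[101:]

a = zlib.compress(data, 9)
b = zlib.compress(edited, 9)

# How long do the two compressed streams stay identical?  The dynamic
# Huffman tables change as soon as the symbol frequencies do, so the
# common prefix is a tiny fraction of the stream.
prefix = next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)
print(len(a), len(b), prefix)
```

This is exactly why rsync-style rolling deltas work well on the
uncompressed content but poorly on the compressed blob.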

Martin

* In theory. In practice, of course, it varies.



More information about the bazaar mailing list