[RFC] compression-grouping records in pack files.

Fri Jun 22 02:09:32 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
>>If you can
>> make it smaller by compressing a series, I'd very much like to see
>> *that* number.

> If you leave the individual hunks uncompressed, and then compress them
> all in a single bzip stream, you can get:
> 
>   lh_parent        251,010  479:1
>   lh_child_child   199,552  602:1
>   linear           413,927  290:1
>   by_size          370,358  325:1

To even be in the same ballpark as a binary diff is pretty cool.
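
As a rough illustration of why one shared stream can beat per-hunk
compression, here is a minimal sketch. It is not the plugin's code: the
sample hunks and the names separate_size and grouped_size are made up for
the example, and Python's standard bz2 module stands in for whatever
produced the numbers above.

```python
import bz2


def separate_size(hunks):
    """Total bytes when every hunk gets its own bzip2 stream."""
    return sum(len(bz2.compress(h)) for h in hunks)


def grouped_size(hunks):
    """Bytes when all hunks are concatenated into one bzip2 stream."""
    return len(bz2.compress(b"".join(hunks)))


# Hypothetical data: 200 small hunks that resemble each other.
hunks = [
    b"@@ hunk %d @@\n+def frob(x):\n+    return x + %d\n" % (i, i)
    for i in range(200)
]

print("separate:", separate_size(hunks))
print("grouped: ", grouped_size(hunks))
```

Each separate stream pays its own header and cannot refer back to earlier
hunks, so the grouped figure should come out well below the separate one
for data like this.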

> Part of that, though is that xdelta3's output does not lend itself well
> to secondary compression. In fact, if I switch to bdiff, I get these
> numbers:
> 
>   lh_parent        151,674  792:1
>   lh_child_child   164,117  732:1
>   linear           266,584  451:1
>   by_size          250,520  480:1

This is also a line-based diff?  Wow.

> xdelta also has the simple advantage that it is much more efficient on
> files that don't have any '\n' in them. bdiff, mpdiff, difflib, etc.
> all fail on that account.

Yes.  OTOH, mpdiff can be used for generating annotated versions.  So
I've wondered whether we shouldn't use both.
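
To make the '\n' point concrete, here is a small sketch using difflib
from the Python standard library. The helper changed_bytes and the sample
data are invented for the example, and it only measures the inserted text
a line-based diff would have to carry; it is not how bdiff or mpdiff are
implemented.

```python
import difflib


def changed_bytes(old, new):
    """Characters of new text a line-based diff has to carry."""
    a = old.splitlines(True)
    b = new.splitlines(True)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            total += sum(len(line) for line in b[j1:j2])
    return total


# Hypothetical data: the same records, with and without newlines.
with_newlines = "".join("record %d\n" % i for i in range(1000))
no_newlines = with_newlines.replace("\n", ";")

# Change a single record in each variant.
edited_nl = with_newlines.replace("record 500\n", "record five hundred\n")
edited_flat = no_newlines.replace("record 500;", "record five hundred;")

print(changed_bytes(with_newlines, edited_nl))  # just the edited line
print(changed_bytes(no_newlines, edited_flat))  # the entire file
```

With no '\n' the whole file is one "line", so a one-record edit costs a
full copy of the text.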

> Although I'm guessing you are also re-using the deltas from
> the knits here, right? So while it *is* representative of how long it
> would take to convert a Knit-based repository into a bundle, it *isn't*
> representative of how long it would take to create/commit to a
> bundle-based repository.

Correct.

>> The plugin you're using is doing that, and in fact, so does the new
>> bundle format.  But the new bundle format avoids *comparisons*.
> 
> Because it has already done them?

Because we've already done the comparisons when we generated the knit
deltas.

> It doesn't quite follow that you can
> not do comparisons if you have already done comparisons. It is an
> important number for "given I already have the data I want, turn it into
> a bundle".

This is more "given the data we have today, generate a bundle reusing as
much info as you can".

> Which is very different from "generate a delta from these
> texts". They are both valid, but I'm evaluating a delta operation, not
> creating a bundle. So we have a bit of talking past each other.

I make no bones about focusing on bundle generation performance at the
moment.

>>>> For comparison, creating all 1300 deltas takes 2.1s with xdelta3. Yes,
>>>> that is *seconds* not minutes. Running 'mp-regen' on the same data takes
>>>> 23 *minutes*.
>> Yes, but that is essentially a worst-case scenario, where the poor thing
>> has no caching or snapshots and has to extract every parent text before
>> it can even start comparing...
> 
> Well, are you extracting it from the target before you do the next one?

Yes, each time I generate an mpdelta, I extract the parents from the
mpknit, from scratch (i.e. no significant snapshots).
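
A back-of-the-envelope sketch of why that hurts, under a deliberately
simplified assumption: a purely linear chain in which text 0 is stored
whole and every later text is a delta against its predecessor. Real
history is not linear and mpknit is not implemented this way; the function
delta_applications is hypothetical and only counts work.

```python
def delta_applications(n_texts, cached):
    """Count delta applications needed to obtain every parent text.

    Model: text 0 is a full text, text k is a delta against text k - 1,
    and there are no snapshots in between.
    """
    total = 0
    for child in range(1, n_texts):
        parent = child - 1
        if cached:
            # The grandparent is still in memory: one more delta, at most.
            total += 1 if parent > 0 else 0
        else:
            # Rebuild the parent from text 0 every single time.
            total += parent
    return total


n = 1300  # number of deltas mentioned earlier in the thread
print("from scratch:", delta_applications(n, cached=False))
print("with a cache:", delta_applications(n, cached=True))
```

In this model the uncached count grows quadratically with the length of
the chain while the cached one grows linearly, which is the sense in which
this is a worst case.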

>> Well, whatever we switch to, that's probably a sensible choice to stick
>> in our bundles, also.

> There are tradeoffs which we could weigh. For example, some algorithms
> produce significantly smaller output at the expense of extraction speed.
> 
> I would guess that ultimately simplicity outweighs the rest of the
> tradeoffs, though.

Well, the ability to avoid recalculating the diffs in order to produce
bundles is pretty valuable, I think.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGeyFM0F+nu1YWqI0RAkU3AJ98s4q3+1vdYc1u2JK4ouZjztvQPQCdHSUT
m2/jGH8WMvLJoURVmR9fVkI=
=/rrV
-----END PGP SIGNATURE-----