[RFC] compression-grouping records in pack files.
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jun 22 02:09:32 BST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John Arbash Meinel wrote:
>> If you can
>> make it smaller by compressing a series, I'd very much like to see
>> *that* number.
> If you don't compress the hunks, and then compress them all in a single
> bzip stream you can get:
>
> lh_parent        251,010   479:1
> lh_child_child   199,552   602:1
> linear           413,927   290:1
> by_size          370,358   325:1
To even be in the same ballpark as a binary diff is pretty cool.
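A minimal sketch of the effect being measured above: compressing each hunk on its own versus concatenating the hunks and compressing them as one bzip2 stream. The hunk data and the helper name `compressed_sizes` are hypothetical stand-ins, not the data or code behind the quoted numbers.

```python
import bz2


def compressed_sizes(hunks):
    """Return (total size compressing each hunk alone, size of one shared stream)."""
    individually = sum(len(bz2.compress(h)) for h in hunks)
    grouped = len(bz2.compress(b"".join(hunks)))
    return individually, grouped


# Hypothetical data: many small, similar hunks, as successive deltas of one
# file tend to be.
hunks = [
    ("line %d of a shared file\n" % (i % 7)).encode("ascii") * 20
    for i in range(200)
]

individually, grouped = compressed_sizes(hunks)
print("individually:", individually, "grouped:", grouped)
```

Each separate stream pays its own header and cannot reuse redundancy found in its neighbours, which is why grouping tends to win on small, similar records.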
> Part of that, though, is that xdelta3's output does not lend itself well
> to secondary compression. In fact, if I switch to bdiff, I get these
> numbers:
>
> lh_parent        151,674   792:1
> lh_child_child   164,117   732:1
> linear           266,584   451:1
> by_size          250,520   480:1
This is also a line-based diff? Wow.
> xdelta also has the simple advantage that it is much more efficient on
> files that don't have any '\n' in them. bdiff, mpdiff, difflib, etc.
> all fail on that account.
Yes. OTOH, mpdiff can be used for generating annotated versions. So
I've wondered whether we shouldn't use both.
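To illustrate the newline point, here is a rough sketch using Python's difflib as a stand-in for any line-based differ. It counts the bytes of new text a line-based delta would have to carry. The function name `line_delta_size` and the sample data are assumptions for illustration; this is not how bdiff or mpdiff are implemented.

```python
import difflib


def line_delta_size(old, new):
    """Bytes of new text a line-based delta must store (inserted or replaced lines)."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    size = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            size += sum(len(line) for line in new_lines[j1:j2])
    return size


# Hypothetical data: the same small edit, with and without line breaks.
old_text = b"".join(b"row %d\n" % i for i in range(1000))
new_text = old_text.replace(b"row 500\n", b"row five hundred\n")

old_flat = old_text.replace(b"\n", b" ")
new_flat = new_text.replace(b"\n", b" ")

with_newlines = line_delta_size(old_text, new_text)
without_newlines = line_delta_size(old_flat, new_flat)
print(with_newlines, without_newlines)
```

With line breaks, only the changed line is stored. Without them the whole file is a single "line", so the delta degenerates to a full copy of the new text.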
> Although I'm guessing you are also re-using the deltas from
> the knits here, right? So while it *is* representative of how long it
> would take to convert a Knit-based repository into a bundle, it *isn't*
> representative of how long it would take to create/commit to a
> bundle-based repository.
Correct.
>> The plugin you're using is doing that, and in fact, so does the new
>> bundle format. But the new bundle format avoids *comparisons*.
>
> Because it has already done them?
Because we've already done the comparisons when we generated the knit
deltas.
> It doesn't quite follow that you can
> not do comparisons if you have already done comparisons. It is an
> important number for "given I already have the data I want, turn it into
> a bundle".
This is more a case of "given the data we have today, generate a bundle
reusing as much info as you can".
> Which is very different from "generate a delta from these
> texts". They are both valid, but I'm evaluating a delta operation, not
> creating a bundle. So we have a bit of talking past each other.
I make no bones about focusing on bundle generation performance at the
moment.
>>>> For comparison, creating all 1300 deltas takes 2.1s with xdelta3. Yes,
>>>> that is *seconds* not minutes. Running 'mp-regen' on the same data takes
>>>> 23 *minutes*.
>> Yes, but that is essentially a worst-case scenario, where the poor thing
>> has no caching or snapshots and has to extract every parent text before
>> it can even start comparing...
>
> Well, are you extracting it from the target before you do the next one?
Yes, each time I generate an mpdelta, I extract the parents from the
mpknit, from scratch (i.e. no significant snapshots).
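As a rough feel for why extracting from scratch hurts, here is a toy model, assuming a purely linear chain with one full text at version 0 and one delta per later version. Real mpknits are multi-parent, so this is only a sketch; the function `count_applications` is hypothetical, and 1300 is simply the number of deltas mentioned above.

```python
def count_applications(n_versions, use_cache):
    """Count delta applications needed to extract the parent text of every
    version in a linear chain whose only full text is version 0."""
    available = {0}  # versions whose full text is already in hand
    applied = 0
    for version in range(1, n_versions):
        parent = version - 1
        v = parent
        # Walk back towards a text we already have, one delta per step.
        while v not in available:
            applied += 1
            v -= 1
        if use_cache:
            # Keep the extracted parent text around for the next version.
            available.add(parent)
    return applied


from_scratch = count_applications(1300, use_cache=False)
cached = count_applications(1300, use_cache=True)
print("from scratch:", from_scratch, "with cache:", cached)
```

In this model the uncached work grows quadratically with the number of versions, while caching the previous text keeps it linear.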
>> Well, whatever we switch to, that's probably a sensible choice to stick
>> in our bundles, also.
> There are tradeoffs which we could weigh. For example, some algorithms
> produce significantly smaller output at the expense of extraction speed.
>
> I would guess that ultimately simplicity outweighs the rest of the
> tradeoffs, though.
Well, the ability to avoid recalculating the diffs in order to produce
bundles is pretty valuable, I think.
Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGeyFM0F+nu1YWqI0RAkU3AJ98s4q3+1vdYc1u2JK4ouZjztvQPQCdHSUT
m2/jGH8WMvLJoURVmR9fVkI=
=/rrV
-----END PGP SIGNATURE-----