[RFC] compression-grouping records in pack files.
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jun 22 00:08:29 BST 2007
John Arbash Meinel wrote:
> Quick note: could you use a '\n' somewhere in your index for mpidx?
It was really just something I threw together, that I don't plan to use
in production. Feel free to tweak it.
> Specifically, I think it makes the most sense to have it implemented as
> a nested *something*. So that you might say:
>
> The next blob is a compressed blob containing these records.
>
> At the very least, I think you need a way to skip past a compressed
> portion without having to decompress it. Which means you need a wrapper
> which tells you how long it is.
Yes, I was thinking that the compression groups would be length-prefixed
records.
>> This is builtins.py?
>
>> With individually gzipped records, I get 475,051.
>
>> Re-bzip2ing the whole thing gives me 191,166. (Hah! eat *that*, xdelta.)
>
>
> Well, this is also individually compressed (by xdelta3). You can
> actually tell it not to compress, and instead compress "out of band". I
> didn't try that across the whole range. I can say that gzip/bzip2 of
> xdelta hunks is worse than xdelta compression. (Probably the xdelta
> sorting is bad for gzip/bzip2). I would expect bzip2 over the range to
> be better.
>
> So the generally "comparable" number is:
> lh_parent 384,440 (4 full=51,394, 1285 delta=333,046)
> versus
> 475,051 for gzipped multi-parent diffs.
It might be comparable, but it's not a very useful number. For
multi-parent diffs, we would definitely want to compress them a lot.
I assumed that 384,440 was the smallest xdelta would go. If you can
make it smaller by compressing a series, I'd very much like to see
*that* number.
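To illustrate why compressing a series can beat per-record compression, here is a small sketch (using zlib as a stand-in compressor; the record contents are made up): compressing each record alone pays per-stream overhead and can't exploit redundancy *between* records, while compressing the concatenated series can.

```python
import zlib

# Hypothetical similar records, like successive versions of a text.
records = [b"def foo():\n    return %d\n" % i for i in range(200)]

# Each record compressed separately: per-stream overhead, no cross-record matches.
individual = sum(len(zlib.compress(r)) for r in records)

# The whole series compressed at once: redundancy across records is exploited.
grouped = len(zlib.compress(b"".join(records)))

print(individual, grouped)  # grouped should be much smaller
```

This is the same effect as the bzip2-over-the-whole-bundle number above: the series compressor sees all the near-duplicate text at once.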
> So it is very comparable to lh_parent with xdelta. (Except for the
> speed, which is off by a factor of around 200+:1)
That speed isn't representative of mpdiff's potential -- I was
deliberately disabling snapshotting and caching there.
I would say my "4 minutes to convert bzr.dev to a bundle" is more
representative.
> I think you would find a benefit (the whole children add stuff so deltas
> are just "remove lines 10-20"). However, I think it would mean you have
> considerably more work to do (since you probably have to extract the
> actual texts). But maybe you already do that for MP.
The plugin you're using is doing that, and in fact, so does the new
bundle format. But the new bundle format avoids *comparisons*.
> For comparison, creating all 1300 deltas takes 2.1s with xdelta3. Yes,
> that is *seconds* not minutes. Running 'mp-regen' on the same data takes
> 23 *minutes*.
Yes, but that is essentially a worst-case scenario, where the poor thing
has no caching or snapshots and has to extract every parent text before
it can even start comparing...
> Well, that presumes we've done everything as Knits to start with. And
> since we are considering switching wholesale to something else, both
> need to be evaluated.
Well, whatever we switch to, that's probably a sensible choice to stick
in our bundles, also.
Aaron