[RFC] compression-grouping records in pack files.
John Arbash Meinel
john at arbash-meinel.com
Fri Jun 22 01:02:44 BST 2007
Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>> Quick note: could you use a '\n' somewhere in your index for mpidx?
>
> It was really just something I threw together, that I don't plan to use
> in production. Feel free to tweak it.
>
>>> Specifically, I think it makes the most sense to have it implemented as
>>> a nested *something*. So that you might say:
>>>
>>> The next blob is a compressed blob containing these records.
>>>
>>> At the very least, I think you need a way to skip past a compressed
>>> portion without having to decompress it. Which means you need a wrapper
>>> which tells you how long it is.
>
> Yes, I was thinking that the compression groups would be length-prefixed
> records.
Which brings up the question of whether we would also want the wrapper
to mention what is contained in the compressed section. Though
certainly you could just decompress, figure that out, and write your
index based on that.
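To make that concrete, here is a minimal sketch (in Python) of the
kind of wrapper I mean. This is not a proposed on-disk format; the
4-byte length prefix and zlib are just stand-ins:

  import struct
  import zlib

  def write_group(out, records):
      # Compress all records in the group as one blob, and prefix it
      # with its compressed length so a reader can step over the
      # whole group without inflating it.
      blob = zlib.compress(b''.join(records))
      out.write(struct.pack('>I', len(blob)))
      out.write(blob)

  def skip_group(f):
      # Read the 4-byte length prefix and seek past the blob.
      (length,) = struct.unpack('>I', f.read(4))
      f.seek(length, 1)

A small manifest of record keys could sit next to the length prefix,
which would give you the "what is contained" information without
decompressing anything.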
>
>>>> This is builtins.py?
>>>> With individually gzipped records, I get 475,051.
>>>> Re-bzip2ing the whole thing gives me 191,166. (Hah! eat *that*, xdelta.)
>>>
>>> Well, this is also individually compressed (by xdelta3). You can
>>> actually tell it not to compress, and instead compress "out of band". I
>>> didn't try that across the whole range. I can say that gzip/bzip2 of
>>> xdelta hunks is worse than xdelta compression. (Probably the xdelta
>>> sorting is bad for gzip/bzip2). I would expect bzip2 over the range to
>>> be better.
>>>
>>> So the generally "comparable" number is:
>>> lh_parent 384,440 (4 full=51,394, 1285 delta=333,046)
>>> versus
>>> 475,051 for gzipped multi-parent diffs.
>
> It might be comparable, but it's not a very useful number. For
> multi-parent diffs, we would definitely want to compress them a lot.
>
> I assumed that 384,440 was the smallest xdelta would go. If you can
> make it smaller by compressing a series, I'd very much like to see
> *that* number.
Well, the "optimum" for that series of texts with per-hunk compression was:
lh_child_child 280,676 (1 full=40,956, 1288 delta=239,720)
If you don't compress the hunks individually, and instead compress
them all in a single bzip2 stream, you get:
lh_parent 251,010 479:1
lh_child_child 199,552 602:1
linear 413,927 290:1
by_size 370,358 325:1
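The effect is easy to reproduce in miniature. A rough sketch, with
plain zlib/bz2 standing in for the per-hunk and whole-stream
compressors (hunks would be the uncompressed delta texts):

  import bz2
  import zlib

  def per_hunk_size(hunks):
      # Each hunk compressed on its own: redundancy *between* hunks
      # is invisible to the compressor.
      return sum(len(zlib.compress(h)) for h in hunks)

  def whole_stream_size(hunks):
      # One stream over everything: cross-hunk redundancy gets used.
      # The index would have to record each hunk's (offset, length)
      # within the uncompressed stream so records can be found again.
      return len(bz2.compress(b''.join(hunks)))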
You can also enable "secondary compression" for xdelta, which
basically enables Huffman encoding of the delta stream.
I believe test_mp.mpknit with the same data source is 391,567 bytes,
and zcat test_mp.mpknit | bzip2 gives 159,655 bytes.
Part of that, though, is that xdelta3's output does not lend itself
well to secondary compression. In fact, if I switch to bdiff, I get
these numbers:
lh_parent 151,674 792:1
lh_child_child 164,117 732:1
linear 266,584 451:1
by_size 250,520 480:1
So actually, bdiff + bzip2 of the whole stream does better than even
mpdiff. And oddly enough, lh_child_child does worse than lh_parent.
(Apparently bzip2 doesn't like removing sections as much as it likes
adding sections.) It could also have something to do with the fact
that lh_parent has very small full texts, which bzip2 can compress the
hell out of (since they are virtually identical), versus one big full
text (which doesn't have nearly as much in common with itself).
I should also note that in the simple cases I originally tested,
bdiff's output was very much on par with that of patiencediff and
difflib. All three are line based, and thus find the exact same lines
that differed.
xdelta3 doesn't get the same win from bzip2 because it already does a
delta compression of the text against itself, using an inferior
algorithm that messes up bzip2's method.
Ultimately, I think they are all within a range to be considered. And
any real variation here needs to be weighed against a much larger corpus
of data.
xdelta also has the simple advantage that it is much more efficient on
files that don't have any '\n' in them. bdiff, mpdiff, difflib, etc.
all fail on that account.
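difflib makes that failure mode easy to see. A quick sketch (the data
here is made up; autojunk=False just keeps difflib's popularity
heuristic from muddying the byte-level comparison):

  import difflib

  # No '\n' anywhere, so the whole file is a single "line".
  old = b' '.join(('word%d' % i).encode() for i in range(500))
  new = old.replace(b'word250', b'WORD250')

  as_lines = difflib.SequenceMatcher(
      None, old.splitlines(True), new.splitlines(True))
  print(as_lines.get_opcodes())
  # -> [('replace', 0, 1, 0, 1)]: the one giant line replaced whole.

  as_bytes = difflib.SequenceMatcher(None, old, new, autojunk=False)
  changed = sum(i2 - i1 for tag, i1, i2, j1, j2
                in as_bytes.get_opcodes() if tag != 'equal')
  print(changed)
  # -> only a handful of bytes, which is the kind of delta a
  #    byte-oriented algorithm like xdelta can produce.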
>
>>> So it is very comparable to lh_parent with xdelta. (Except for the
>>> speed, which is off by a factor of around 200+:1)
>
> That speed isn't reflective of mpdiff's potential-- I was deliberately
> turning off snapshotting there, and not caching and stuff.
>
> I would say my "4 minutes to convert bzr.dev to a bundle" is more
> representative.
fairy-nuff. Although I'm guessing you are also re-using the deltas
from the knits here, right? So while it *is* representative of how
long it would take to convert a Knit-based repository into a bundle,
it *isn't* representative of how long it would take to create/commit
to a bundle-based repository.
>
>>> I think you would find a benefit (the whole children add stuff so deltas
>>> are just "remove lines 10-20"). However, I think it would mean you have
>>> considerably more work to do (since you probably have to extract the
>>> actual texts). But maybe you already do that for MP.
>
> The plugin you're using is doing that, and in fact, so does the new
> bundle format. But the new bundle format avoids *comparisons*.
Because it has already done them? Avoiding comparisons by having
already done them doesn't quite count. It is an important number for
"given I already have the data I want, turn it into a bundle", which
is very different from "generate a delta from these texts". They are
both valid, but I'm evaluating a delta operation, not creating a
bundle. So we are talking a bit past each other.
>
>>> For comparison, creating all 1300 deltas takes 2.1s with xdelta3. Yes,
>>> that is *seconds* not minutes. Running 'mp-regen' on the same data takes
>>> 23 *minutes*.
>
> Yes, but that is essentially a worst-case scenario, where the poor thing
> has no caching or snapshots and has to extract every parent text before
> it can even start comparing...
Well, are you extracting it from the target before you do the next
one? Because I will admit that I'm extracting all 1300 texts into
memory at the start, and then delta-ing against them. It is only about
130MB of RAM, so it isn't terribly expensive, and I specifically
wanted to check the speed of doing deltas. I can say that once
extracting a single text far from the beginning takes 100ms to 1000ms,
1200 extractions run anywhere from 1200 * 0.1s = 120s = 2 minutes up
to 1200 * 1s = 1200s = 20 minutes.
Extracting everything from a knit isn't terrible, though (neither is
it super fast :). It takes about 9s to extract all 1300 revisions. Now
this is helped by the fact that the Knit.get_texts() code is geared
towards extracting more than one text at a time, so it has a
relatively efficient extraction of all texts. (My test_clean.py also
tries to use the maximum distance between snapshots that bzrlib will
allow.)
But according to my numbers, xdelta takes 650ms to decompress all 1300
revisions, and bdiff takes 325ms. Both of which are a wee bit smaller
than 9s.
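For reference, those numbers come out of a loop of roughly this shape;
time_extraction and the extract_all callables are stand-ins for my
test harness, not bzrlib APIs:

  import time

  def time_extraction(extract_all, repeats=3):
      # Run the full 1300-text extraction a few times and keep the
      # best wall-clock time, so noise doesn't skew the comparison.
      best = None
      for _ in range(repeats):
          start = time.time()
          extract_all()
          elapsed = time.time() - start
          if best is None or elapsed < best:
              best = elapsed
      return best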
I was able to eke out another <10% from xdelta by using inline and
unlikely macros. But I'm writing up another email as to why the data
format is not likely to get much faster. (Basically, there are a few
layers of indirection defined by the data format, both for compression
size and for allowing different encoders to encode differently.) So we
should keep in mind that it is only "slow" in that it isn't faster
than one possible alternative at extraction speed.
>
>>> Well, that presumes we've done everything as Knits to start with. And
>>> since we are considering switching wholesale to something else, both
>>> need to be evaluated.
>
> Well, whatever we switch to, that's probably a sensible choice to stick
> in our bundles, also.
>
> Aaron
There are tradeoffs which we could weigh; some algorithms produce
significantly smaller output at the expense of extraction speed. I
would guess that ultimately simplicity outweighs the rest of the
tradeoffs, though.
John
=:->