[RFC] compression-grouping records in pack files.
John Arbash Meinel
john at arbash-meinel.com
Thu Jun 21 22:45:08 BST 2007
Quick note: could you use a '\n' somewhere in your index for mpidx?
I realize it isn't strictly necessary since you are using length-prefixed
strings, but it would make reading the index in a text editor possible.
(Then again, you may just be using bencode, which is pretty awful to read
as text.)
Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>> Aaron Bentley wrote:
>
>>>> It occurs to me that, for bundles at least, there would be advantages to
>>>> compressing a series of records as a group.
>
>>> And then you could bz/gzip/etc compress that whole series. Which also
>>> has the advantage that as you start reading the compressed stream, you
>>> only have to decompress to the point that you've extracted what you care
>>> about (delta 3); you don't have to decompress delta 4. (Though a "naive"
>>> implementation might, and probably wouldn't lose a whole lot unless the
>>> chain was really long, 100s/1000s.)
>
> Right.
>
>>> There are some nice I/O patterns, etc to this layout. But I'm not sure
>>> how it fits in the "Bundle Format" that has been discussed. (Whether you
>>> would actually need to create a container in a container, or whether it
>>> would be explicitly addressed by the overall format).
>
> I was suggesting addressing it in the overall format. It would be
> possible to do it by nesting containers, as you note, but seems awkward
> to use.
That was my discussion. I agree that handling it in the overall format
would be nice, but it does introduce a level of complexity.
Specifically, I think it makes the most sense to have it implemented as
a nested *something*, so that you might say: "the next blob is a
compressed blob containing these records."
At the very least, I think you need a way to skip past a compressed
portion without having to decompress it, which means you need a wrapper
that tells you how long it is.
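Roughly what I have in mind, as an untested sketch (the 4-byte length
prefix and zlib are just placeholders, not the actual Bundle/container
format):

  import struct
  import zlib

  def write_group(out, records):
      """Write one compressed group: a length prefix, then the compressed blob."""
      blob = zlib.compress(b''.join(records))
      out.write(struct.pack('>I', len(blob)))   # 4-byte big-endian length
      out.write(blob)

  def skip_group(f):
      """Jump past a group without decompressing it -- the length makes this possible."""
      (length,) = struct.unpack('>I', f.read(4))
      f.seek(length, 1)

  def read_group(f):
      """Read and decompress the records in the next group."""
      (length,) = struct.unpack('>I', f.read(4))
      return zlib.decompress(f.read(length))

The point is only that the wrapper carries the compressed length, so a
reader that doesn't care about a group never touches the decompressor.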
>
>>> The other thing I've discovered, is that compression over reverse
>>> ancestry is *WAY* better than any of the other base selection methods
>>> (at least for our builtins.py file). Which doesn't play as nicely with
>>> having deltas based on the one previous in the file (since you will have
>>> some amount of branching).
>
> I think we should do it based on whatever gives us the best compression.
> If that means skipping some data in the stream, it's not the worst thing.
>
>>> Some comparisons:
>>>
>>> lh_child_child 280,676 (1 full=40,956, 1288 delta=239,720)
>>> lh_parent 384,440 (4 full=51,394, 1285 delta=333,046)
>>> by_size 540,134 (1 full=40,956, 1288 delta=499,178)
>>> linear 614,800 (1 full=37,897, 1288 delta=576,903)
>
> This is builtins.py?
>
> With individually gzipped records, I get 475,051.
>
> Re-bzip2ing the whole thing gives me 191,166. (Hah! eat *that*, xdelta.)
>
Well, this is also individually compressed (by xdelta3). You can
actually tell it not to compress, and instead compress "out of band"; I
didn't try that across the whole range. I can say that gzip/bzip2 of
individual xdelta hunks is worse than xdelta's own compression (probably
the ordering of xdelta's output is bad for gzip/bzip2). I would expect
bzip2 over the whole range to be better.
So the generally "comparable" number is:
lh_parent 384,440 (4 full=51,394, 1285 delta=333,046)
versus
475,051 for gzipped multi-parent diffs.
It certainly would be good to compare apples to apples. My test data is
part of the plugin (as test.knit, test_clean.knit, test_tiny...).
test_clean.knit is the one that gave the above numbers ("clean" because
I removed the extra 100 tips, which biased things). I updated your plugin
slightly to allow me to give it an explicit Knit, ran it, and got
391,567 for gzipped chunks and 159,655 for bzip2 over the whole thing.
So it is very comparable to lh_parent with xdelta (except for the
speed, which is off by a factor of around 200+:1).
> That's with my mpknit/multiparent plugin (bzr mp-regen builtins.py
> --snapshot-interval 500000 --outfile builtin)
>
>>> Now, this has each delta individually compressed, rather than compressed
>>> along the whole range (because that is how xdelta3 works).
>
>
>>> builtins.py has 4 'root' nodes, and 1285 deltas (and in my repository
>>> actually has about 100 'tip' nodes).
>
> I don't remember what mine has. It might be nice to get some numbers on
> identical repositories...
>
>>> Anyway, if you compress each node versus a child who claims it as a left
>>> hand parent (or any child if you can't find the first kind) the whole
>>> history takes up 280kB.
>
> This was against lefthand parents. I should try it against lefthand
> children.
I think you would find a benefit (children mostly add content, so the
deltas are just "remove lines 10-20"). However, I think it would mean
considerably more work for you (since you probably have to extract the
actual texts). But maybe you already do that for MP.
...
>>> And one more thing I should say. xdelta3 is pretty darn good. It is
>>> fast, works on binary data, and is fast at compression. But Mercurial's
>>> bdiff/mpatch algorithm smokes it at extraction. Especially since they
>>> already have the "composing patches" logic worked out.
>
> Hmm.
>
>>> linear on 100 revisions (1 fulltext + 98 deltas):
>>> xdelta3 235ms 194:1
>>> bdiff 79ms 123:1
>>> bdiff+composing 26ms
>>>
>>>
>>> So again, it would make sense to use xdelta3 for binary files or bundles
>>> sent across the wire. But for a repository format where we want to
>>> optimize for extraction speed, bdiff does a lot better.
>
> but, but, isn't xdelta C?
Yes, the patch numbers are slightly worse for patiencediff (5ms for a
single patch, versus 2ms for xdelta and "0ms" for bdiff), but the 'diff'
numbers meant I never bothered to go that far. 100ms versus 3-10ms per
delta means generating a full set of deltas takes a long time, and I'm
running these in a "Benchmark" test suite, so I'm shooting for tests
that run in less than 10 minutes each :).
For comparison, creating all 1300 deltas takes 2.1s with xdelta3. Yes,
that is *seconds*, not minutes. Running 'mp-regen' on the same data takes
23 *minutes*.
Xdelta is faster than bdiff at creating deltas, and it can handle binary
files. It depends a lot on the flags you use: you can get it down to
1.9s for all 1288 revisions, while bdiff+zlib is around 3.7s for the
same data (at ~2x the final size).
It is just that xdelta's extraction speed is poor versus bdiff.
Decompressing all 1289 texts takes 3.5s for all the xdelta forms (with
and without compression enabled), versus 1.3s for bdiff. If we could
"fix" that, we would have something that was universally better.
Notice that bdiff gets a 3x performance boost by composing patches
before applying them, so we would go from 235ms to <100ms for applying
~100 patches (in theory).
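To make the composing idea concrete, here is a rough, untested sketch
(this is *not* mpatch's algorithm, just the same idea applied to a
simplified copy/insert delta representation): a delta is a list of
('copy', start, end) / ('insert', data) ops against its base, and
composing rewrites a delta over B as a delta over A, so a whole chain
collapses into one delta applied to the fulltext once.

  def apply_ops(base, ops):
      """Build a text from copy/insert ops against a single base text."""
      out = []
      for op in ops:
          if op[0] == 'copy':
              out.append(base[op[1]:op[2]])
          else:
              out.append(op[1])
      return b''.join(out)

  def slice_ops(ops, start, end):
      """Return the ops describing bytes [start:end) of the text that `ops` builds."""
      result, pos = [], 0
      for op in ops:
          length = (op[2] - op[1]) if op[0] == 'copy' else len(op[1])
          lo, hi = max(start, pos), min(end, pos + length)
          if lo < hi:
              off1, off2 = lo - pos, hi - pos
              if op[0] == 'copy':
                  result.append(('copy', op[1] + off1, op[1] + off2))
              else:
                  result.append(('insert', op[1][off1:off2]))
          pos += length
          if pos >= end:
              break
      return result

  def compose(ops_ab, ops_bc):
      """Rewrite ops over B (ops_bc) as ops over A, given the A->B ops."""
      result = []
      for op in ops_bc:
          if op[0] == 'insert':
              result.append(op)
          else:
              result.extend(slice_ops(ops_ab, op[1], op[2]))
      return result

Applying a chain then becomes: compose all the deltas down to one op
list and call apply_ops against the stored fulltext once, instead of
materializing every intermediate text.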
I can say that we probably aren't paying a lot of malloc() overhead,
because you supply xdelta with a buffer to work in, rather than having
it malloc as it goes. It might just be that the VCDIFF format is
written such that it takes significantly more work to extract. (Consider
the fact that bzip2 extraction is vastly slower than gzip, and both are C.)
It certainly might be possible to exploit VCDIFF (RFC 3284) such that we
bias it for extraction speed.
Interestingly, the description of how to do it goes something like:
a. Concatenate source and target data.
b. Parse the data from left to right as in LZ'77 but make sure
that a parsed segment starts the target data.
c. Start to output when reaching target data.
And then it seems they break the file up a bit more and do this on subsets.
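In toy form (this is just my reading of that scheme, not xdelta3's
actual encoder), it looks something like:

  def toy_encode(source, target, min_match=4):
      """Naive LZ'77-style parse over source+target; emit ops only for the target part."""
      buf = source + target
      pos = len(source)            # start emitting once we reach the target data
      ops = []
      while pos < len(buf):
          best_len, best_at = 0, 0
          for start in range(pos):             # O(n^2) search, fine for a toy
              length = 0
              while (pos + length < len(buf)
                     and buf[start + length] == buf[pos + length]):
                  length += 1
              if length > best_len:
                  best_len, best_at = length, start
          if best_len >= min_match:
              ops.append(('COPY', best_len, best_at))   # may point into source or target
              pos += best_len
          else:
              ops.append(('ADD', buf[pos:pos + 1]))     # single literal byte
              pos += 1
      return ops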
Also interesting is this statement:
Decoding efficiency:
Except for secondary encoder issues, the decoding algorithm
runs in time proportionate to the size of the target file and
uses space proportionate to the maximal window size. Vcdiff
differs from more conventional compressors in that it uses only
byte-aligned data, thus avoiding bit-level operations, which
improves decoding speed at the slight cost of compression
efficiency.
So they do acknowledge that decoding efficiency is important (and avoid
bit-level operations to achieve it).
However, they also do things like use variable-length integers, where
the top bit of each byte says whether more bytes follow. Which means
rather than just reading a 32-bit number and deciding whether you need
to change its endianness, you have to keep reading bytes and checking
that bit. I don't know how often these numbers are used, so I don't know
the performance impact.
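As I read RFC 3284, the integer format is base 128, most significant
digit first, with the high bit of every byte except the last acting as
the "more follows" flag. Roughly (untested):

  def read_varint(f):
      """Read one RFC 3284-style variable-length integer from a binary stream."""
      value = 0
      while True:
          byte = f.read(1)
          if not byte:
              raise ValueError('truncated integer')
          b = byte[0]
          value = (value << 7) | (b & 0x7F)
          if not (b & 0x80):       # high bit clear: this was the last byte
              return value

  def write_varint(value):
      """Encode an integer in the same format."""
      chunks = [value & 0x7F]
      value >>= 7
      while value:
          chunks.append((value & 0x7F) | 0x80)
          value >>= 7
      return bytes(reversed(chunks))

write_varint(123456789) gives b'\xba\xef\x9a\x15', which, if I'm reading
it right, matches the worked example in the RFC.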
I also know, though, that xdelta3 is biased towards streaming data a
chunk at a time, rather than expecting that you have a giant memory
buffer to extract into. So it may be trading some performance for lower
memory consumption.
In reading through the VCDIFF description, it doesn't seem that complex,
so it might be worth examining the xdelta3 code. Though profiling C code
is a bit more difficult than "bzr --lsprof foo" :)
>
>>> Oh, and both of these smoke the hell out of patiencediff or difflib at
>>> creating deltas. To delta the last revision against the previous is:
>
> The nice thing about multiparent deltas is you only have to do a diff
> 25% of the time. As long as there's a knit delta, you can derive the
> matching_blocks from it.
>
> Aaron
Well, that presumes we've done everything as Knits to start with. And
since we are considering switching wholesale to something else, both
need to be evaluated.
Certainly we could change the knit contents to xdelta chunks without
changing anything else.
So far, from reading the VCDIFF document, it would genuinely be possible
to extend the xdelta algorithm to have multiple sources. It would take a
bit of work to write, and I would expect generating a delta to be
slower, but it should compress better.
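Conceptually (and this is just the shape of the idea, not xdelta3's
actual API), the copy window would become the concatenation of the
parent texts, with a record of where each parent starts so a COPY
address can be mapped back to (parent, offset):

  def build_copy_window(parent_texts):
      """Join the parents into one window and note where each one starts."""
      offsets, pos = [], 0
      for text in parent_texts:
          offsets.append(pos)
          pos += len(text)
      return b''.join(parent_texts), offsets

  def map_address(addr, offsets):
      """Map an address in the joined window back to (parent_index, local_offset)."""
      for index in range(len(offsets) - 1, -1, -1):
          if addr >= offsets[index]:
              return index, addr - offsets[index]
      raise ValueError('address out of range')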
Then again, lh_child_child generally performs better than lh_parent, and
I'm not sure how you would define a delta versus multiple children... I
don't know whether it would be better or not. (Multiple parents is
fairly obvious).
Anyway, I think it is worth exploring this area some more, and I'm
reading through the VCDIFF document. The compression basically boils
down to a bunch of ADD, RUN, and COPY instructions (ADD 4 "abcd" =>
abcd, RUN 4 "z" => zzzz, COPY 4 5 => copy 4 bytes starting at byte 5,
where the copy can include target data).
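A toy decoder for those three instructions (again, not the real VCDIFF
wire format; addresses index into source-plus-target-so-far, and copies
go byte by byte so they can overlap the data they are producing):

  def toy_apply(source, instructions):
      """Apply ('ADD', data) / ('RUN', size, byte) / ('COPY', size, addr) instructions."""
      buf = bytearray(source)          # addresses index source followed by new target
      for inst in instructions:
          kind = inst[0]
          if kind == 'ADD':
              buf.extend(inst[1])
          elif kind == 'RUN':
              buf.extend(inst[2] * inst[1])
          elif kind == 'COPY':
              size, addr = inst[1], inst[2]
              for i in range(size):    # byte-wise copy, so overlap works
                  buf.append(buf[addr + i])
          else:
              raise ValueError('unknown instruction %r' % (kind,))
      return bytes(buf[len(source):])  # strip the source prefix

  # toy_apply(b'hello world', [('COPY', 5, 0), ('ADD', b'!'), ('RUN', 3, b'.')])
  # -> b'hello!...'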
So I think conceptually we could batch all of this information together
and apply it. Though there is some difficulty in that it "cache
compresses" the addresses that are used: it has a couple of lookup
tables, and as addresses are used it switches between the tables to find
the minimal-sized output addresses. We could theoretically write an
xdelta3 diff generator which only used the simplest mode (exact
offsets), and see what that does to compressed size and extraction speed
(especially if we hack the decompressor to skip updating the cache
tables, etc.).
It would still be valid VCDIFF data, just hacked to avoid what may be
expensive "busybody" overhead. Obviously they felt the caching was
necessary to get good compression rates. But from what I'm reading, that
is because they expect their deltas to be very small (2 bytes here, 2
bytes there), where having 4 bytes written to explain 2 bytes of change
is expensive.
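For reference, the two tables as I understand them from RFC 3284 (a
"near" cache of the last few addresses and a "same" cache indexed by
addr mod 256*s_same), in rough, untested form; read_int and read_byte
stand in for reading the delta's integer and byte streams:

  class AddressCache(object):
      """Sketch of the RFC 3284 address caches (defaults: s_near=4, s_same=3)."""

      def __init__(self, s_near=4, s_same=3):
          self.s_near = s_near
          self.s_same = s_same
          self.near = [0] * s_near
          self.same = [0] * (s_same * 256)
          self.next_slot = 0

      def update(self, addr):
          """Every decoded address goes into both caches."""
          self.near[self.next_slot] = addr
          self.next_slot = (self.next_slot + 1) % self.s_near
          if self.s_same:
              self.same[addr % (self.s_same * 256)] = addr

      def decode(self, here, mode, read_int, read_byte):
          """Turn an encoded address into an absolute one, per the copy mode."""
          if mode == 0:                          # VCD_SELF: plain absolute address
              addr = read_int()
          elif mode == 1:                        # VCD_HERE: offset back from `here`
              addr = here - read_int()
          elif mode - 2 < self.s_near:           # near modes: offset from a cached address
              addr = self.near[mode - 2] + read_int()
          else:                                  # same modes: one byte indexes the same cache
              m = mode - (2 + self.s_near)
              addr = self.same[m * 256 + read_byte()]
          self.update(addr)
          return addr

The exact-offset experiment above would amount to only ever emitting
mode 0, and skipping update() on the decode side.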
They also use the variable-sized integers, and I think it would be
interesting to try just fixing them at 32 bits (LE or BE).
I would guess that the final output stream would be bigger, but I would
think not as big as bdiff's, and we might get a lot of performance back.
I really wish it were easy to instrument the C code (and especially to
do so without directly impacting its performance).
John
=:->