[RFC] compression-grouping records in pack files.

Aaron Bentley aaron.bentley at utoronto.ca
Thu Jun 21 18:00:46 BST 2007


John Arbash Meinel wrote:
> Aaron Bentley wrote:

>> It occurs to me that, for bundles at least, there would be advantages to
>> compressing a series of records as a group.

> And then you could bz/gzip/etc compress that whole series. Which also
> has the advantage that as you start reading the compressed stream, you
> only have to decompress to the point where you've extracted what you
> care about (delta 3); you don't have to decompress delta 4 (though a
> "naive" implementation might, and probably wouldn't lose a whole lot
> unless the chain was really long, 100s/1000s).

Right.
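
To make that concrete, here's a rough sketch of the kind of thing I
have in mind.  It's illustrative only: plain zlib with length-prefixed
records, not whatever framing the container format actually ends up
using:

    import struct
    import zlib

    def pack_group(records):
        # Length-prefix each record, concatenate, and compress the
        # whole series as one stream.
        raw = b''.join(struct.pack('>I', len(r)) + r for r in records)
        return zlib.compress(raw)

    def read_record(blob, index):
        # Decompress incrementally, stopping as soon as the record we
        # want is complete; everything after it stays compressed.
        d = zlib.decompressobj()
        buf = b''
        pos = seen = 0
        for i in range(0, len(blob), 4096):
            buf += d.decompress(blob[i:i + 4096])
            while len(buf) - pos >= 4:
                (length,) = struct.unpack('>I', buf[pos:pos + 4])
                if len(buf) - pos - 4 < length:
                    break
                if seen == index:
                    return buf[pos + 4:pos + 4 + length]
                seen += 1
                pos += 4 + length
        raise IndexError(index)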

> There are some nice I/O patterns, etc., to this layout. But I'm not sure
> how it fits in the "Bundle Format" that has been discussed. (Whether you
> would actually need to create a container in a container, or whether it
> would be explicitly addressed by the overall format).

I was suggesting addressing it in the overall format.  It would be
possible to do it by nesting containers, as you note, but that seems
awkward to use.

> The other thing I've discovered is that compression over reverse
> ancestry is *WAY* better than any of the other base selection methods
> (at least for our builtins.py file), which doesn't play as nicely with
> having deltas based on the one previous in the file (since you will
> have some amount of branching).

I think we should do it based on whatever gives us the best compression.
If that means skipping some data in the stream, it's not the worst thing.

> Some comparisons:
> 
>         lh_child_child   280,676 (1 full=40,956, 1288 delta=239,720)
>         lh_parent        384,440 (4 full=51,394, 1285 delta=333,046)
>         by_size          540,134 (1 full=40,956, 1288 delta=499,178)
>         linear           614,800 (1 full=37,897, 1288 delta=576,903)

This is builtins.py?

With individually gzipped records, I get 475,051 bytes.

Re-bzip2ing the whole thing gives me 191,166 bytes.  (Hah!  Eat *that*,
xdelta.)

That's with my mpknit/multiparent plugin (bzr mp-regen builtins.py
--snapshot-interval 500000 --outfile builtin).

> Now, this has each delta individually compressed, rather than compressed
> along the whole range (because that is how xdelta3 works).


> builtins.py has 4 'root' nodes, and 1285 deltas (and in my repository
> actually has about 100 'tip' nodes).

I don't remember what mine has.  It might be nice to get some numbers on
identical repositories...

> Anyway, if you compress each node versus a child that claims it as a
> left-hand parent (or any child, if you can't find the first kind), the
> whole history takes up 280kB.

This was against lefthand parents.  I should try it against lefthand
children.
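
For what it's worth, here's roughly how I read the lh_child selection,
sketched over a plain parents map rather than the real knit index (and
I'm guessing that tips fall back to fulltexts):

    def choose_bases(parents):
        # parents: revision_id -> list of parent ids, left-hand first.
        # Returns revision_id -> delta base, or None for a fulltext.
        children = {}
        for rev, ps in parents.items():
            for i, p in enumerate(ps):
                children.setdefault(p, []).append((i, rev))
        bases = {}
        for rev in parents:
            candidates = children.get(rev, [])
            lh = [c for i, c in candidates if i == 0]
            if lh:
                bases[rev] = lh[0]             # child naming us as lh parent
            elif candidates:
                bases[rev] = candidates[0][1]  # any child
            else:
                bases[rev] = None              # tip: store a fulltext
        return bases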

> Some other interesting facts...
> 4 compressed fulltexts for lh_parent fit in approximately the same space
> as 1 compressed fulltext for the other algorithms. This is because those
> nodes are from when builtins.py was small (revno <100), rather than when
> builtins.py was big (revno >2000).

I think that's due to ghosting.  I wouldn't expect to see that behavior
on new branches.  And I bet you could compress 3 of them against the
fourth *really* efficiently.

> Now, if you don't record the ancestry information in the index, you
> still have to read all 100 records... But that is actually cheaper than
> *applying* all 100 records, which depends significantly on your
> algorithm.


> We might consider a skip-delta base approach for bundle deltas, since it
> would give us better "worst case" values. But if we have to read the
> whole series anyway because of index considerations, it doesn't seem a
> net win.

For bundles, I'm still looking at it primarily as a format that you
install into your repository, which means that you have to read all of
it anyway.
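
(For reference, skip-delta base selection is trivial to state.  This is
the Subversion-style version, assuming revisions numbered from 0; it
keeps every chain at most about log2(n) applications long:)

    def skip_delta_base(n):
        # Delta revision n against n with its lowest set bit cleared;
        # revision 0 is a fulltext.
        return None if n == 0 else n & (n - 1)

    def chain(n):
        # Revisions to materialize, oldest first, to extract revision n.
        # e.g. chain(13) == [0, 8, 12, 13]
        steps = []
        while n:
            steps.append(n)
            n &= n - 1
        return [0] + steps[::-1]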

> And one more thing I should say. xdelta3 is pretty darn good: it works
> on binary data and is fast at compression. But Mercurial's bdiff/mpatch
> algorithm smokes it at extraction, especially since they already have
> the "composing patches" logic worked out.

Hmm.
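
If I understand the composing trick, the win is over naive chain
application, which looks roughly like this (line-based hunks assumed
for illustration; mpatch actually works on byte fragments, and folds
the hunks together instead of materializing every intermediate text):

    def apply_delta(lines, hunks):
        # hunks: (start, end, new_lines) triples against `lines`,
        # sorted by start and non-overlapping.  Applying back to front
        # keeps earlier offsets valid.
        lines = list(lines)
        for start, end, new_lines in reversed(hunks):
            lines[start:end] = new_lines
        return lines

    def extract_naive(fulltext, deltas):
        # Materialize every intermediate version: this is the cost
        # that composing the patches first avoids.
        lines = fulltext
        for hunks in deltas:
            lines = apply_delta(lines, hunks)
        return lines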

> linear on 100 revisions (1 fulltext + 98 deltas):
>
>   xdelta3            235ms    194:1
>   bdiff               79ms    123:1
>   bdiff+composing     26ms
> 
> 
> So again, it would make sense to use xdelta3 for binary files or bundles
> sent across the wire. But for a repository format where we want to
> optimize for extraction speed, bdiff does a lot better.

But, but, isn't xdelta C?

> Oh, and both of these smoke the hell out of patiencediff or difflib at
> creating deltas. To delta the last revision against the previous is:

The nice thing about multiparent deltas is that you only have to do a
diff 25% of the time.  As long as there's a knit delta, you can derive
the matching_blocks from it.
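
That is, assuming knit-style hunks of the form (start, end,
replacement_lines) against the parent (an illustration, not the actual
knit parsing code), the untouched stretches between hunks are exactly
the matching blocks:

    def matching_blocks_from_delta(hunks, parent_len):
        # Returns SequenceMatcher-style (parent_pos, child_pos, length)
        # triples, with the usual zero-length sentinel at the end.
        blocks = []
        a = b = 0
        for start, end, new_lines in hunks:
            if start > a:                  # unchanged run before hunk
                blocks.append((a, b, start - a))
                b += start - a
            a = end                        # skip replaced parent lines
            b += len(new_lines)            # skip inserted child lines
        if parent_len > a:                 # trailing unchanged run
            blocks.append((a, b, parent_len - a))
            b += parent_len - a
        blocks.append((parent_len, b, 0))  # sentinel
        return blocks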

Aaron