[RFC] compression-grouping records in pack files.
Aaron Bentley
aaron.bentley at utoronto.ca
Thu Jun 21 18:00:46 BST 2007
John Arbash Meinel wrote:
> Aaron Bentley wrote:
>> It occurs to me that, for bundles at least, there would be advantages to
>> compressing a series of records as a group.
> And then you could bz/gzip/etc compress that whole series. Which also
> has the advantage that as you start reading the compressed stream, you
> only have to decompress to the point that you've extracted what you care
> about (delta 3); you don't have to decompress delta 4 (though a "naive"
> implementation might, and probably wouldn't lose a whole lot unless the
> chain was really long (100s/1000s)).
Right.
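A minimal sketch of the early-stop property, assuming a hypothetical framing where each record is length-prefixed and the whole series is gzipped as one stream (the real bundle/container format would define its own framing):

```python
import gzip
import io
import struct

def pack_records(records):
    """Length-prefix each record, concatenate, and gzip the series as
    one stream.  (Hypothetical framing for illustration only.)"""
    buf = io.BytesIO()
    for rec in records:
        buf.write(struct.pack(">I", len(rec)))
        buf.write(rec)
    return gzip.compress(buf.getvalue())

def read_up_to(blob, index):
    """Inflate only as far as needed to return record `index`; records
    past the read cursor are never decompressed."""
    stream = gzip.GzipFile(fileobj=io.BytesIO(blob))
    for _ in range(index + 1):
        (length,) = struct.unpack(">I", stream.read(4))
        rec = stream.read(length)
    return rec

packed = pack_records([b"delta-%d" % i for i in range(5)])
print(read_up_to(packed, 2))  # b'delta-2'
```

Since GzipFile inflates lazily as you read, asking for record 3 never pays for record 4, which is the point John is making above.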
> There are some nice I/O patterns, etc to this layout. But I'm not sure
> how it fits in the "Bundle Format" that has been discussed. (Whether you
> would actually need to create a container in a container, or whether it
> would be explicitly addressed by the overall format).
I was suggesting addressing it in the overall format. It would be
possible to do it by nesting containers, as you note, but that seems
awkward to use.
> The other thing I've discovered, is that compression over reverse
> ancestry is *WAY* better than any of the other base selection methods
> (at least for our builtins.py file). Which doesn't play as nicely with
> having deltas based on the one previous in the file (since you will have
> some amount of branching).
I think we should do it based on whatever gives us the best compression.
If that means skipping some data in the stream, it's not the worst thing.
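"Whatever gives the best compression" can be brute-forced per record. A crude sketch, using zlib's preset-dictionary feature as a stand-in for a real delta algorithm like xdelta or bdiff (the candidate texts and helper names here are illustrative, not bzr API):

```python
import zlib

def delta_size(base, target):
    """Crude proxy for delta size: deflate `target` with `base` as a
    preset dictionary.  A real base selector would measure an actual
    xdelta/bdiff delta instead."""
    c = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                         zlib.Z_DEFAULT_STRATEGY, base)
    return len(c.compress(target) + c.flush())

def pick_base(candidates, target):
    """Choose whichever candidate text yields the smallest 'delta'."""
    return min(candidates, key=lambda b: delta_size(b, target))
```

Trying every candidate is quadratic in the worst case, which is why heuristics like "left-hand child" matter; but as a measuring stick for comparing those heuristics it is fine.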
> Some comparisons:
>
> lh_child_child 280,676 (1 full=40,956, 1288 delta=239,720)
> lh_parent 384,440 (4 full=51,394, 1285 delta=333,046)
> by_size 540,134 (1 full=40,956, 1288 delta=499,178)
> linear 614,800 (1 full=37,897, 1288 delta=576,903)
This is builtins.py?
With individually gzipped records, I get 475,051.
Re-bzip2ing the whole thing gives me 191,166. (Hah! eat *that*, xdelta.)
That's with my mpknit/multiparent plugin (bzr mp-regen builtins.py
--snapshot-interval 500000 --outfile builtin)
> Now, this has each delta individually compressed, rather than compressed
> along the whole range (because that is how xdelta3 works).
> builtins.py has 4 'root' nodes, and 1285 deltas (and in my repository
> actually has about 100 'tip' nodes).
I don't remember what mine has. It might be nice to get some numbers on
identical repositories...
> Anyway, if you compress each node versus a child who claims it as a left
> hand parent (or any child if you can't find the first kind) the whole
> history takes up 280kB.
This was against lefthand parents. I should try it against lefthand
children.
> Some other interesting facts...
> 4 compressed fulltexts for lh_parent fit in approximately the same size
> as 1 compressed fulltext for the other algorithms. This is because those
> nodes are when builtins.py was small (revno <100), rather than when
> builtins.py was big (revno >2000).
I think that's due to ghosting. I wouldn't expect to see that behavior
on new branches. And I bet you could compress 3 of them against the
fourth *really* efficiently.
> Now, if you don't record the ancestry information in the index, you
> still have to read all 100 records... But that is actually cheaper than
> *applying* all 100 records. Which depends significantly on your algorithm.
> We might consider a skip-delta base approach for bundle deltas, since it
> would give us better "worst case" values. But if we have to read the
> whole series anyway because of index considerations, it doesn't seem a
> net win.
For bundles, I'm still looking at it primarily as a format that you
install into your repository. Which means that you have to read all of
it anyway.
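For reference, the skip-delta base selection John mentions can be sketched in a few lines. This is the Subversion-style scheme (clear the lowest set bit of the revision number), which bounds chain length by the popcount of the revno rather than the revno itself:

```python
def skip_delta_base(revno):
    """Base revision for `revno` under a skip-delta scheme: clear the
    lowest set bit.  Chain length back to revision 0 is then the
    popcount of revno -- O(log n) in the worst case."""
    return revno & (revno - 1)

def chain(revno):
    """Revisions whose deltas must be applied to reconstruct `revno`."""
    out = []
    while revno:
        out.append(revno)
        revno = skip_delta_base(revno)
    return out[::-1]

print(chain(100))  # [64, 96, 100] -- 3 deltas, not 100
```

That "better worst case" is exactly the trade-off under discussion: great if you can seek to just the records you need, wasted if index considerations force a full read anyway.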
> And one more thing I should say. xdelta3 is pretty darn good. It is
> fast, works on binary data, and is fast at compression. But Mercurial's
> bdiff/mpatch algorithm smokes it at extraction. Especially since they
> already have the "composing patches" logic worked out.
Hmm.
> linear on 100 revisions (1 fulltext + 98 deltas):
> xdelta3 235ms 194:1
> bdiff 79ms 123:1
> bdiff+composing 26ms
>
>
> So again, it would make sense to use xdelta3 for binary files or bundles
> sent across the wire. But for a repository format where we want to
> optimize for extraction speed, bdiff does a lot better.
But, but, isn't xdelta C?
> Oh, and both of these smoke the hell out of patiencediff or difflib at
> creating deltas. To delta the last revision against the previous is:
The nice thing about multiparent deltas is you only have to do a diff
25% of the time. As long as there's a knit delta, you can derive the
matching_blocks from it.
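A sketch of that derivation, assuming a hypothetical line-based delta of (start, end, new_lines) ops sorted by start, each meaning old[start:end] is replaced by new_lines (real knit deltas use their own textual encoding, but carry the same information):

```python
def matching_blocks_from_delta(delta, old_len):
    """Derive difflib-style (old_idx, new_idx, size) matching blocks
    from an existing line delta, without running a diff."""
    blocks = []
    old_pos = new_pos = 0
    for start, end, new_lines in delta:
        if start > old_pos:              # unchanged run before this op
            blocks.append((old_pos, new_pos, start - old_pos))
            new_pos += start - old_pos
        old_pos = end
        new_pos += len(new_lines)
    if old_len > old_pos:                # trailing unchanged run
        blocks.append((old_pos, new_pos, old_len - old_pos))
        new_pos += old_len - old_pos
    blocks.append((old_len, new_pos, 0))  # difflib-style terminator
    return blocks

# Replace line 1 of a 4-line file with two new lines:
print(matching_blocks_from_delta([(1, 2, ["B", "B2"])], 4))
# [(0, 0, 1), (2, 3, 2), (4, 5, 0)]
```

Since the stored delta already encodes which regions changed, the matching blocks fall out in linear time; only texts with no stored delta need an actual diff.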
Aaron