Compressing packages with bzip2 instead of gzip?
psusi at cfl.rr.com
Tue Jan 17 23:40:21 GMT 2006
Paul Sladen wrote:
>I don't think you can generalise like that. One of the (many) algorithms that
>7zip selects between is bzip2, the other being basically zlib with a huge
>(out of spec) window size.
I was referring to 7zip's native compression algorithm, LZMA. With the
default maximum compression setting, which uses a 32 MB dictionary,
decompression was faster than bzip2 with its 900 KB dictionary, and the
compressed data was 20-30% smaller.
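For a rough feel of the comparison, here is a hedged sketch using Python's stdlib `bz2` and `lzma` bindings; the input is synthetic stand-in data with long-range repetition, not real package contents, so the exact ratios will differ from the figures above:

```python
import bz2
import lzma

# Synthetic stand-in for package data: text with long-range repetition.
# (Hypothetical workload; real .deb contents will behave differently.)
data = b"The quick brown fox jumps over the lazy dog. " * 4000

bz = bz2.compress(data, compresslevel=9)   # bzip2, 900 kB BWT blocks
xz = lzma.compress(data, preset=9)         # LZMA with a large dictionary

print("original:", len(data), "bzip2:", len(bz), "lzma:", len(xz))

# Both round-trip losslessly.
assert bz2.decompress(bz) == data
assert lzma.decompress(xz) == data
```

Timing the two `decompress` calls on real package payloads is the experiment the paragraph above describes.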
>The reason that 7zip does better on jump-infested i386 code is that it runs
>a pre-processor over the binary to convert relative jumps into absolute
>values (and the absolute values are more likely to repeat).
I don't believe this to be the case. Firstly, iirc, local jumps on i386
are limited to a signed 8-bit displacement, so they are only suitable for
local branches within a function and thus aren't very likely to
repeat. Secondly, I've done plenty of compression comparisons on
non-i386 executable data, and LZMA still delivers much better compression
and faster decompression than bzip2.
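For reference, the conversion Paul describes (a "BCJ"-style filter) can be sketched in a few lines. `bcj_encode` below is a hypothetical toy, not 7zip's actual filter: it rewrites the rel32 operand of x86 CALL instructions (opcode 0xE8) into an absolute target, so two call sites that target the same function become byte-identical and repeat for the compressor:

```python
import struct

def bcj_encode(code, base=0):
    """Toy BCJ-style pass: rewrite CALL rel32 operands as absolute targets."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32
            rel = struct.unpack_from('<i', out, i + 1)[0]
            absolute = (base + i + 5 + rel) & 0xFFFFFFFF
            struct.pack_into('<I', out, i + 1, absolute)
            i += 5
        else:
            i += 1
    return bytes(out)

# Two CALLs to the same target (0x100) from different sites: their rel32
# operands differ before encoding, but match afterwards.
site1 = b'\xE8' + struct.pack('<i', 0x100 - (0x10 + 5))
site2 = b'\xE8' + struct.pack('<i', 0x100 - (0x40 + 5))
code = b'\x90' * 0x10 + site1 + b'\x90' * (0x40 - 0x15) + site2
enc = bcj_encode(code)
```

Note this only helps the 32-bit CALL/JMP forms; the signed 8-bit short jumps discussed above carry no such long-range redundancy either way.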
>I think there is a case for adding support for a post-processing stage to
>dpkg/cloop (like PNG does with predictors). Secondly, if bzip2 speed is a
>problem, I suspect there is room for optimisation (refactoring, _not_
>turning on --omg-faster) and using bzip2 compression with a smaller
>windowsize (<900kB) if the difference is less than 1% in size.
This whole discussion involves using larger dictionary sizes to get
better compression. Using a smaller dictionary size doesn't have much
effect on decompression speed, but it does hurt the compressed size. If
you think you can optimize the bzip2 code to run faster, though, by all
means go for it, but I think the slowness is simply inherent in the BWT
algorithm it employs.
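The size cost of a smaller block is easy to demonstrate. A hedged sketch with stdlib `bz2`, using synthetic data whose repeats are spaced further apart than bzip2's smallest (100 kB) block size, so only the 900 kB setting can exploit them:

```python
import bz2
import random

# 150 kB of pseudo-random bytes repeated three times: the repeats lie
# outside any single 100 kB block. (Synthetic illustration only.)
chunk = random.Random(0).randbytes(150_000)
data = chunk * 3

small_block = bz2.compress(data, compresslevel=1)  # 100 kB blocks
large_block = bz2.compress(data, compresslevel=9)  # 900 kB blocks

print("100 kB blocks:", len(small_block), "900 kB blocks:", len(large_block))
```

With 100 kB blocks each block sees essentially unique random data and barely compresses; with 900 kB blocks the whole repetition fits in one BWT pass and the output shrinks dramatically, while decompression work per byte stays roughly the same.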
>(For perspective, one seek() on the CD image is going to cost about the same
>as the difference in execution speed of zlib/bzip2).
>Delta-debs on Mirrors.
>For mirror distribution, my conclusion from 12months ago was that zsync
>(pre-computed, client-size executed rsync) against the data.tar regenerated
>by dpkg-repack is the way to go.
That is good if the client still has the original deb and just wants to
update it to the new version. Most of the time they won't have the
original deb, so it would be better to send them binary diffs against
the files inside the package (which they have installed), rather than
a diff of the package itself. This would require changing the
authentication from checking the md5 sum of the package to checking the
md5 sums of the individual files inside it, which, imho, is how it
should have been done in the first place.
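A minimal sketch of that per-file check, in the spirit of the `<pkg>.md5sums` files dpkg already ships; the `verify` helper and the toy manifest below are hypothetical, not existing dpkg/APT code:

```python
import hashlib
import os
import tempfile

def md5_file(path):
    """MD5 of a file, read in chunks so large files don't load into RAM."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            h.update(block)
    return h.hexdigest()

def verify(manifest, root):
    """manifest: list of (md5_hex, relative_path); returns failing paths."""
    return [rel for expected, rel in manifest
            if md5_file(os.path.join(root, rel)) != expected]

# Toy usage against a temporary "installed" tree:
root = tempfile.mkdtemp()
with open(os.path.join(root, 'hello'), 'wb') as f:
    f.write(b'hello world\n')
manifest = [(md5_file(os.path.join(root, 'hello')), 'hello')]
```

Once the installed files are individually authenticated this way, a client can safely reconstruct the new package from diffs against them, without ever having kept the old .deb.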
>Since this would go into APT, dpkg stays clean and
>I like the idea in somebody else's email of only providing zsyncs against
>files published on CD or the last 10days worth.
>I think a *top* priority should be delta-diffs on the Package lists, but the
>dapper-does-it-48-times-a-day churn on these is massive.