Compressing packages with bzip2 instead of gzip?

Phillip Susi psusi at cfl.rr.com
Tue Jan 17 23:40:21 GMT 2006


Paul Sladen wrote:

>I don't think you can generalise like that.  One of the (many) algorithms
>that 7zip selects between is bzip2, the other being basically zlib with a
>huge (out of spec) window size.

I was referring to 7zip's native compression algorithm, LZMA.  With the
default maximum compression setting, which uses a 32 MB dictionary,
decompression was faster than bzip2 with its 900 KB dictionary, and the
compressed output was 20-30% smaller.
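
For illustration, here is a minimal sketch of that kind of comparison using
Python's bz2 and lzma modules (the input file and the 32 MB dict_size
setting are my own choices to mirror the 7zip default, not anything 7zip or
dpkg ships):

#!/usr/bin/env python3
# Hypothetical sketch of the kind of comparison described above (not the
# actual benchmark): compress the same input with bzip2 (900 kB blocks)
# and LZMA (32 MB dictionary), then time decompression of each.
import bz2
import lzma
import sys
import time

data = open(sys.argv[1], "rb").read()   # e.g. an uncompressed data.tar

# bzip2: compresslevel 9 corresponds to the 900 kB block size
bz_blob = bz2.compress(data, compresslevel=9)

# LZMA with an explicit 32 MB dictionary, roughly 7zip's "maximum" setting
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9,
            "dict_size": 32 * 1024 * 1024}]
lz_blob = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

for name, blob, decompress in (("bzip2", bz_blob, bz2.decompress),
                               ("lzma", lz_blob, lzma.decompress)):
    start = time.perf_counter()
    decompress(blob)
    print("%-5s %10d bytes, decompressed in %.3f s"
          % (name, len(blob), time.perf_counter() - start))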

>The reason that 7zip does better on jump-infested i386 code is that it runs
>a pre-processor over the binary to convert relative jumps into absolute
>values (and the absolute values are more likely to repeat).

I don't believe this to be the case.  Firstly, iirc, local jumps on i386
are limited to a signed 8-bit displacement, so they are only suitable for
local branches within a function and thus aren't very likely to repeat.
Secondly, I've done plenty of compression comparisons on non-i386
executable data, and LZMA still delivers much better compression and
faster decompression than bzip2.
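
For reference, the pre-processing Paul describes amounts to a transform
along these lines; this is a simplified, hypothetical sketch of a
call-offset filter, not 7zip's actual code:

# Simplified, hypothetical sketch of the sort of x86 pre-filter Paul
# describes (not 7zip's actual code): rewrite the 32-bit relative operand
# of E8 CALL instructions into an absolute target address, so that repeated
# calls to the same function yield repeated byte sequences for the
# compressor to find.
def call_filter(code: bytes) -> bytes:
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:                               # CALL rel32
            rel = int.from_bytes(out[i + 1:i + 5], "little", signed=True)
            target = (i + 5 + rel) & 0xFFFFFFFF          # absolute address
            out[i + 1:i + 5] = target.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return bytes(out)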

>I think there is a case for adding support for a post-processing stage to
>dpkg/cloop (like PNG does with predictors).  Secondly, if bzip2 speed is a
>problem, I suspect there is room for optimisation (refactoring, _not_
>turning on --omg-faster) and using bzip2 compression with a smaller
>windowsize (<900kB) if the difference is less than 1% in size.

This whole discussion involves using larger dictionary sizes to get
better compression.  Using a smaller dictionary size doesn't have much
effect on decompression speed, but it does hurt the compressed size.  If
you think you can optimize the bzip2 code to run faster, though, by all
means go for it, but I think the slowness is simply inherent in the BWT
algorithm it employs.
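
If anyone wants to check the block-size point, here is a quick hypothetical
sketch (bzip2's compresslevel maps directly to the BWT block size,
1 = 100 kB up to 9 = 900 kB):

# Measure how bzip2's block size affects compressed size and decompression
# time on a given input file.
import bz2
import sys
import time

data = open(sys.argv[1], "rb").read()
for level in (1, 5, 9):
    blob = bz2.compress(data, compresslevel=level)
    start = time.perf_counter()
    bz2.decompress(blob)
    print("block %d00 kB: %9d bytes, decompressed in %.3f s"
          % (level, len(blob), time.perf_counter() - start))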

>(For perspective, one seek() on the CD image is going to cost about the same
>as the difference in execution speed of zlib/bzip2).
>
>Delta-debs on Mirrors.
>
>For mirror distribution, my conclusion from 12 months ago was that zsync
>(pre-computed, client-side executed rsync) against the data.tar regenerated
>by dpkg-repack is the way to go.

That is good if the client still has the original deb and just wants to
update it to the new version.  Most of the time they won't have the
original deb, so it would be better to send them binary diffs against
the files inside the package (which they have installed), rather than
a diff on the package itself.  This would require changing the
authentication from checking the md5 sum of the package to checking the
md5 sums of the individual files inside it, which, imho, is how it
should have been done in the first place.
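
As a rough sketch of what per-file verification could look like (my own
illustration, not existing dpkg/APT behaviour), using the md5sums lists
dpkg already keeps under /var/lib/dpkg/info:

# Check each installed file against the md5sums list shipped with the
# package; these lists use paths relative to /.
import hashlib
import sys

def verify(md5sums_path):
    ok = True
    with open(md5sums_path) as listing:
        for line in listing:
            expected, relpath = line.split(None, 1)
            relpath = relpath.strip()
            try:
                with open("/" + relpath, "rb") as installed:
                    actual = hashlib.md5(installed.read()).hexdigest()
            except OSError:
                actual = None
            if actual != expected:
                ok = False
                print("mismatch:", relpath)
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify(sys.argv[1]) else 1)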

>Since this would go into APT, dpkg stays clean and 
>
>I like the idea in somebody else's email of only providing zsyncs against
>files published on CD or the last 10 days' worth.
>
>I think a *top* priority should be delta-diffs on the Package lists, but the
>dapper-does-it-48-times-a-day churn on these is massive.
>
>	-Paul



