bzip2 and gzip --rsyncable internals
Phillip Susi
psusi at cfl.rr.com
Fri Jan 20 15:54:39 GMT 2006
Paul Sladen wrote:
>
> I believe a better understanding of the algorthims can furthered by studying
> the source-code of the implentation. The following pointers are good
> starting points for working out what the bzip2 code is up to:
>
> $ grep \'9\' bzip2-1.0.3/bzip2.c
> case '9': blockSize100k = 9; break;
>
> $ grep -m1 nblockMAX bzip2-1.0.3/bzlib.c
> s->nblockMAX = 100000 * blockSize100k - 19;
>
> This explains why if you execute 'bzip2' with arguments '-vvv' you'll see
> the number 899981 come up frequently.
>
gzip does not use blocks of any kind, which is what I was mainly
refering to. As for bzip2, I believe that the BWT algorithm is block
oriented, so the blocking is for that, then all of those blocks are
encoded as one big byte stream with huffman coding aren't they?
> The question of what black-magic is in use originally came up when looking
> at how to make the LiveCDs efficiently rsyncable:
>
> $ grep -m1 -A1 RSYNC_SUM_MATCH gzip-1.3.5/deflate.c
> #define RSYNC_SUM_MATCH(sum) ((sum) % RSYNC_WIN == 0)
> /* Whether window sum matches magic value */
>
This makes no sense. You tell rsync/zsync what block size to use don't
you? Then it computes sums on fixed length blocks of that size. gzip
has no way of knowing what block size rsync will later use, and what
does the sum of 0 matter? rsync just makes sure the blocks have the same
sum, not a sum of zero doesn't it?
The only thing gzip needs to do is contain the fallout of
from small changes by restarting the encoder every now and then. Why
should it know or care about the running block checksums rsync uses?
More information about the ubuntu-devel
mailing list