bzip2 and gzip --rsyncable internals

Fri Jan 20 15:54:39 GMT 2006

Paul Sladen wrote:
> 
> I believe a better understanding of the algorthims can furthered by studying
> the source-code of the implentation.  The following pointers are good
> starting points for working out what the bzip2 code is up to:
> 
>   $ grep \'9\' bzip2-1.0.3/bzip2.c
>   case '9': blockSize100k    = 9; break;
> 
>   $ grep -m1 nblockMAX bzip2-1.0.3/bzlib.c
>   s->nblockMAX = 100000 * blockSize100k - 19;
> 
> This explains why if you execute 'bzip2' with arguments '-vvv' you'll see
> the number 899981 come up frequently.
> 

gzip does not use blocks of any kind, which is what I was mainly 
refering to.  As for bzip2, I believe that the BWT algorithm is block 
oriented, so the blocking is for that, then all of those blocks are 
encoded as one big byte stream with huffman coding aren't they?

> The question of what black-magic is in use originally came up when looking
> at how to make the LiveCDs efficiently rsyncable:
> 
>   $ grep -m1 -A1 RSYNC_SUM_MATCH gzip-1.3.5/deflate.c
>   #define RSYNC_SUM_MATCH(sum) ((sum) % RSYNC_WIN == 0)
>   /* Whether window sum matches magic value */
> 

This makes no sense.  You tell rsync/zsync what block size to use don't 
you?  Then it computes sums on fixed length blocks of that size.  gzip 
has no way of knowing what block size rsync will later use, and what 
does the sum of 0 matter? rsync just makes sure the blocks have the same 
sum, not a sum of zero doesn't it?

The only thing gzip needs to do is contain the fallout of
from small changes by restarting the encoder every now and then.  Why 
should it know or care about the running block checksums rsync uses?