zsync and --rsyncable - Was: Compressing data with bzip2 ...

Thu Jan 19 05:58:44 GMT 2006

On Thursday 19 January 2006 16:07, Phillip Susi wrote:
> Aigars Mahinovs wrote:
> >That would require either a huge amount of disk space or additional
> >software on the server + a significant load on the server. That is no
> >better then the rsync situation today as this is simply not acceptable
> >to most mirrors.
>
> An apache module could be written to decompress the deb on demand, cache
> it, and serve the decompressed contents out of the cache.  That would
> not require a whole lot of cpu ( assuming a good cache hit rate ) or
> disk resources on the server, just some added software.  In the end it
> would save disk space on the servers because it would allow the packages
> to be compressed with 7zip instead of gzip, and would save bandwidth
> because people would not be downloading nearly as much data.
>
> John C. McCabe-Dansted wrote:
> >This does *not* require that the uncompressed tar be stored on the server.
> > In fact the zsync manual recommends that you do not decompress the tar,
> > so that zsync can read the smaller compressed data.
>
> The zsync manual is foolish then.

Well, then the statistics at
    http://zsync.moria.org.uk/paper/ch03s05.html
are equally foolish.

> It requires the tar to be 
> uncompressed on the server

no

> because you want to zsync the uncompressed 
> data, not the compressed data. 

Zsync reads compressed data from the server, but reads and writes uncompressed 
data to/from a local file.

> You change one byte in the uncompressed 
> data and it ripples changes through the compressed stream from there to
> the end of the block, 

yes

> causing zsync to have to send a LOT more data. 

no, see http://zsync.moria.org.uk/paper/
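
The ripple itself is easy to observe. Below is a minimal sketch, assuming
Python's standard zlib module as a stand-in for gzip (both use deflate) and
made-up sample data rather than a real package. It flips one byte in the
middle of the uncompressed input and reports how far the two compressed
streams agree. It only illustrates the ripple; it says nothing about how much
data zsync then has to fetch, which is what the paper measures.

```python
import zlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that are identical in a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Made-up, mildly compressible sample data (not a real .deb or tar).
old = b"".join(b"line %06d of some package file\n" % i for i in range(20000))

# Change exactly one byte in the middle of the uncompressed data.
pos = len(old) // 2
new = old[:pos] + bytes([old[pos] ^ 0x01]) + old[pos + 1:]

c_old = zlib.compress(old, 6)
c_new = zlib.compress(new, 6)
prefix = common_prefix_len(c_old, c_new)

print("uncompressed bytes changed: 1 of", len(old))
print("compressed sizes:", len(c_old), len(c_new))
print("compressed streams identical up to byte", prefix)
```

The compressed streams share a common prefix and then diverge; how far the
divergence extends depends on where the compressor happens to end its blocks.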

> The gzip --rsyncable sets gzip to use small blocks to contain the
> ripples,

This is necessary for rsync, not zsync. My tests indicate that --rsyncable does
not improve zsync's performance (nor do the 4KB blocks that dpkg uses).

> but if you want good compression, you compress the entire file 
> in one pass, and use a better algorithm like LZMA, but then zsync won't
> work very well because of the massive change ripples.

It should be possible to get zsync-look-inside to work with LZMA, provided it
is a stream compressor like gzip rather than a block compressor like bzip2 (as
I understand it).

> >Zsync uses a clever trick which I think goes basically like this:
> >
> >$gzip_opts=detect which options were used to create data.tar.gz
> >tail -f data.tar data.tar | gzip $gzip_opts > data.tar.gz &
> >if next block b matches
> >then
> >	cat b >> data.tar
> >else
> >	read data from http://server/data.tar.gz until there is enough data in
> > the local copy of data.tar.gz to reconstruct block b.
> >fi
>
> That code contradicts your original statement by using the uncompressed
> tars, which you said the server did not need to store.  Other than that,
> I can't make much sense out of it.

data.tar and data.tar.gz are local files; http://.../data.tar.gz is a remote 
file. Hence the uncompressed file only needs to exist (be generated) on the 
local machine. There is no uncompressed remote file.

The dataflow is:
      file://.../new/data.tar <- zsync (
		file://.../old/data.tar,
		http://.../new/data.tar.zsync,
		http://.../new/data.tar.gz
	)
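
The client-side half of that dataflow can be sketched roughly as follows. This
is a simplified illustration, not zsync's actual code: the function names and
the toy block size are hypothetical, SHA-1 over every window stands in for the
rolling checksum a real tool would use, and the "look inside" mapping from
missing blocks to byte ranges of the remote data.tar.gz is left out entirely.
It shows only how per-block checksums of the new file (the role of the .zsync
file) let the client decide which blocks it already has in the old local file.

```python
import hashlib

BLOCK = 4  # toy block size for the example; real tools use far larger blocks

def block_sums(data: bytes, block: int = BLOCK) -> list:
    """Checksums of consecutive blocks of the new file (stand-in for the .zsync file)."""
    return [hashlib.sha1(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def missing_blocks(old: bytes, sums: list, block: int = BLOCK) -> list:
    """Indices of new-file blocks that cannot be found anywhere in the old file."""
    wanted = {}
    for idx, s in enumerate(sums):
        wanted.setdefault(s, []).append(idx)

    found = set()
    # Slide over every offset of the old file, so shifted data still matches.
    # A short final block of the new file never matches a full-size window here.
    for off in range(0, len(old) - block + 1):
        s = hashlib.sha1(old[off:off + block]).hexdigest()
        found.update(wanted.get(s, ()))

    return [i for i in range(len(sums)) if i not in found]

old = b"AAAABBBBCCCCDDDD"
new = b"AAAAXXXXCCCCDDDD"
print(missing_blocks(old, block_sums(new)))  # [1]
```

Only the blocks reported as missing would have to be fetched from the server;
everything else is copied from the old local file.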

If my explanation is unclear, you should probably read the full paper (linked above).

-- 
John C. McCabe-Dansted
Masters Student