zsync and --rsyncable - Was: Compressing data with bzip2 ...

Phillip Susi psusi at cfl.rr.com
Fri Jan 20 03:15:51 GMT 2006


Ohhhhhhhh!  Boy, it took me a while but I finally get it.  The algorithm 
is a bit complex, but very interesting.  The problem I couldn't wrap my 
head around was mapping uncompressed blocks to compressed byte ranges, 
and how to utilize that information.  Let me see if I get this right now:

1) Use the checksums to find which uncompressed blocks need to be updated
2) Use the map to find the compressed byte range that corresponds to 
that block, and fetch it
3) Patch that data into your compressed copy
4) Decompress up to the end of that patch to get the updated 
uncompressed block, and patch that into the uncompressed file
5) Recompress the uncompressed file up to the next uncompressed block 
that needs to be updated
6) Repeat


It's step 5 that I couldn't see.  That is what allows you to reconstruct 
large tracts of the compressed stream that have changed without having 
to download them: you can regenerate them yourself, as long as you know 
the corresponding uncompressed blocks have not changed and you have all 
of the preceding uncompressed data.
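
Just to convince myself, here is a tiny self-contained Python toy of 
steps 1-6.  It is emphatically not zsync's real code, and it cheats: the 
"server" does a full flush at every block boundary, so each block's 
compressed bytes can be decoded and regenerated on their own, which a 
normal gzip stream does not give you (that missing property is exactly 
what steps 4 and 5 have to work around).  The block size, the per-block 
MD5 sums and all the names are made up for the example.

import hashlib
import zlib

BLOCK = 1 << 15   # toy uncompressed block size (32 KiB)

def compress_blocks(plain, level=9):
    # "Server side": raw deflate with a full flush at every block boundary,
    # so each block's compressed bytes are self-contained.  Returns the
    # compressed file plus the map block-index -> (start, end) byte range.
    c = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    out, ranges = b"", []
    for i in range(0, len(plain), BLOCK):
        chunk = c.compress(plain[i:i + BLOCK]) + c.flush(zlib.Z_FULL_FLUSH)
        ranges.append((len(out), len(out) + len(chunk)))
        out += chunk
    return out + c.flush(), ranges

def block_sums(plain):
    return [hashlib.md5(plain[i:i + BLOCK]).digest()
            for i in range(0, len(plain), BLOCK)]

# The "server" publishes the compressed file, per-block checksums and the map.
old_plain = (b"0123456789abcdef" * 4096) * 8                # old local copy, 16 blocks
new_plain = bytearray(old_plain)
new_plain[3 * BLOCK + 100:3 * BLOCK + 110] = b"CHANGED!!!"  # only block 3 differs
new_plain = bytes(new_plain)
server_gz, ranges = compress_blocks(new_plain)
sums = block_sums(new_plain)

# The client keeps whatever the checksums say is unchanged and fetches only
# the compressed byte ranges of the changed blocks (steps 1-4).
plain, fetched = bytearray(), 0
for i, want in enumerate(sums):
    local = old_plain[i * BLOCK:(i + 1) * BLOCK]
    if hashlib.md5(local).digest() == want:        # step 1: block is unchanged
        plain += local
    else:
        start, end = ranges[i]                     # step 2: map to a compressed range
        piece = server_gz[start:end]               # stand-in for an HTTP range request
        fetched += len(piece)
        plain += zlib.decompressobj(-zlib.MAX_WBITS).decompress(piece)  # steps 3-4

# Step 5: regenerate the compressed copy locally instead of downloading it.
rebuilt_gz, _ = compress_blocks(bytes(plain))
assert bytes(plain) == new_plain and rebuilt_gz == server_gz
print("downloaded %d of %d compressed bytes" % (fetched, len(server_gz)))

The interesting part is how little of the compressed file actually has 
to be fetched when only one block changed.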

Wow, very cool algorithm.  Yes, it should be possible to do this with 
LZMA in theory, but I have a feeling it will be rather hard to code it 
in such a way that it isn't horribly slow.  This is because you do not 
want to have to actually recompress from the start of the file in step 
5; rather, you want to be able to resume the compression algorithm from 
the byte just before the patched block.  That sounds like it might be 
rather hard to do.
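
For what it's worth, zlib at least lets you snapshot the deflate 
compressor's state and resume from it later, which is the kind of thing 
step 5 needs if you don't want to rewind to offset zero every time.  A 
quick illustration in Python (deflate rather than LZMA; the data, the 
cut point and the names are all invented for the example):

import zlib

data = bytes(range(256)) * 4096      # stand-in for an uncompressed data.tar
cut = 300 * 1024                     # boundary just before a patched block

c = zlib.compressobj(9)
head = c.compress(data[:cut])        # compress the unchanged prefix once
checkpoint = c.copy()                # snapshot the full compressor state here

# Later, the tail has changed, but we resume from the checkpoint instead of
# recompressing the first 300 KiB again.
new_tail = data[cut:].replace(b"\x10", b"\x11")
resumed = checkpoint.compress(new_tail) + checkpoint.flush()

# head + resumed is one complete, valid stream for the modified file.
assert zlib.decompress(head + resumed) == data[:cut] + new_tail

That only resumes my own compression pass, of course; getting bytes 
identical to the server's file would still depend on using the same 
compressor settings and probably the same zlib version.  Whether LZMA's 
state, with its much larger dictionary, can be checkpointed this cheaply 
is exactly what I would expect to be the hard part.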


John C. McCabe-Dansted wrote:

>On Thursday 19 January 2006 16:07, Phillip Susi wrote:
>  
>
>>Aigars Mahinovs wrote:
>>    
>>
>>>That would require either a huge amount of disk space or additional
>>>software on the server + a significant load on the server. That is no
>>>better than the rsync situation today, as this is simply not acceptable
>>>to most mirrors.
>>>      
>>>
>>An Apache module could be written to decompress the deb on demand, cache
>>it, and serve the decompressed contents out of the cache.  That would
>>not require a whole lot of CPU (assuming a good cache hit rate) or
>>disk resources on the server, just some added software.  In the end it
>>would save disk space on the servers because it would allow the packages
>>to be compressed with 7zip instead of gzip, and would save bandwidth
>>because people would not be downloading nearly as much data.
>>
>>John C. McCabe-Dansted wrote:
>>    
>>
>>>This does *not* require that the uncompressed tar be stored on the server.
>>>In fact the zsync manual recommends that you do not decompress the tar,
>>>so that zsync can read the smaller compressed data.
>>>      
>>>
>>The zsync manual is foolish then.
>>    
>>
>
>Well, then the statistics at
>    http://zsync.moria.org.uk/paper/ch03s05.html
>are equally foolish.
>
>  
>
>>It requires the tar to be 
>>uncompressed on the server
>>    
>>
>
>no
>
>  
>
>>because you want to zsync the uncompressed 
>>data, not the compressed data. 
>>    
>>
>
>Zsync reads compressed data from the server, but reads and writes uncompressed 
>data to/from a local file.
>
>  
>
>>You change one byte in the uncompressed 
>>data and it ripples changes through the compressed stream from there to
>>the end of the block, 
>>    
>>
>
>yes
>
>  
>
>>causing zsync to have to send a LOT more data. 
>>    
>>
>
>no, see http://zsync.moria.org.uk/paper/
>
>  
>
>>The gzip --rsyncable sets gzip to use small blocks to contain the
>>ripples,
>>    
>>
>
>This is necessary for rsync, not zsync. My tests indicate that --rsyncable does
>not help zsync's performance (nor do the 4KB blocks that dpkg uses).
>
>  
>
>>but if you want good compression, you compress the entire file 
>>in one pass, and use a better algorithm like LZMA, but then zsync won't
>>work very well because of the massive change ripples.
>>    
>>
>
>It should be possible to get zsync-look-inside to work with LZMA, if it is a 
>stream compressor like gzip rather than a block compressor like bzip2 (as I 
>understand it).
>
>  
>
>>>Zsync uses a clever trick which I think goes basically like this:
>>>
>>>$gzip_opts=detect which options were used to create data.tar.gz
>>>tail -c +1 -f data.tar | gzip $gzip_opts > data.tar.gz &
>>>if next block b matches
>>>then
>>>	cat b >> data.tar
>>>else
>>>	read data from http://server/data.tar.gz until there is enough data in
>>>the local copy of data.tar.gz to reconstruct block b.
>>>fi
>>>      
>>>
>>That code contradicts your original statement by using the uncompressed
>>tars, which you said the server did not need to store.  Other than that,
>>I can't make much sense out of it.
>>    
>>
>
>data.tar and data.tar.gz are local files; http://.../data.tar.gz is a remote
>file. Hence we only need to generate an uncompressed file on the local
>machine. There is no uncompressed remote file.
>
>The dataflow is:
>      file://.../new/data.tar <- zsync (
>		file://.../old/data.tar,
>		http://.../new/data.tar.zsync,
>		http://.../new/data.tar.gz
>	)
>
>If my explanation is unclear you should probably read the full paper (above).
>
>  
>



