[MERGE] tuned_gzip.bytes_to_gzip

John Arbash Meinel john at arbash-meinel.com
Fri Sep 21 20:04:43 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2007-09-17 at 17:34 +1000, Ian Clatworthy wrote:
> 

My biggest comment is that we should just have a smoke test. Just
compress a bytestring and then use GzipFile to decompress it and make
sure you get it back.
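Something like this would do (a Python 3 / bytes sketch; this bytes_to_gzip is my own minimal reconstruction of the raw-deflate-plus-gzip-framing approach, not the patch's actual code):

```python
import gzip
import io
import struct
import zlib

def bytes_to_gzip(data, level=zlib.Z_DEFAULT_COMPRESSION,
                  method=zlib.DEFLATED, width=-zlib.MAX_WBITS,
                  mem=zlib.DEF_MEM_LEVEL):
    """Sketch: raw deflate output wrapped in a gzip header and trailer."""
    # Minimal gzip header: magic, CM=deflate, no flags, mtime 0,
    # no extra flags, OS unknown (0xff).
    header = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff'
    # A negative window size gives a raw deflate stream (no zlib framing).
    compressor = zlib.compressobj(level, method, width, mem)
    body = compressor.compress(data) + compressor.flush()
    # Trailer: CRC32 then uncompressed size, both little-endian 32-bit.
    trailer = struct.pack('<LL', zlib.crc32(data) & 0xffffffff,
                          len(data) & 0xffffffff)
    return header + body + trailer

def test_roundtrip():
    data = b'version foo 2 abc123\nline one\nline two\nend foo\n'
    compressed = bytes_to_gzip(data)
    # Decompress with the stock GzipFile and check we get the input back.
    out = gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()
    assert out == data

test_roundtrip()
```

The point is just the roundtrip through the stock GzipFile: any mistake in the header, body, or trailer shows up immediately.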

I'm a little concerned that you allow the caller to specify a lot of
parameters, any of which *might* break the gzip stream. I *think* you
are doing it to avoid global variable name lookups. If so, I would
rather see them all be private parameters: users can still change them
if they really want to, but it is clear that they aren't really intended
to be used that way.
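For what it's worth, the usual CPython idiom for that is binding the default at def time and marking it with a leading underscore. A hypothetical sketch (not the patch's code), using crc32 as the example:

```python
import timeit
import zlib

def crc_global(data):
    # Looks up ``zlib`` and then ``crc32`` on every single call.
    return zlib.crc32(data)

def crc_default(data, _crc32=zlib.crc32):
    # The default is bound once, at def time, so each call does only a
    # fast local lookup; the leading underscore signals it is not meant
    # to be overridden by callers.
    return _crc32(data)

payload = b'x' * 64
assert crc_global(payload) == crc_default(payload)

# Compare for yourself; the bound default is usually slightly faster:
print(timeit.timeit(lambda: crc_global(payload), number=100000))
print(timeit.timeit(lambda: crc_default(payload), number=100000))
```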


> 
>> Some other comments below.
>>
>>> +        bytes = (''.join(chain(
>>>              ["version %s %d %s\n" % (version_id,
>>>                                       len(lines),
>>>                                       digest)],
>>>              lines,
>>> -            ["end %s\n" % version_id]))
>>> -        data_file.close()
>>> -        length= sio.tell()
>>> -
>>> -        sio.seek(0)
>>> -        return length, sio
>>> +            ["end %s\n" % version_id])))
>> Is the enclosing () needed around bytes?
> 
> mm, strictly no, but it reads more clearly to me.
> 
>>> +def bytes_to_gzip(bytes, factory=zlib.compressobj,
>>> +    level=zlib.Z_DEFAULT_COMPRESSION, method=zlib.DEFLATED,
>>> +    width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL,
>>> +    crc32=zlib.crc32):
>> A comment or two separate to your patch ...
>>
>> 1. John might have some figures on various compression levels, their
>> performance and space tradeoffs. IIUIC, Z_DEFAULT_COMPRESSION is -1
>> which maps to 6 on a scale of 0-9, 9 being the slowest and most
>> compression. Less compression is necessarily faster (overall) of course.
> 
> Actually it's a U curve. No compression = lots of disk. High compression
> = lots of CPU.
> 
> -1 is 6 *currently*, but it's defined as being basically the sweet spot
> for most users most of the time, so using Z_DEFAULT_COMPRESSION is
> almost invariably a good idea.

I would also stick with Z_DEFAULT_COMPRESSION. People have spent a heck
of a lot more time than I have trying to find a sweet spot over a large
variety of data.

The biggest reason this came up is that GzipFile defaults to 9 rather
than 6, for some really weird reason.
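A quick way to eyeball the tradeoff on your own data (the sizes depend entirely on the input, so I'm deliberately not quoting numbers):

```python
import zlib

# Highly repetitive sample data; real knit data will behave differently.
data = b'the quick brown fox jumps over the lazy dog\n' * 1000

for level in (1, 6, 9):
    print('level %d -> %d bytes' % (level, len(zlib.compress(data, level))))

# Z_DEFAULT_COMPRESSION is -1, which zlib currently maps to level 6.
print('default -> %d bytes'
      % len(zlib.compress(data, zlib.Z_DEFAULT_COMPRESSION)))
```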


> 
>> 2. crc32=zlib.adler32 might be slightly quicker as well. Changing this
>> looks more complex than just changing the compression level, though,
>> because I'm guessing that's a different file format (unless a flag is
>> used to indicate the type of crc?). Our decompress code certainly looks
>> hard-coded to crc32 (and fair enough, given its goal of being optimised).
> 
> The crc needs to stay the same; it's wire-level compatibility, and
> adler32 is not byte-order safe IIRC (we had a problem with dirstate with
> those two in the past).
> 
> -Rob
> 

Agreed, the gzip stream is specifically crc32, so we need to keep using it.
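For anyone curious: the crc32 is baked into the last eight bytes of every gzip member (per RFC 1952), which is why adler32 can't be swapped in without producing streams every other gzip reader rejects. A small Python 3 check:

```python
import gzip
import io
import struct
import zlib

data = b'wire-level compatibility matters\n'
buf = io.BytesIO()
with gzip.GzipFile(mode='wb', fileobj=buf) as f:
    f.write(data)
stream = buf.getvalue()

# RFC 1952: a gzip member ends with CRC32 and ISIZE, both little-endian
# 32-bit. An adler32 value here would fail every reader's CRC check.
crc, isize = struct.unpack('<LL', stream[-8:])
assert crc == zlib.crc32(data) & 0xffffffff
assert isize == len(data)
```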

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG9BXLJdeBCYSNAAMRAsPIAJ9nOt8NGo4BXtx6TndlI5RLjTOrbgCeIaC9
TOwafxi8aolgi5G2GTqjvQQ=
=K49B
-----END PGP SIGNATURE-----


