brisbane:CHKMap.iteritems() tweaks

John Arbash Meinel john at arbash-meinel.com
Wed Mar 25 02:04:19 GMT 2009


Robert Collins wrote:
> On Tue, 2009-03-24 at 17:37 -0500, John Arbash Meinel wrote:
>>
>> git doesn't have to zlib.decompress() all of the texts that aren't
>> referenced in its delta chain. Having a text in the middle of a group
>> causes us to have to at least decompress all the previous bytes,
>> whether
>> they are specifically referenced or not.
> 
> That's equivalent to having the prior texts in the delta chain; we
> decompress the chain, but we don't process each text.
> 
>> When I was exploring breaking at file_id boundaries (always, or at
>> least
>> more often) it caused the total compressed size to go up by a sizeable
>> amount. (I assume from losing cross-file compression.) Though perhaps
>> it
>> was something like group overhead from all of the empty directory
>> texts
>> not being shared in a group, or something weird like that.
> 
> We may need to investigate further :).

So in a maximally packed repository, the time spent for "ls" is in
zlib.decompress(), not apply_delta. I would certainly be willing to add
some instrumentation for this. But with 'pack recent' we read 12 blocks,
versus 300 blocks, both covering the same 720 chk pages. And if you
imagine decompressing 10 4MiB buffers rather than decompressing 300
pages to the middle 2MiB of their blocks, that is 40MiB decompressed
rather than 600MiB.
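
To make that arithmetic explicit, here is a throwaway sketch; the block
count, group size, and average depth are just the rough figures above,
not measured constants:

MiB = 1024 * 1024

# Freshly packed layout: ~10 groups touched, each inflated in full.
packed = 10 * (4 * MiB)        # ~40MiB decompressed

# Scattered layout: ~300 groups touched, each inflated only as far as
# the wanted page, on average to the middle (~2MiB) of the block.
scattered = 300 * (2 * MiB)    # ~600MiB decompressed

# scattered / packed == 15, i.e. ~15x more bytes pushed through zlib.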

> 
>> So for "decompressing the content" speed, we are talking 2.5s => 1.1s.
>> This is compared with the 5s we spend in "get_build_details" pulling
>> information out of the .tix.
>>
>> We could shrink the standard group size, or we could try shrinking
>> the multi-file-id group size a bit more (at 1MiB the total compressed
>> size was the same; at <512KiB the total size started increasing). I
>> would assume that tuning these is mostly data dependent.
> 
> So would I.
> 
>> It might be an answer for getting the size back, without paying the
>> lzma
>> overhead for all access.
> 
> True, OTOH 'log' and 'annotate' access all the old history, so we will
> still commonly pay for it.
> 
> -Rob

That is true, but as they are accessing the old content, they are *also*
accessing the new content. So we aren't paying to decompress up to the
2MiB point just to access 1k of text; we are decompressing the full 4MiB
to access all of the text in the group.

Put another way, when accessed in *bulk*, groupcompress is quite fast.
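
As a rough illustration of why (this is a toy stand-in, not bzrlib's
actual GroupCompressBlock API; extract() and the (offset, length) index
are made up for the example):

import zlib

class FakeGroup(object):
    """Toy stand-in for a groupcompress block: one zlib stream holding
    many concatenated texts, addressed by (offset, length)."""

    def __init__(self, compressed, index):
        self._compressed = compressed
        self._index = index      # key -> (offset, length) in plain bytes
        self._plain = None

    def extract(self, key):
        # The expensive step is inflating the block, and it happens at
        # most once per block no matter how many texts we slice out.
        if self._plain is None:
            self._plain = zlib.decompress(self._compressed)
        offset, length = self._index[key]
        return self._plain[offset:offset + length]

Pulling one key out of the middle still costs most of an inflate;
pulling all of them costs the same inflate plus some cheap slicing,
which is why bulk access looks so much better per text.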

1500 calls to 'apply_delta' take only 21ms, compared with 109 calls to
zlib.decompressobj.decompress() costing 2.2s. (This is from doing 'bzr
co' on an LP tree.)
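
Those numbers came from profiling a checkout; something along these
lines would collect similar data (do_checkout() is just a placeholder
for whatever drives the checkout, not a real bzrlib function):

import cProfile
import pstats

cProfile.run("do_checkout()", "co.prof")   # placeholder workload
stats = pstats.Stats("co.prof")
stats.sort_stats("cumulative")
# Only show the two functions being compared here.
stats.print_stats("decompress|apply_delta")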

We can decompress the whole lp repository (repository-details), which is
10GB uncompressed, in 5m50s. I guess that works out to ~29MB/s, which
isn't amazing. However, 'time zcat foo.tar.gz' shows that zlib on that
machine can take a 25MB compressed file and expand it to 64MB in 1.0s,
which is 64MB/s, or really only ~2x faster than extracting everything.
So while we could get closer, zlib.decompress() throughput will still be
our upper bound.
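
For reference, the arithmetic behind those two rates (the inputs are
just the rough figures quoted above):

# Full extraction of the lp repository: ~10GB plain text in 5m50s.
print((10 * 1024.0) / (5 * 60 + 50))   # ~29.3 MB/s

# Raw zlib on the same machine: zcat turns 25MB compressed into 64MB
# of output in about a second.
print(64.0 / 1.0)                      # 64.0 MB/s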

John
=:->