brisbane:CHKMap.iteritems() tweaks

John Arbash Meinel john at arbash-meinel.com
Tue Mar 24 22:37:07 GMT 2009



...
>> So we save ~2s during extracting the texts time. (This is with my fix to
>> TT.create_file(string) to use f.write() rather than f.writelines())
>>
>> I'm not as convinced that this is worthwhile yet. Considering that we
>> spend 4.7s in 'get_build_details', making get_bytes_as() 2.5=>1.0s
>> doesn't seem really worth the 10% increase in repository size.
>>
>> I guess I can say "maybe", but it isn't as clear-cut as the benefit to
>> changing the chk pages.
> 
> I think we should address the penalty of having many texts in the group
> rather than splitting the groups up. git has a single pack when fully
> packed with the new stuff at the front, and we should be getting similar
> locality of reference.
> 
> I don't think 10% repository size is worth it.
> 
> -Rob

git doesn't have to zlib.decompress() all of the texts that aren't
referenced in its delta chain. Having a text in the middle of a group
causes us to have to at least decompress all the previous bytes, whether
they are specifically referenced or not.
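A minimal sketch of that cost, assuming a group is simply the texts
concatenated and compressed as one zlib stream. The helper names and the
index layout are hypothetical and are not bzr's groupcompress format; the
point is only that reaching a text in the middle means inflating every
byte in front of it.

```python
import zlib


def build_group(texts):
    """Compress (key, bytes) pairs as one zlib stream.

    Returns (compressed, index), where index maps each key to its
    (start, end) offsets in the *uncompressed* stream.
    Hypothetical layout, for illustration only.
    """
    index = {}
    parts = []
    offset = 0
    for key, data in texts:
        index[key] = (offset, offset + len(data))
        parts.append(data)
        offset += len(data)
    return zlib.compress(b"".join(parts)), index


def extract(compressed, index, key):
    """Return (text, bytes_inflated) for one key.

    zlib has no random access, so we inflate from the start of the
    stream and stop as soon as we have reached the end of the text.
    """
    start, end = index[key]
    decomp = zlib.decompressobj()
    out = bytearray()
    data = compressed
    while len(out) < end:
        # max_length caps the output so we never inflate past `end`.
        piece = decomp.decompress(data, end - len(out))
        if not piece:
            break
        out += piece
        data = decomp.unconsumed_tail
    return bytes(out[start:end]), len(out)


if __name__ == "__main__":
    group, idx = build_group([
        ("first", b"x" * 1000),
        ("middle", b"hello" * 10),
        ("last", b"z" * 500),
    ])
    text, inflated = extract(group, idx, "middle")
    print(len(text), "bytes wanted,", inflated, "bytes inflated")
```

The 50-byte "middle" text costs 1050 bytes of inflation here, because the
1000 bytes of "first" sit in front of it in the same stream.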

When I was exploring breaking at file_id boundaries (always, or at least
more often), the total compressed size went up by a sizeable amount.
(I assume from losing cross-file compression.) Though perhaps it was
something like group overhead from all of the empty directory texts
not being shared in a group, or something weird like that.

So for "decompressing the content" speed, we are talking 2.5s => 1.1s.
This is compared with the 5s we spend in "get_build_details" pulling
information out of the .tix.

We could shrink the standard group size, or we could try shrinking the
multi-file-id group size a bit more (at 1MiB the total compressed size
was the same; below 512KiB the total size started increasing). I would
assume that tuning these is mostly data-dependent.
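One way to probe that trade-off on a given data set is to pack the same
texts under different group-size caps and compare the totals. The sketch
below assumes plain zlib over concatenated texts, which is only a
stand-in for groupcompress (it has no delta step), and
`total_compressed_size` is a hypothetical helper, not a bzr function.

```python
import zlib


def total_compressed_size(texts, max_group_bytes):
    """Sum of compressed sizes when texts are packed into groups.

    A new group is started whenever adding the next text would push the
    current group past max_group_bytes (uncompressed). A single text
    larger than the cap still gets a group of its own.
    """
    total = 0
    group = []
    group_len = 0
    for data in texts:
        if group and group_len + len(data) > max_group_bytes:
            total += len(zlib.compress(b"".join(group)))
            group, group_len = [], 0
        group.append(data)
        group_len += len(data)
    if group:
        total += len(zlib.compress(b"".join(group)))
    return total


if __name__ == "__main__":
    # Synthetic, highly similar texts; real repository data will differ.
    line = b"the quick brown fox jumps over the lazy dog\n"
    texts = [line * 20 + str(i).encode() for i in range(50)]
    for cap in (1 << 20, 512 * 1024, 64 * 1024, 1000):
        print(cap, total_compressed_size(texts, cap))
```

Smaller caps mean more groups, and each group starts with an empty
compression history, so shared content has to be paid for again. Where
the size starts to climb depends on the data.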

groupcompress gives us lots of room to change how we want to lay things
out, without breaking the disk format.

As an interesting side note:

100M    launchpad-chk255big-lzma
125M    launchpad-chk255big-zlib
140M    launchpad-chk255big-zlib-with-100-new
122M    launchpad-chk255big-zlib-100-new-lzma-old

That would probably go down another 1.5MB if I used lzma on old revision
texts or signatures.

Mixing compressors this way might be an answer for getting the size
back, without paying the lzma overhead on every access.

time bzr ls -r -1:
37.6s   -lzma
4.18s   -zlib
1.88s   -zlib-100-new-lzma-old
1.88s   -zlib-with-100-new
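
A rough sketch of the mixed layout the "-zlib-100-new-lzma-old" name
suggests: recent groups stay in zlib so common reads are cheap, and older
groups are recompressed with lzma to win back space. The one-byte tag and
the function names are assumptions made for illustration, not the bzr
on-disk format, and it uses Python's standard lzma module rather than
whatever binding the experiment above used.

```python
import lzma
import zlib

# Hypothetical one-byte tags identifying the compressor of each group.
TAG_ZLIB = b"z"
TAG_LZMA = b"l"


def compress_group(data, is_recent):
    """Compress one group, picking the codec by how hot the data is."""
    if is_recent:
        # Fast to decompress; used for content that is read often.
        return TAG_ZLIB + zlib.compress(data)
    # Smaller but slower; used for rarely read history.
    return TAG_LZMA + lzma.compress(data)


def decompress_group(blob):
    """Dispatch on the tag written by compress_group."""
    tag, payload = blob[:1], blob[1:]
    if tag == TAG_ZLIB:
        return zlib.decompress(payload)
    if tag == TAG_LZMA:
        return lzma.decompress(payload)
    raise ValueError("unknown compression tag: %r" % (tag,))


if __name__ == "__main__":
    data = b"some revision text\n" * 1000
    new_blob = compress_group(data, is_recent=True)
    old_blob = compress_group(data, is_recent=False)
    assert decompress_group(new_blob) == data
    assert decompress_group(old_blob) == data
    print(len(new_blob), len(old_blob))
```

Because the reader dispatches per group, an operation that only touches
recent groups never pays the lzma decompression cost.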

John
=:->
