groupcompress extraction 10x faster

Thu Feb 19 19:10:41 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert observed a while back that gc (groupcompress) extraction was
faster than knits *if* you didn't abuse the "_group_cache". I've been
playing with some conversions, and I noticed that getting the texts out
was pretty slow. When I looked closer, I found out why: we were using
"dict-sorted" order for texts during
get_record_stream(..., 'unordered').
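
To illustrate why request order matters so much here, the following is a
minimal sketch, not the bzrlib code. It assumes texts live in compressed
groups, that a group must be decompressed whenever it is not in a small
LRU cache, and that the names (count_group_loads, group_of, cache_size)
are hypothetical. It counts group decompressions for an order that keeps
each group's texts together versus one that hops between groups.

```python
from collections import OrderedDict


def count_group_loads(request_order, group_of, cache_size):
    """Count group decompressions under an LRU cache of `cache_size` groups."""
    cache = OrderedDict()  # group id -> True, oldest entry first
    loads = 0
    for key in request_order:
        group = group_of[key]
        if group in cache:
            cache.move_to_end(group)  # mark as most recently used
        else:
            loads += 1  # cache miss: the group must be decompressed again
            cache[group] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used group
    return loads


# Hypothetical layout: 4 groups holding 3 texts each, keyed by (group, index).
group_of = {(g, i): g for g in range(4) for i in range(3)}

# Order that keeps all texts of a group adjacent.
grouped = sorted(group_of)
# Order that jumps to a different group on every request.
interleaved = sorted(group_of, key=lambda k: (k[1], k[0]))

print(count_group_loads(grouped, group_of, cache_size=2))
print(count_group_loads(interleaved, group_of, cache_size=2))
print(count_group_loads(interleaved, group_of, cache_size=4))
```

In this toy model the group-hopping order misses the cache on every
request unless the cache is large enough to hold every group, which is
the same trade-off as raising the cache size versus fixing the order.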

I went ahead and committed the attached patch, which makes a huge
difference on my mysql test repository. Specifically, "time bzr
repository-details" dropped from 7m10s to 35s (430s / 35s, about 12x
faster). (I was able to get the same performance by raising the
group_cache to 250MB, but obviously that uses a lot more RAM during
processing.)

Anyway, I thought Robert especially would like to know about this
change. I'm also probably going to play around with a "gc-optimal"
ordering, just to see what happens.

As it is now, because of the semi-random ordering, gc actually ends up a
net loss for my test of "mysql-5.1 -r525".

gc+chk255:
Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       826 KiB   2%     1043
Inventories:    31012 KiB   3%     15328 KiB  38%    12090
Texts:         882272 KiB  96%     23565 KiB  59%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         917275 KiB 100%     39720 KiB 100%    20359

chk255:
Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%      1228 KiB   4%     1043
Inventories:    31012 KiB   3%     15987 KiB  56%    12090
Texts:         882272 KiB  96%     11174 KiB  39%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         917275 KiB 100%     28390 KiB 100%    20359

and versus the original knit repo:
Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3949 KiB   0%      1201 KiB   8%     1043
Inventories:   842115 KiB  48%      1840 KiB  12%     1043
Texts:         882272 KiB  51%     11174 KiB  78%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        1728337 KiB 100%     14216 KiB 100%     9312

Something still isn't right with the gc+chk255 repository: we certainly
should be getting *some* compression for inventories beyond what the
chk255 repository gets. I mostly just wanted to point out that without
proper ordering, the gc-compressed texts actually double in size.
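
As a quick check of those two comparisons, here is a small sketch that
recomputes the ratios from the compressed sizes in the tables above. The
dictionary layout and the ratio() helper are just illustrative names.

```python
# Compressed sizes in KiB, copied from the three tables above.
compressed = {
    "gc+chk255": {"Revisions": 826, "Inventories": 15328, "Texts": 23565},
    "chk255": {"Revisions": 1228, "Inventories": 15987, "Texts": 11174},
    "knit": {"Revisions": 1201, "Inventories": 1840, "Texts": 11174},
}


def ratio(kind, repo, baseline):
    """Return compressed size of `kind` in `repo` relative to `baseline`."""
    return compressed[repo][kind] / compressed[baseline][kind]


texts_ratio = ratio("Texts", "gc+chk255", "chk255")
inv_ratio = ratio("Inventories", "gc+chk255", "chk255")

print(f"texts, gc+chk255 vs chk255:       {texts_ratio:.2f}x")
print(f"inventories, gc+chk255 vs chk255: {inv_ratio:.2f}x")
```

The texts come out at a bit over 2x the chk255 size, while the
inventories are only about 4% smaller.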

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmdrrEACgkQJdeBCYSNAANJkgCeKo3CmyWVtZXP5e0cxh1SACWH
aN8AoNaWFNVdMsifQTpWFMCRV7vkwYWl
=C3EG
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: gc_sorting.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090219/6526c9de/attachment.diff