Best case extraction speed of Groupcompress

John Arbash Meinel john at arbash-meinel.com
Fri Mar 27 17:52:38 GMT 2009



One of my personal deciding factors for wanting to use groupcompress as
the compression algorithm was the theoretical decompression speed it can
reach when extracting lots of fulltexts from the same block.

I wanted to see what this really amounted to. So I wrote this little loop:
  TIMEIT -s "from bzrlib import branch" \
         -s "b = branch.Branch.open('a/branch')" \
         -s "b.lock_read(); t = b.repository.texts" \
         -s "keys = t.keys()" \
    "total = 0  # running count of decompressed bytes" \
    "for record in t.get_record_stream(keys, 'unordered', True):" \
    "  total += len(record.get_bytes_as('fulltext'))" \
    "  record._manager = record._bytes = None # [1]" \
    "print total"


This effectively isolates groupcompress, and asks it to decompress every
text in the repository, in whatever order it finds most efficient. (And
then print out how many total bytes were decompressed.)

This still has a call into index code, for the 'get_record_stream', but
most of the index should already be cached in memory.

Anyway, running this on the bzr.dev conversion, I see:
  2,257,171,595 bytes (2.1 GiB)
10 loops, best of 3: 5.19 sec per loop

So on my laptop we can extract every text for all of bzr.dev in about
5.2 seconds, a rate of approximately 415 MiB/s.

For texts with a bit less history, I was seeing:
  903,447,097 bytes (862 MiB)
10 loops, best of 3: 1.51 sec per loop
That works out to 570.6 MiB/s.
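
The rates quoted here are just total bytes divided by the best loop
time, expressed in MiB (2**20 bytes). A quick sketch of that arithmetic
in Python 3, using the two measurements from this mail (the helper name
mib_per_sec is made up for illustration, it is not part of bzrlib):

```python
def mib_per_sec(total_bytes, seconds):
    """Return throughput in MiB/s (1 MiB = 2**20 bytes)."""
    return total_bytes / float(seconds) / 2**20


# (label, bytes extracted, best time per loop in seconds), as reported above
runs = [
    ("bzr.dev, full history", 2257171595, 5.19),
    ("texts with less history", 903447097, 1.51),
]

for label, nbytes, secs in runs:
    print("%-24s %7.1f MiB/s" % (label, mib_per_sec(nbytes, secs)))
```

Dividing by 10**6 instead of 2**20 would give MB/s, which makes the
same runs look roughly 5% faster, so it matters which unit is meant.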

So as long as we play nicely with groups, and lay them out in
appropriate ways, we can expect to see *very* nice 'give me the content'
speed.

John
=:->



[1] This is currently necessary because 'get_record_stream()' creates a
refcycle between the record and its manager, and TIMEIT disables 'gc',
so these refcycles never get collected. Every extracted object then
stays in memory, and the machine hits swap, real hard and real fast.
What surprised me was that I got through 3 or 4 extractions before my
mouse pointer started to chug...
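
To illustrate the footnote: with the cycle collector off, CPython frees
an object only when its reference count reaches zero, so anything caught
in a cycle stays alive until the cycle is broken by hand. A minimal
sketch in Python 3, assuming CPython's refcounting; Manager and Record
here are hypothetical stand-ins, not the real bzrlib classes:

```python
import gc
import weakref


class Manager(object):
    """Hypothetical stand-in for the object that owns a block of records."""
    def __init__(self):
        self.records = []


class Record(object):
    """Hypothetical stand-in for a record that points back at its manager."""
    def __init__(self, manager):
        self._manager = manager
        manager.records.append(self)  # manager -> record -> manager: a cycle


def make_record():
    return Record(Manager())


gc.disable()  # mimic timeit, which turns the cycle collector off while timing
try:
    # Case 1: drop the record without breaking the cycle.
    rec = make_record()
    leaked = weakref.ref(rec)
    del rec
    cycle_survives = leaked() is not None  # only the cycle collector could free it

    # Case 2: break the cycle by hand first, as the timing loop does.
    rec = make_record()
    freed = weakref.ref(rec)
    rec._manager = None  # manager's refcount hits zero, releasing its list
    del rec
    cycle_broken_freed = freed() is None  # plain refcounting reclaimed it
finally:
    gc.enable()
    gc.collect()  # clean up the cycle left over from case 1

print("unbroken cycle still alive with gc off:", cycle_survives)
print("record freed once the cycle is broken: ", cycle_broken_freed)
```

The same effect, multiplied by every text in a repository, is what keeps
the extracted fulltexts in memory when the references are not cleared.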


