Some benchmark results from brisbane-core

Thu Mar 12 22:31:06 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> Last night, I converted the first 10K revisions of Python 3.0 to
> various formats we're evaluating in the brisbane-core branch.
> I used the latest committed version of brisbane-core and groupcompress
> and built the C extensions for both. I also used the latest
> fastimport and usertest plugins. The only tweak was John's
> proposed change to xml8.py to copy inventory entries less often,
> i.e. labels are included in the groupcompress data and zlib is used,
> not lzma.
> 
> Here are the fast-import times:
> 
> Import times:
>   btree           24m 34s
>   gcchk16         5m 42s
>   gcchk255        5m 58s
>   gcchk255big     5m 51s
>   gcnrr           23m 41s
>   rich-root       28m 8s
> 
> Hooray - 4 times quicker!
> 
> Attached are the results of running the log benchmark. Here are
> the highlights of gcchk255big ("the one most likely") vs 1.9:
> 
> * Disk space (including working tree) reduced from 64.6MB to 53.1MB.
>   The .bzr directory itself shrank from 42.3MB to 25.5MB.
> 
> * log -v time reduced from 1133s to 44s.
> 
> In summary though, we still have some tuning to do as the btree
> format is still faster at many of the operations over small
> data sets.
> 
> I'm also not 100% confident about whether fast-import is doing
> the right thing in all cases. In particular, I'm surprised to
> see gcnrr taking the least amount of space. (Robert and John have
> been assisting me in making fast-import build the right per-file
> graphs in the last 2 days but I'm still validating the results.)
> 

I'm actually not surprised that gcnrr is the smallest; I would be
surprised if it weren't. If you think of CHK as a split-out inventory,
with the sizes of the root node and leaf nodes being variable, gc-nrr
can simply be thought of as a "leaf node of infinite size".

And the discussion I gave for why gc255-big is going to be smaller than
gc16 (post compression) still applies, and explains why gcnrr is even
smaller. Namely:
  InternalNode has a cost of 1 sha1 (at least 20 bytes compressed) for
  every reference. With a 2-level tree (1 root, 255 leaves), every
  change has a new sha1 in the root, and new content in the leaf.

  With a 3-level tree (gc16), you have 2 sha1 references per change,
  plus the actual content change.

  With a 1-level tree (gc255-oh-my-god-thats-big, or gc-nrr), you only
  have the new content in the leaf.

For specifics, I think on average a leaf node change takes something
like 100 bytes. Each 20-byte sha1 pointer rewritten for a change is thus
20% overhead. So a 3-deep gc16 tree has 40 bytes (40%) of overhead per
change, 255-big has only 20 bytes (20%), and gc-nrr has 0 overhead.
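
As a rough sketch of that arithmetic (the 100-byte leaf change and the
20-byte sha1 are my estimates from above, not measurements, and the
names in the snippet are made up for illustration, not anything in
bzrlib):

```python
# Back-of-the-envelope estimate only. Both constants are assumed
# figures from the discussion, not measured values.
LEAF_CHANGE_BYTES = 100  # assumed average size of one leaf node change
SHA1_REF_BYTES = 20      # assumed cost of one sha1 reference, compressed

# Number of internal levels above the leaf, i.e. how many sha1
# references get rewritten when a single leaf entry changes.
LAYOUTS = {
    "gc16": 2,       # 3-level tree
    "gc255-big": 1,  # 2-level tree (1 root, 255 leaves)
    "gc-nrr": 0,     # 1-level tree, everything in one "leaf"
}


def overhead_per_change(internal_levels):
    """Bytes of sha1 references rewritten for one leaf change."""
    return internal_levels * SHA1_REF_BYTES


def report():
    """Return (layout, overhead bytes, overhead percent) per layout."""
    rows = []
    for name, levels in LAYOUTS.items():
        extra = overhead_per_change(levels)
        rows.append((name, extra, 100.0 * extra / LEAF_CHANGE_BYTES))
    return rows


if __name__ == "__main__":
    for name, extra, pct in report():
        print("%-10s %3d bytes  %3.0f%% overhead" % (name, extra, pct))
```

The point is just that the overhead grows linearly with tree depth, so
a flatter layout always wins on raw size under these assumptions.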

> Even so, it's very clear that import times, disk space and log -v
> will all be dramatically better under whatever new format we
> go with.
> 
> Ian C.
> 

For stuff like "bzr log -r -10..-1", setting _NO_LABELS actually has a
big effect. About 300ms is spent parsing the label header in the packed
content. I don't have a great answer other than getting rid of labels.
(We could parse them more lazily, or only on transmission, but by then
we've lost the real benefits.)

Thanks for running these, though. It is nice to see where we need to be
focusing our effort.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkm5jSoACgkQJdeBCYSNAAMxqwCfZOykqB8nlq54bW3labfrLz7t
QB8An0CoPzv4WiOR1619xNHdiTVsjNqa
=vzDO
-----END PGP SIGNATURE-----