BTree + CHK Inefficiencies
John Arbash Meinel
john at arbash-meinel.com
Tue Aug 3 17:21:16 BST 2010
I just wanted to summarize what I found while debugging the memory
consumption for the gcc-linaro checkout.
The primary memory consumption was in the chk BTreeGraphIndex, with
something like 160MB allocated to it. The reasons are:
1) chk pages are hash keyed, so the keys distribute roughly evenly
across the entire address space.
2) There are 700k total entries in the graph. This isn't too surprising;
we expect roughly 1 entry per change over time.
3) In a single inventory, there are ~50k chk pages for a tree with 70k
inventory entries. This is because it goes past the threshold, and we
now have a 3-deep chk tree (1 root with 255-way fan-out, 255
intermediate nodes fanning out to ~65536, and N leaves). Each leaf only
has 1-2 entries in it.
4) The BTreeGraphIndex is able to put (on average) 100 keys per 4k page,
giving us 700k/100 = 7k index pages.
5) Because of (1), reading 2 keys is likely to read in 2 pages, which is
200 keys (we only care about 2 out of those 200). I don't know the
specific probability rule (probably binomial), but basically the chance
any given key is already cached is (num_already_read/total_pages). I
think this means that at 7k keys we should have read about half the
index, and at 50k keys we had read 99.9% of the index. (A quick
back-of-the-envelope check is sketched just after this list.)
6) This gives us a pretty poor ratio of total keys read/cached to the
number actually needed.
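For what it's worth, here is a quick model of (5), assuming each lookup
lands on a uniformly random index page (which is roughly what the hash
keying in (1) gives us):

    # Expected fraction of index pages touched after K key lookups across
    # P pages, if each lookup hits a uniformly random page.
    P = 7000  # ~7k index pages, from (4)
    for K in (7000, 50000):
        fraction = 1 - (1 - 1.0 / P) ** K
        print("%d keys read -> ~%.1f%% of the index paged in" % (K, 100 * fraction))
    # 7000 keys read -> ~63.2% of the index paged in
    # 50000 keys read -> ~99.9% of the index paged in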
Some proposals to fix it:
a) My alternative chk_index outlined in
doc/developers/improved_chk_index.txt.
This gets rid of the explicitly paged and compressed content, switching
to binary data that is already well compacted. It could also use some
collision-avoidance properties to shrink the index even further. Without
that, I would estimate the size on disk is actually going to be about
the same, but if we can reduce the number of bytes processed, that
would still be a win.
b) Extend my 'split-pack' work. I have some code that changes how
autopack and 'bzr pack' work, so that they cluster recent content first
and old content second. An extension to that is to actually put that
content in 2 separate pack files. This would mean that when accessing
recent content, it is mostly those 50k keys that you actually care
about, and not the 650k keys you don't.
This could actually benefit us in a lot of ways (you would rarely have
to page in any of the old pack's contents).
In real-world repositories, the data will tend towards this anyway, since
new content continually gets added and old content will sit in the old
pack file. Except:
i) Initial creation grabs everything at once.
ii) Some files won't be touched often, so they will stay in the 'old'
set, while new files may be touched many times (say 10-50) and
clutter the 'new' packs with what is really now old data.
c) If we just want to decrease memory, we could have a parameter on
BTreeGraphIndex selecting what type of _LeafNode class it uses. Right
now we are averaging 234 bytes/entry across all of the in-memory
structures.
i) A custom Pyrex class that uses C structs and minimal content. I
believe I worked it out to be
20 + 8 + 4 + 4 + 4 = 40 bytes / record
(a rough sketch of such a fixed-width layout follows this list).
ii) Just hold the btree content as a string, and pull out the record
you care about. Our string representation is not fixed width, but
averages around 74-76 bytes / record.
You could probably add a mini-index with 1 short integer (pages
are always going to be <64k long) defining the start of each
record, which only adds 2 bytes / record (see the mini-index
sketch after this list).
You'll also have some Python object overhead, but if you can have
1 object/page rather than the current 4/entry, we would be doing a
lot better. (That is a 400:1 ratio.)
These *might* help decrease the amount of parsing we have to do,
though (i) is repacking the content: it has to convert the hex bytes
back into binary, and the integer characters back into a C 'int'.
iii) Only cache the zlib compressed content. This is closer to 35
bytes/record, which is pretty optimal, but means that for any key we
actually want to read we still have to zlib.decompress() the leaf.
(TIMEIT says it takes 88us to decompress a 4k leaf node; if we did
this for every key, that would be pretty bad. Maybe we could do
"if I decompress 2x, then cache", as sketched below...)
There are probably some other options as well, but these are the first
to come to mind.
John
=:->