BTree + CHK Inefficiencies

John Arbash Meinel john at arbash-meinel.com
Wed Aug 4 19:55:57 BST 2010



...

> c) If we just want to decrease memory, we could have a parameter on
> BTreeGraphIndex as to what type of _LeafNode class it could use. Right
> now we are averaging 234 bytes / entry in all of the in-memory structures.
>   i) A custom Pyrex class that uses C structs and minimal content. I
>      believe I worked it out to be
> 	20 + 8 + 4 + 4 + 4 = 40 bytes / record
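
(As a concrete sketch of that fixed-width layout, expressed with
Python's struct module; the field names and widths here are
illustrative, taken from the 20 + 8 + 4 + 4 + 4 arithmetic above, not
from the actual Pyrex declaration:)

  import struct

  # 20-byte binary sha1, one 8-byte and three 4-byte unsigned ints:
  # 20 + 8 + 4 + 4 + 4 = 40 bytes per record, with no per-field
  # Python object overhead. Field meanings are hypothetical.
  RECORD = struct.Struct('>20sQIII')
  assert RECORD.size == 40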

I did go ahead and play with this. It turns out that converting the hex
bytes into binary form is very fast, while putting the keys into a dict
is surprisingly slow, so the parser got a lot faster with custom storage.
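
(The hex-to-binary step is essentially a single cheap call; a sketch,
ignoring the real key-format handling:)

  from binascii import unhexlify

  hex_key = 'sha1:' + '0123456789abcdef0123' * 2   # made-up key
  binary = unhexlify(hex_key[5:])   # 40 hex chars -> 20 raw bytes
  assert len(binary) == 20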

Short summary:
 Custom C class cut peak memory of 'ls -R' to 62% (400MB down to 250MB)
 Faster parsing, but doesn't seem to dominate 'ls -R' time.


Original code:
  112us  _leaf_node_parse_lines (bytes => list(tuples + strings))
  373us  dict(_parse()), so ~260us to take the list and turn it into a dict
New parser:
   51us  parse bytes => custom storage, about 7.3x faster

For comparison:
   90us  zlib.decompress(c_bytes) => bytes

So the old bytes => lines parser was already close to raw decompress
speed, and the new one is faster than decompression (we will start to
hit diminishing returns rapidly here).
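
(These per-page numbers are the sort of thing you get from a
timeit-style loop; a minimal sketch, with the compressed page faked so
it runs standalone:)

  import timeit, zlib

  # Fake a compressed btree leaf page so the snippet is self-contained.
  c_bytes = zlib.compress(b'\n'.join([b'key\x00value'] * 100))
  t = timeit.Timer(lambda: zlib.decompress(c_bytes))
  print(min(t.repeat(3, 10000)) / 10000)   # seconds per call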


Now, the benefit is still there if you actually materialize all of the
entries.
  464us sorted(dict(parse()).keys())
  380us dict(parse()).keys()
   92us custom_parse().all_keys()

(Note that we don't have to sort, because the keys are stored in memory
in sorted order, exactly as they were read from disk.)
Even converting back into items, which requires creating another set of
tuples and strings:
  491us sorted(dict.items())
  392us dict.items()
  262us custom_parse().all_items()
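
(all_items here just walks the packed block in its on-disk order,
creating the tuples as it goes; a pure-Python sketch, reusing the
hypothetical 40-byte layout from above:)

  import struct

  _RECORD = struct.Struct('>20sQIII')   # hypothetical layout, as above

  def all_items(packed):
      # Walk the packed block in (already sorted) on-disk order,
      # creating the tuple objects only as we yield them.
      for offset in range(0, len(packed), _RECORD.size):
          yield _RECORD.unpack_from(packed, offset)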

Note that the most common case, though, is that we only need 1 item out
of the 100 or so that are in a given btree page.


I then looked at lookup speed (since I was getting rid of a dict, which
is about as fast as a lookup gets in Python).

As the baseline:
  6.7us  [k for k in all_keys]
 12.0us  [(k in dict) for k in all_keys]
 23.0us  [(k in subclass_dict) for k in all_keys]
 19.1us  [(k in custom) for k in all_keys]

So we aren't as fast as a raw dict, but we are faster than a dict
subclass (even one that doesn't override __getitem__). A custom
__getitem__ that thunks over to an attribute was close to 50us.
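
(Membership here is a binary search over the sorted key block rather
than a hash probe, which is why it can't quite match a raw dict; a
pure-Python sketch of the idea, the real code being Pyrex over C
arrays:)

  import bisect

  class LeafKeys(object):
      """Keys kept in the sorted order they were read from disk."""

      def __init__(self, sorted_keys):
          self._keys = sorted_keys

      def __contains__(self, key):
          # O(log n) bisect instead of a dict's O(1) hash probe.
          idx = bisect.bisect_left(self._keys, key)
          return idx < len(self._keys) and self._keys[idx] == key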

Extraction speed is slower, because the entry objects are created
lazily, at lookup time:
  12us   [dict[k] for k in all_keys]
  28us   [subclass_dict[k] for k in all_keys]
 151us   [custom[k] for k in all_keys]

However, this is expected: the common case is not to extract all keys,
and as shown above, the time saved in the parser makes up for the
slowdown in extraction.
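
(A sketch of that lazy extraction, extending the LeafKeys idea above;
the record layout is still the hypothetical one from earlier:)

  import bisect
  import struct

  _RECORD = struct.Struct('>20sQIII')   # hypothetical 40-byte layout

  class PackedLeaf(object):
      """Entries stay as packed bytes; Python objects are only
      created when a particular entry is actually requested."""

      def __init__(self, keys, packed):
          self._keys = keys       # sorted list of binary sha1 keys
          self._packed = packed   # len(keys) * 40 bytes, same order

      def __getitem__(self, key):
          idx = bisect.bisect_left(self._keys, key)
          if idx == len(self._keys) or self._keys[idx] != key:
              raise KeyError(key)
          # The tuple and strings are built here, per lookup,
          # rather than for every record at parse time.
          return _RECORD.unpack_from(self._packed, idx * _RECORD.size)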


So what about the memory savings? Most of the memory consumption for
'bzr co' can be triggered via 'bzr ls -R' because it still has to read
all of those chk pages, etc.

bzr ls -R gcc-linaro/trunk -Dmemory
  PeakMemory 399912 KiB

wbzr ls -R gcc-linaro/trunk -Dmemory
  PeakMemory 247924 KiB
(wbzr being my in-progress copy of bzr with the custom class)

So this cuts 152MB out of peak memory (peak is 62% of the original,
almost half). I did time the two runs, but the difference is within
noise margins at this point.

Note that the peak memory numbers get even better on 64-bit machines.
The custom class uses fixed-size data types, identical on 32-bit and
64-bit builds, so we should always scale at ~43 bytes/record (vs the
234 bytes/record I measured before, and ~400+ for 64-bit machines).

I'm still tweaking it a bit, but the memory numbers are definitely there.

John
=:->