Memory Consumption for 'bzr co' [was Re: Work flow on large repositories]

John Arbash Meinel john at arbash-meinel.com
Sat Jul 31 07:40:09 BST 2010


Short version:
 1) We use ~500MB just to track the inventory state (I see the same
    bloat for just 'bzr ls -R -r-1').
 2) We create almost 5M objects. Though 4M of them shouldn't be tracked
    by the Python garbage collector (strings and StaticTuples), it is
    possible that some of the slowdown comes from triggering too many
    GC passes.
 3) BTreeGraphIndex is not handling CHK entries well (known issue,
    though we can probably do some tuning).
 4) CHKInventory is also a bit of a memory hog (especially combined
    with CHKMap).

Michael Hope wrote:
...

I've been studying the memory dump (and evolving Meliae nicely as I find
ways to explore memory and improve my methodology; the 0.3.0 release
will be thanks to you. :) The big revelation was being able to compute
everything reachable from X while excluding Y.)

Anyway, I've tracked a few things down, though I don't really know how
deep you care to delve. I'll post some of my analysis here. I don't know
that you (Michael) will follow everything, as much of this is aimed at
people who know the bzr internals well.

This dump was taken when the system claimed I was at ~500MB of memory
consumption, before any content had actually been extracted.

Total 5078731 objects, 291 types, Total size = 374.2MiB (392381358 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0 2375950  46 224148214  57  57 4194313 str
     1   63209   1  77855404  19  76 3145868 dict
     2 1647097  32  29645488   7  84      20 StaticTuple
     3  374259   7  14852532   3  88     304 tuple
     4  138464   2  12387988   3  91     536 unicode
     5   70933   1   8283012   2  93    4212 set
     6       1   0   7147504   1  95 7147504 SimpleSet

So notice that I don't see all of the memory yet, mostly because direct
C allocations aren't visible unless the Python objects implement
__sizeof__ and report accurately. Also, I would guess there is some C
allocator overhead, page rounding, etc. (allocating 1kB may grab a 4kB
page and not fill it perfectly).
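
As a toy illustration of that reporting hook (my own example class, not
anything in bzrlib), an object that owns extra data can implement
__sizeof__ to account for it; a C extension type would add its malloc'd
buffer sizes the same way:

import sys

class RecordBuffer(object):
    """Toy container that owns a payload beyond its own object header."""

    def __init__(self, data):
        self._data = data

    def __sizeof__(self):
        # Report our own header plus the payload we hold, so tools that
        # trust __sizeof__ (like Meliae) see the full cost.
        return object.__sizeof__(self) + sys.getsizeof(self._data)

print(sys.getsizeof(RecordBuffer('x' * 4096)))  # includes the ~4kB payload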

However, that still gives us 374MB that we could do better on. It also
shows that we have >5 million Python objects active just to process your
60k-entry tree, which sounds a bit high :).

Digging into it, there seem to be a few primary culprits. The #1 culprit
seems to be the chk_bytes.index.
>>> summarize_refs(btgi)
Total 3034760 objects, 12 types, Total size = 160.0MiB (167734728 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0 1506094  49  92931594  55  55      95 str
     1    7466   0  47298808  28  83  393356 dict
     2 1506076  49  27109344  16  99      20 StaticTuple

We have loaded 3M objects (str+tuples) just here. Some of these will be
shared, and we do need some of them, but ultimately we don't need all of
them.

I think there are a few contributing pieces.

 a) Our btree pages aren't really great with CHK data. The keys for CHK
    data are sha1 hashes, and thus inherently have no locality. Our
    btree design sorts them lexicographically, and then zlib compresses
    them into 4kB (compressed) pages. This puts about 100 records into a
    single page.
    To read 1 record, we read (and cache) the page.
    In your case, there seem to be 7473 total leaf pages, and we've read
    7466 of them (99.9%), resulting in caching about 700k records (2
    tuples and 2 strings per record; see the back-of-the-envelope check
    after this list).
 b) gcc-linaro is large enough that the inventory tree has split into
    3 levels (root-internal, internal, leaves). To grab the full
    inventory, we read ~48k LeafNodes and 400 InternalNodes, so we
    *should* only be looking up <50k chk pages. That means we are
    reading 700k/50k = 14x too many rows from the btree index.
 c) Because there is no locality, the chk_bytes index is configured to
    cache all reads; other indexes are designed to LRUCache. (When
    branching from scratch, though, we *do* need all index entries, and
    we didn't want to thrash re-reading all those extra rows only to
    throw them away because the random mapping breaks LRU.)
    Note that changing the caching strategy would improve peak memory,
    but it doesn't address the extra *time* spent reading and parsing
    14x rows that we don't use.
 d) I do have a proposal for an alternate CHKIndex, which we might be
    able to incrementally implement (though it requires a disk format
    change).
 e) Looking at the above, we spend 160MB on 700k records, or ~234 bytes
    per record. If we had a C struct for the data, I would do:
    struct chk_index_record {
        uint64_t block_start;
        uint32_t block_len;
        uint32_t start_byte;
        uint32_t end_byte;
        char sha1[20]; /* binary, not hex */
    };

    That is 8+4+4+4+20 = 40 bytes per record, which could drop 160MB to
    27MB.
    Note that 47MB out of the 160MB is because we use dicts for fast
    lookup. We could bisect an array of these blocks if we wanted to,
    or lots of other things (move away from local dicts into just a
    global one, etc.); see the sketch after this list. However, using a
    'dict' inherently means putting Python objects into it, though we
    could go with a single 20-byte non-hex sha1 value rather than a
    tuple with a single entry.

    Note also that in the extreme of not loading the extra content plus
    a tight memory layout, we get all the way down from 160MB to almost
    2MB. Definitely food for thought.
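
To make the idea in (e) a bit more concrete, here is a minimal sketch in
plain Python (hypothetical helpers, not our actual index code) that
packs the 40-byte records into one sorted byte string and bisects it on
lookup, instead of keeping a dict of per-record Python objects. I've put
the binary sha1 first so the records sort by key:

import struct

# 20-byte binary sha1, then block_start (uint64) and block_len,
# start_byte, end_byte (uint32 each): 40 bytes per record.
_RECORD = struct.Struct('>20sQIII')

def build_index(records):
    # records: iterable of (sha1, block_start, block_len, start_byte,
    # end_byte) tuples, with sha1 as a 20-byte binary string.
    return b''.join(sorted(_RECORD.pack(*rec) for rec in records))

def lookup(index, sha1):
    # Binary search the packed array for one 20-byte binary sha1.
    lo, hi = 0, len(index) // _RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        if index[mid * _RECORD.size:mid * _RECORD.size + 20] < sha1:
            lo = mid + 1
        else:
            hi = mid
    offset = lo * _RECORD.size
    if index[offset:offset + 20] == sha1:
        # (block_start, block_len, start_byte, end_byte)
        return _RECORD.unpack_from(index, offset)[1:]
    return None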
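
And to put rough numbers behind (a) (my own back-of-the-envelope
arithmetic, using the figures above): with sha1 keys there is no
locality, so the ~50k chk records we want are scattered uniformly over
the ~750k records in the index at ~100 per page, which is why we end up
touching essentially every leaf page:

# Chance that a 100-record leaf page contains none of the ~50k keys we
# ask for, assuming the sha1 keys scatter the requests uniformly.
wanted, total, per_page = 50000.0, 750000.0, 100
p_skip = (1 - wanted / total) ** per_page
print('fraction of leaf pages we could skip: %.3f' % p_skip)  # ~0.001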

Looking elsewhere, let's examine the rest of the Repository, ignoring
that one index.
>>> summarize_refs(repo, excluding=[btgi.address])
Total 3809 objects, 48 types, Total size = 36.6MiB (38388443 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0    1345  35  38266499  99  99 4194313 str
     1      49   1     45260   0  99   24716 dict
     2    1742  45     29120   0  99      20 StaticTuple
     3      10   0      5560   0  99     556 GroupCompressBlock

36MB... We might be able to trim some, but that doesn't really seem like
a lot of overhead. The numbers hint that this is the GroupCompressBlock
cache. (Where we cache the actual content we read from the repo.) As
such, I don't feel strongly that we need to do a lot here.


Going off to the side, let's look at the WorkingTree overhead:
>>> summarize_refs(ds, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0  285690  40  20997620  47  47     226 str
     1  212977  30   8781420  19  67      48 tuple
     2   69640   9   8078240  18  85     116 set
     3   73700  10   4114368   9  94   17456 list
     4       2   0   1573144   3  98 1573004 dict
     5   57685   8    692220   1  99      12 int
     6       1   0      1708   0  99    1708 DirState

42MB for the dirstate structures. The key cost is the _id_index dict,
which maps file_ids back to where they might be found. We use a large
dict with 60k file-ids, each mapping to a set containing tuples of
strings; that is about 27MB of referenced data just in that dict.

We could tighten it up a bit by:
 a) Using StaticTuple rather than tuple.
 b) Getting rid of the sets, since 99% of them contain a single entry,
    and doing something more dynamic (a tuple until >~5 items, a set
    beyond that); see the sketch after this list.
 c) That would also move ~300k items out of the gc, though that may not
    matter.
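
A minimal sketch of what (b) might look like (hypothetical helper, not
the actual dirstate API); each value starts life as a tuple and only
upgrades to a set once it is big enough to be worth the per-set
overhead:

_SET_THRESHOLD = 5

def add_location(container, entry):
    # Return the container with entry added, upgrading tuple -> set
    # only once it has grown past the threshold.
    if isinstance(container, set):
        container.add(entry)
        return container
    if entry in container:
        return container
    if len(container) < _SET_THRESHOLD:
        return container + (entry,)
    upgraded = set(container)
    upgraded.add(entry)
    return upgraded

# Usage: values start as empty tuples; callers rebind the returned value.
locations = ()
locations = add_location(locations, ('dir-path', 'name', 'a-file-id'))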


And finally, back to CHKInventory (excluding the chk_bytes repository
objects)
>>> summarize_refs(chki, excluding=[chk_bytes.address])
Total 1231512 objects, 16 types, Total size = 107.6MiB (112783517 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0  654339  53  80479605  71  71     536 str
     1   51514   4  15394232  13  85 1573004 dict
     2   65581   5   4721832   4  89      72 InventoryFile
     3   69639   5   3744212   3  92     208 unicode
     4  186732  15   3266268   2  95      20 StaticTuple
     5   47122   3   2827320   2  97      60 LeafNode
     6  152186  12   1826232   1  99      12 int
     7    4058   0    503192   0  99     124 CHKInventoryDirectory

So 65k files, and 4k directories, or roughly 70k total entries.

Again we have one very large dict, and it is mapping file_id => entry.
However, we also have lots of smaller dicts and 600k strings, which turn
out to be primarily the chk maps themselves.

First the parent_id_basename_to_file_id map:
Total 278680 objects, 11 types, Total size = 21.0MiB (21970735 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0  186781  67  17034111  77  77     226 str
     1    4478   1   3097576  14  91   24716 dict
     2   74117  26   1464428   6  98      20 StaticTuple
     3    4402   1    264120   1  99      60 LeafNode
     4    8821   3    105852   0  99      12 int
     5      76   0      4560   0  99      60 InternalNode

Not a huge amount here, though it adds up. However, the id_to_entry map:

Total 536031 objects, 11 types, Total size = 51.5MiB (54048060 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0  251531  46  41019668  75  75     536 str
     1   42976   8   7616768  14  89    6284 dict
     2   42720   7   2563200   4  94      60 LeafNode
     3  112615  21   1801840   3  98      16 StaticTuple
     4   85928  16   1031136   1  99      12 int
     5     256   0     15360   0  99      60 InternalNode

This is partly where splitting 3-deep hurts, as each LeafNode only has a
couple of entries (70k/42.7k = 1.6 entries per leaf).

Also note that the pid map takes a (file_id, name) key and maps it to a
file_id, while id_to_entry takes (file_id,) and maps it to
'file_id+value', and those file_id strings are duplicated everywhere. So
you have 3 copies of each file-id, and they're not particularly small.
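
One way to claw some of that back (a sketch only; the helper and cache
here are hypothetical, not something we do today) would be to funnel
file_ids through a single intern-style cache, so that the pid map and
the id_to_entry keys and values all end up referencing one shared string
object per file-id:

_file_id_cache = {}

def canonical_file_id(file_id):
    # Return one shared copy of file_id, so each map stores a reference
    # to the same string instead of its own duplicate.
    return _file_id_cache.setdefault(file_id, file_id)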

I did check, and some of the CHKMap tuples are re-used from the btree
index from before. Only about 50k, though, as I determined earlier.

Anyway, it is now *way* past my bedtime, but I figured I should write up
some of this stuff. If only to get it out of my head so I can stop
thinking about it and sleep :).

John
=:->


