Memory Consumption for 'bzr co' [was Re: Work flow on large repositories]
John Arbash Meinel
john at arbash-meinel.com
Sat Jul 31 07:40:09 BST 2010
Short version:
1) We use ~500MB to just track the inventory state (I see the same
bloat for just 'bzr ls -R -r-1')
2) We create almost 5M objects; though ~4M of them (strings and
StaticTuples) aren't tracked by the Python garbage collector, it is
possible some of the slowdown comes from triggering too many GC passes.
3) BTreeGraphIndex is not handling CHK entries well (a known issue,
though we can probably do some tuning).
4) CHKInventory is also a bit of a memory hog (especially together with
CHKMap).
Michael Hope wrote:
...
I've been studying the memory dump (and evolving Meliae nicely as I find
ways to explore memory and improve my methodology; the 0.3.0 release
will be thanks to you. :) The big revelation was being able to compute
everything reachable from X while excluding Y.)
Anyway, I've tracked a few things down, though I don't really know what
you care to delve into. I'll post some of my analysis here. I don't know
that you (Michael) will understand everything, as this is aimed as much
at people who know bzr internals well.
This dump was taken when the system claimed I was at ~500MB of memory
consumption, before any content was actually being extracted.
Total 5078731 objects, 291 types, Total size = 374.2MiB (392381358 bytes)
Index Count % Size % Cum Max Kind
0 2375950 46 224148214 57 57 4194313 str
1 63209 1 77855404 19 76 3145868 dict
2 1647097 32 29645488 7 84 20 StaticTuple
3 374259 7 14852532 3 88 304 tuple
4 138464 2 12387988 3 91 536 unicode
5 70933 1 8283012 2 93 4212 set
6 1 0 7147504 1 95 7147504 SimpleSet
So notice that I don't see all memory yet. Mostly because direct C
allocations aren't seen unless the Python objects implement __sizeof__,
and give accurate reporting. Also, I would guess there is some C
allocator overhead, page rounding, etc. (allocating 1kB may allocate a
4kB page, and not fill it perfectly, etc.)
However, that still gives us 374MB that we can improve on. It also shows
that we have >5 million Python objects active just to process your
60k-entry tree, which sounds a bit high :).
Digging into it, there seem to be a few primary culprits. The #1 culprit
seems to be the chk_bytes.index.
>>> summarize_refs(btgi)
Total 3034760 objects, 12 types, Total size = 160.0MiB (167734728 bytes)
Index Count % Size % Cum Max Kind
0 1506094 49 92931594 55 55 95 str
1 7466 0 47298808 28 83 393356 dict
2 1506076 49 27109344 16 99 20 StaticTuple
We have loaded 3M objects (str+tuples) just here. Some of these will be
shared, but ultimately we don't need all of it, though we do need some
of it.
I think there are a few bits.
a) Our btree pages aren't really great with CHK data. The keys for CHK
data are sha1 hashes, and thus inherently have no locality. Our
btree design sorts them lexicographically, and then zlib compresses
them into 4kB (compressed) pages. This puts about 100 records into a
single page.
To read 1 record, we read (and cache) the page.
In your case, there seem to be 7473 total leaf pages, and we've read
7466 of them (99.9%), resulting in caching about 700k records. (2
tuples and 2 strings per record)
b) gcc-linaro is large enough that the inventory tree has split into
3-levels. (root-internal, internal, leaves). To grab the full
inventory, we read ~48k LeafNodes, and 400 InternalNodes.
So we *should* only be looking up <50k chk pages.
So we are reading 700k/50k = 14x too many rows from the btree index.
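As a quick sanity check on that ratio, here is the plain arithmetic on the dump numbers above (just illustration, not bzr code):

```python
# Back-of-envelope numbers taken from the dump above.
leaf_pages_read = 7466      # of 7473 total leaf pages (99.9%)
records_per_page = 100      # ~100 CHK records fit in a 4kB compressed page
records_cached = leaf_pages_read * records_per_page   # ~746k, "about 700k"
chk_pages_needed = 50_000   # ~48k LeafNodes + ~400 InternalNodes
print(round(records_cached / chk_pages_needed, 1))    # ~14.9, the "14x" above
```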
c) Because of the lack of locality, the chk_bytes index is configured to
cache all reads, while other indexes use an LRUCache. (When branching
from scratch, though, we *do* need all the index entries, and we
didn't want to thrash re-reading all those extra rows only to throw
them away because the random mapping breaks LRU.)
Note that changing caching strategy will improve peak memory, but it
doesn't address the extra *time* spent reading and parsing 14x rows
that we don't use.
d) I do have a proposal for an alternate CHKIndex, which we might be
able to incrementally implement. (though it requires a disk format
change.)
e) Looking at the above, we spend 160MB on 700k records, or ~234 bytes
per record. If we had a C struct for the data, I would do:
    struct chk_record {
        uint64_t block_start;
        uint32_t block_len;
        uint32_t start_byte;
        uint32_t end_byte;
        char sha1[20];  /* binary, not hex */
    };
That is 8+4+4+4+20 = 40 bytes per record, which could drop 160MB to
~27MB.
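Python's struct module can sanity-check that layout (the format string mirrors the fields above; this is an illustration, not bzr code):

```python
import struct

# '=' means standard sizes with no padding: Q=8 bytes, each I=4, 20s=20.
CHK_RECORD = struct.Struct('=QIII20s')
print(CHK_RECORD.size)                             # 40 bytes per record
print(700_000 * CHK_RECORD.size // (1024 * 1024))  # 26 (MiB) for 700k records
```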
Note that 47M out of the 160M is because we use dicts for fast
lookup. We could bisect an array of these blocks if we wanted to, or
lots of other things. (Move away from local dicts, into just a
global one, etc.) However using a 'dict' inherently means putting
Python objects into it. Though we could go with a single 20-byte
non-hex sha1 value, rather than a tuple with a single entry.
Note also that, in the extreme of not loading the extra content plus a
tight memory layout, we could get from 160MB all the way down to almost
2MB. Definitely food for thought.
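To give a feel for the bisect-an-array idea, here is a minimal sketch over a flat bytes buffer of those 40-byte records, kept sorted by their trailing binary sha1 (hypothetical helper, not bzr code):

```python
import struct

# block_start, block_len, start_byte, end_byte, raw 20-byte sha1
REC = struct.Struct('=QIII20s')

def find_record(packed, sha1_bin):
    """Binary-search a flat buffer of 40-byte records sorted by binary sha1."""
    lo, hi = 0, len(packed) // REC.size
    while lo < hi:
        mid = (lo + hi) // 2
        # The sha1 occupies the last 20 bytes of each 40-byte record.
        key = packed[mid * REC.size + 20:(mid + 1) * REC.size]
        if key < sha1_bin:
            lo = mid + 1
        elif key > sha1_bin:
            hi = mid
        else:
            return REC.unpack_from(packed, mid * REC.size)
    return None
```

Note that no per-record Python objects exist until a matching record is actually unpacked.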
Looking elsewhere, let's examine the rest of the Repository, ignoring
that one index.
>>> summarize_refs(repo, excluding=[btgi.address])
Total 3809 objects, 48 types, Total size = 36.6MiB (38388443 bytes)
Index Count % Size % Cum Max Kind
0 1345 35 38266499 99 99 4194313 str
1 49 1 45260 0 99 24716 dict
2 1742 45 29120 0 99 20 StaticTuple
3 10 0 5560 0 99 556 GroupCompressBlock
36MB... We might be able to trim some, but that doesn't really seem like
a lot of overhead. The numbers hint that this is the GroupCompressBlock
cache. (Where we cache the actual content we read from the repo.) As
such, I don't feel strongly that we need to do a lot here.
Going off to the side, let's look at the WorkingTree overhead:
>>> summarize_refs(ds, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
Index Count % Size % Cum Max Kind
0 285690 40 20997620 47 47 226 str
1 212977 30 8781420 19 67 48 tuple
2 69640 9 8078240 18 85 116 set
3 73700 10 4114368 9 94 17456 list
4 2 0 1573144 3 98 1573004 dict
5 57685 8 692220 1 99 12 int
6 1 0 1708 0 99 1708 DirState
42MB for the dirstate structures. The key cost is the _id_index dict,
which maps file_ids back to where they might be found. We use a large
dict with 60k file-ids, each mapping to a set containing tuples of
strings. That is about 27MB of referenced data just in that one dict.
We could tighten it up a bit by:
a) Using StaticTuple rather than tuple
b) Getting rid of the sets, since 99% of them contain a single entry,
and do something more dynamic. (a tuple until >~5 items, a set
beyond that)
c) That would also move ~300k items out of the gc. May not matter.
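Point (b) could be sketched as a helper that keeps small memberships in a tuple and promotes to a set only past a threshold (the name and the threshold of 5 are assumptions, not dirstate code):

```python
_PROMOTE_AT = 5  # assumed threshold for switching from tuple to set

def add_location(container, entry):
    """Return container with entry added; small containers stay tuples."""
    if isinstance(container, set):
        container.add(entry)
        return container
    if entry in container:
        return container
    if len(container) < _PROMOTE_AT:
        return container + (entry,)
    # Crossing the threshold: promote the tuple to a set.
    return set(container) | {entry}
```

Since 99% of the containers hold a single entry, almost all of them would stay as 1-tuples and never enter the gc.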
And finally, back to CHKInventory (excluding the chk_bytes repository
objects)
>>> summarize_refs(chki, excluding=[chk_bytes.address])
Total 1231512 objects, 16 types, Total size = 107.6MiB (112783517 bytes)
Index Count % Size % Cum Max Kind
0 654339 53 80479605 71 71 536 str
1 51514 4 15394232 13 85 1573004 dict
2 65581 5 4721832 4 89 72 InventoryFile
3 69639 5 3744212 3 92 208 unicode
4 186732 15 3266268 2 95 20 StaticTuple
5 47122 3 2827320 2 97 60 LeafNode
6 152186 12 1826232 1 99 12 int
7 4058 0 503192 0 99 124 CHKInventoryDirectory
So 65k files, and 4k directories, or roughly 70k total entries.
Again we have one very large dict, mapping file_id => entry. However, we
also have lots of smaller dicts and ~600k strings. This turns out to be
primarily the chk maps themselves.
First the parent_id_basename_to_file_id map:
Total 278680 objects, 11 types, Total size = 21.0MiB (21970735 bytes)
Index Count % Size % Cum Max Kind
0 186781 67 17034111 77 77 226 str
1 4478 1 3097576 14 91 24716 dict
2 74117 26 1464428 6 98 20 StaticTuple
3 4402 1 264120 1 99 60 LeafNode
4 8821 3 105852 0 99 12 int
5 76 0 4560 0 99 60 InternalNode
Not a huge amount here, though a bit. However, the id_to_entry_map:
Total 536031 objects, 11 types, Total size = 51.5MiB (54048060 bytes)
Index Count % Size % Cum Max Kind
0 251531 46 41019668 75 75 536 str
1 42976 8 7616768 14 89 6284 dict
2 42720 7 2563200 4 94 60 LeafNode
3 112615 21 1801840 3 98 16 StaticTuple
4 85928 16 1031136 1 99 12 int
5 256 0 15360 0 99 60 InternalNode
This is partly where splitting 3-deep hurts, as each LeafNode only holds
a couple of entries (70k/42.7k = ~1.6 entries per leaf).
Also note that the parent_id_basename map takes a (file_id, name) tuple
and maps it to a file_id, while id_to_entry takes (file_id,) and maps it
to 'file_id+value', and those file_id strings are duplicated everywhere.
So you have 3 copies of each file-id, and they're not particularly small.
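One cheap mitigation would be to intern the file_id strings through a single shared dict, so the three maps reference one copy instead of three; a minimal sketch (hypothetical helper, not bzr code):

```python
_file_id_cache = {}

def intern_file_id(file_id):
    """Return the canonical instance of file_id, so equal ids share storage."""
    return _file_id_cache.setdefault(file_id, file_id)
```

Any equal file_id passed through later comes back as the identical string object, letting the duplicate copies be freed.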
I did check, and some of the CHKMap tuples are re-used from the btree
index from before. Only about 50k, though, as I determined earlier.
Anyway, it is now *way* past my bedtime, but I figured I should write up
some of this stuff. If only to get it out of my head so I can stop
thinking about it and sleep :).
John
=:->