[MERGE] Make 'bzr co' more memory sensitive

John Arbash Meinel john at arbash-meinel.com
Fri Oct 3 17:57:37 BST 2008


At the moment, we've been bringing in some really nice APIs, which allow
us to request large amounts of data at once. One case is that "bzr co"
is now able to pass the list of all files to check out into
"iter_files_bytes()", which can then spool them out in any order.

This is very nice for CPU time, as it doesn't require us to access the
same things over and over again. However, it turns out that it causes us
to unpack the entire working tree into memory before we write it out.
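
To make that concrete, the all-at-once pattern looks roughly like this
(a minimal sketch against the Tree.iter_files_bytes() API; the variable
names and the writing loop are illustrative, not the actual transform
code):

    # Ask for every text up front.  The yield order is unspecified,
    # which is what lets the implementation batch its reads -- and also
    # what lets it buffer everything before handing anything back.
    desired_files = [(file_id, path) for path, file_id in files_to_create]
    for path, bytes_iter in tree.iter_files_bytes(desired_files):
        out = open(path, 'wb')
        try:
            for chunk in bytes_iter:
                out.write(chunk)
        finally:
            out.close()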

This was the cause of:
https://bugs.edge.launchpad.net/bzr/+bug/277171
and
https://bugs.edge.launchpad.net/bzr/+bug/269456

There are several things to be done to combat this; this patch is a
start. In testing on a mysql tree, doing "bzr co" with current bzr.dev
uses around 700MB (the tree itself is 157MB). When I do the same with
my patch, it uses 200MB right away (loading all the relevant text
indexes), and then stays stable at under 270MB.

It does slow down "bzr co" a little (at least as long as you aren't
hitting swap). My guess is that asking for everything at once lets us
order the reads from the pack repository in a more linear fashion.

The difference isn't huge, though:
real    1m42.557s
vs
real    2m6.050s

A better fix would be to rewrite the _get_content_map functionality as
an iterator, with the ability to hold on to *some* intermediate results
that it will need to produce later results, but also able to let go of
other intermediate results when they are no longer needed.
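
The shape I have in mind is something like this (a hedged sketch only:
the details mapping and its (parent, make_lines) pairs are made-up
stand-ins for the real build details, not bzrlib internals):

    import collections

    def iter_contents(sorted_keys, details):
        # details[key] -> (parent_or_None, make_lines): make_lines takes
        # the parent's lines (None for a fulltext) and returns this
        # key's lines.  sorted_keys must list compression parents before
        # their children.
        needed_by = collections.Counter(
            details[k][0] for k in sorted_keys
            if details[k][0] is not None)
        cache = {}
        for key in sorted_keys:
            parent, make_lines = details[key]
            lines = make_lines(cache[parent] if parent is not None
                               else None)
            if parent is not None:
                needed_by[parent] -= 1
                if not needed_by[parent]:
                    del cache[parent]   # intermediate result is now dead
            if needed_by[key]:
                cache[key] = lines      # a later text still builds on this
            yield key, lines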

Also, I should mention that the code buffers all of the intermediate
content as well, not just the final texts. Another small issue is that
we have to return the texts as ''.join(lines) rather than just lines.
That isn't a huge overhead, though: it only doubles the consumption for
each file, and we don't hold on to those strings after they are written
out.
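
For example (assuming 80-byte lines just to make the arithmetic
obvious):

    lines = [('line %4d' % i).ljust(79) + '\n' for i in range(1000)]
    text = ''.join(lines)   # one new ~80KB string alongside ~80KB of lines
    # 'text' briefly doubles the cost of this file, but both copies go
    # away once the content has been written out.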

Just to give an outline of memory consumption...

1) get_parent_map(keys) reads all of the text indexes, because it is
making a very broad request covering all files on disk. This ends with a
peak memory of 270MB and a resident size of 230MB.

2) _get_record_map(keys) reads the fulltext records plus the delta
records from disk. What we hold at this point is the *parsed* records,
not the raw bytes from disk: things have been uncompressed and parsed
into lines. At this point, we are at 565MB of RAM.

3) _get_content_map(keys) converts the raw fulltexts and deltas into
in-memory lists of lines. Note, however, that it is good about sharing
the raw strings between the lists it creates, so most of the extra
memory consumed at this point is probably just "list" objects. I'm not
100% sure, though. 733MB.
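
(To see why the sharing matters: two contents that differ in a few lines
can reference mostly the same string objects, so a derived content costs
little beyond its own list of pointers. A quick illustration:)

    import sys
    base = ['line %d\n' % i for i in range(10000)]
    child = base[:5000] + ['changed\n'] + base[5001:]
    # child reuses 9999 of base's string objects; the new cost is
    # essentially one list of pointers.
    print(sys.getsizeof(child))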


With this patch, we still get (1), but because we request the content
one file at a time, neither (2) nor (3) buffers much at a time.
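
In other words, the loop from the first sketch turns into something like
this (again illustrative; the actual patch makes the equivalent change
down at the get_record_stream level):

    # One request per file: only that file's delta chain is ever
    # buffered, at the cost of less freedom to batch the reads.
    for path, file_id in files_to_create:
        for _, bytes_iter in tree.iter_files_bytes([(file_id, path)]):
            write_out(path, bytes_iter)   # hypothetical helper; writes
                                          # chunks as in the first sketch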

I'll also note that with btree indexes (1) drops to 130MB instead of 230MB.

Also, because of the text sharing, I don't think that doing
"text_map.pop()" ends up saving as much memory as I hoped. The problem
is that there will usually be several texts in the chain, and we are
only removing the last one or two.


I suppose my ideal structure would take the full list of keys that we
actually need to write to disk, look up in the indexes the ancestry it
needs to build those texts, and then sort by pack location to get
optimal read ordering. It would then start reading in order, buffering
what it needs to get to the next text, and releasing the buffers once
the final text is extracted.
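
As a sketch (expand_ancestry and pack_location are hypothetical index
helpers here, and iter_contents is the release-early generator sketched
above):

    def stream_texts(wanted_keys, index, details):
        wanted = set(wanted_keys)
        # 1) expand to every key the delta chains require
        needed = index.expand_ancestry(wanted)
        # 2) read in pack order; after a sorted 'bzr pack' this also
        #    puts compression parents before their children
        read_order = sorted(needed, key=index.pack_location)
        # 3) build texts, dropping intermediates as soon as possible
        for key, lines in iter_contents(read_order, details):
            if key in wanted:
                yield key, ''.join(lines)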

By updating "bzr pack" to sort the file texts, we can also put an upper
bound on the amount of memory we will consume. (Certainly without disk
ordering, the worst case is O(all_texts).)

John
=:->

Attachment: lighter_get_record_stream.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20081003/739b9dad/attachment.diff

