[MERGE] Make 'bzr co' more memory sensitive
Vincent Ladeuil
v.ladeuil+lp at free.fr
Mon Oct 6 13:21:22 BST 2008
>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:
john> At the moment, we've been bringing in some really nice apis, which allow
john> us to discuss large amounts of data at once. One case is that now "bzr
john> co" is able to pass the list of all files to checkout into
john> "iter_files_bytes()", which can then be spooled out in any order.
john> This is very nice for CPU time, as it doesn't require us to access the
john> same things over and over again. However, it turns out that it causes us
john> to unpack the entire WT into memory before we write it out.
john> This was the cause of:
john> https://bugs.edge.launchpad.net/bzr/+bug/277171
john> and
john> https://bugs.edge.launchpad.net/bzr/+bug/269456
john> There are several things to be done to combat this, but I'm starting
john> with this. In testing on a mysql tree, doing "bzr co" with current
john> bzr.dev uses around 700MB (the tree is 157MB). When I do the same with
john> my patch, it uses 200MB right away (loading all the relevant text
john> indexes), and then it stays stable at under 270MB.
john> It does slow down the time for "bzr co" by a little bit (at least as
john> long as you aren't hitting swap). My guess is that asking for
john> everything-at-once is able to order the reads from the pack repository
john> in a more linear order.
john> The difference isn't huge, though:
john> real 1m42.557s
john> vs
john> real 2m6.050s
In my case, it went from 45 minutes thrashing a 1GB virtual
machine (granted, there were other processes using part of that
1GB) to 'real 190.05s' with a memory peak around 230MB.
Enough said.
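
To restate the approach: the requested keys are split by their
file-id prefix and each group is extracted separately, so only
one file-id's texts sit in memory at a time instead of the whole
tree. A minimal sketch of that shape (plain Python, not the
bzrlib code itself; the tuple-key layout, the split_by_prefix
name and the fetch_texts callback are assumptions for
illustration):

from collections import defaultdict


def split_by_prefix(keys):
    """Group keys by their first element, preserving first-seen order."""
    groups = defaultdict(list)
    order = []
    for key in keys:
        prefix = key[0]
        if prefix not in groups:
            order.append(prefix)
        groups[prefix].append(key)
    return [groups[prefix] for prefix in order]


def extract_in_batches(keys, fetch_texts):
    """Yield (key, text) pairs one prefix group at a time.

    Assuming fetch_texts only materialises the texts it is asked
    for, the memory peak is bounded by one group rather than by
    the whole tree.
    """
    for group in split_by_prefix(keys):
        for key, text in fetch_texts(group):
            yield key, text

Judging from the quoted docstring, the real patch does the
equivalent split inside knit.py and then feeds each group to
_get_content_maps() on its own.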
<snip/>
john> === modified file 'bzrlib/knit.py'
john> --- bzrlib/knit.py 2008-10-01 05:40:45 +0000
john> +++ bzrlib/knit.py 2008-10-03 16:14:39 +0000
john> @@ -1124,6 +1124,26 @@
john> record_map[key] = record, record_details, digest, next
john> return record_map
john> + def _split_by_prefix(self, keys):
john> + """For the given keys, split them up based on their prefix.
john> +
john> + To keep memory pressure somewhat under control, split the
john> + requests back into per-file-id requests, otherwise "bzr co"
john> + extracts the full tree into memory before writing it to disk.
john> + This should be revisited if _get_content_maps() can ever cross
john> + file-id boundaries.
Can you write a test guarding against regression for that
particular case?
BB:tweak
Consider that an approve from me if the test above is really too
hard to get right, but in that case, add a FIXME.
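
Something along these lines would do (a rough sketch only,
written against a local stand-in; the real test should live next
to the knit tests and exercise the actual _split_by_prefix
helper, and the (file_id, revision_id) key layout is an
assumption):

import unittest
from collections import defaultdict


def split_by_prefix(keys):
    # Stand-in with the behaviour described in the quoted docstring:
    # group keys by their file-id prefix (assumed first tuple element).
    groups = defaultdict(list)
    for key in keys:
        groups[key[0]].append(key)
    return list(groups.values())


class TestSplitByPrefix(unittest.TestCase):

    def test_groups_are_per_file_id(self):
        keys = [
            ('file-a', 'rev-1'),
            ('file-b', 'rev-1'),
            ('file-a', 'rev-2'),
        ]
        groups = split_by_prefix(keys)
        # Each group must cover exactly one file-id, so extraction
        # never holds more than one file's texts at a time.
        for group in groups:
            self.assertEqual(1, len(set(key[0] for key in group)))
        # The split must neither drop nor duplicate any key.
        flattened = [key for group in groups for key in group]
        self.assertEqual(sorted(keys), sorted(flattened))


if __name__ == '__main__':
    unittest.main()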
On the performance implications, I'd say this fix should go in
*even* if it makes the operation slower on smaller cases. We
can't reasonably call it a performance regression since, for
large projects, the current implementation is simply not viable.
Vincent