[MERGE] Cache inventory.knit so we don't download it twice

Wed Jul 19 06:28:07 BST 2006

On 7/19/06, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Michael Ellerman wrote:
> > On 7/14/06, John Arbash Meinel <john at arbash-meinel.com> wrote:
>
> ...
>
> >
> >
> > I've done a bit of profiling with this patch (including spill to
> > disk), vs bzr.dev and vs. my KnitData.read_records_iter patch. Graph
> > attached.
> >
> > In short it doesn't seem to have much effect on memory usage. What the
> > graph does show, IMHO, is that my read_records_iter() in arbitrary
> > order patch is Good Stuff (TM), and we should work out what changed
> > recently to cause the memory spike at the end of branching.
> >
>
> As a public API, I think it should return in the order requested, if
> only because that is what we have specified so far.
> However, the only callers seem to be internal functions at this point.
>
> read_records_iter_raw() seems to only do a little bit of caching, rather
> than building up a map, and then iterating.
>
> read_records_iter() is building up a map, and then iterating, and as I'm
> sure Michael noticed, the only callers at this point actually build up
> their own map, and then return it.
>
> So some of our inefficiency is probably because we are building up 2
> maps. And I'm guessing the version ids are not interned, so we end up
> holding a lot of copies of the version ids in memory.
>
> For api compatibility, though, I think we can create a
> 'read_records_iter_unsorted', which does what Michael suggests, and then
> read_records_iter, can just thunk into that, and return what it can.
>
> The attached diff is against my 'cache-inventory' branch, with spill to
> disk enabled. But if Michael can run it against his large branches and
> it drops the memory consumption, I'm willing to clean it up for use with
> bzr.dev.


Ok, you missed the most important call to read_records_iter(),
although perhaps that's because you thought it need to be in order?
The one in iter_lines_added_or blah blah.

I hacked your patch to add it:

=== modified file 'bzrlib/knit.py'

--- bzrlib/knit.py      2006-07-17 23:48:11 +0000
+++ bzrlib/knit.py      2006-07-19 05:16:28 +0000
@@ -807,7 +807,7 @@
         try:
             pb.update('Walking content.', count, total)
             for version_id, data, sha_value in \
-                self._data.read_records_iter(version_id_records):
+                self._data.read_records_iter_unsorted(version_id_records):
                 pb.update('Walking content.', count, total)
                 method = self._index.get_method(version_id)
                 version_idx = self._index.lookup(version_id)


And with that it looks good. There's still a nasty spike towards the
end of the branching that I'd like to tie down.

Graph attached.

cheers
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cache-inventory-out-of-order-jam.png
Type: image/png
Size: 9137 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060719/689df55e/attachment.png