[RFC] KnitData.read_records_iter() returns records out of order

Aaron Bentley aaron.bentley at utoronto.ca
Mon Jun 26 06:53:37 BST 2006


Michael Ellerman wrote:
> Although removing the knit data cache is nice, we still have code in
> read_records_iter() that holds the entire contents of the knit in
> memory.

That's not the usual case.  It only holds the records we request.  For
operations like retrieving files or displaying logs, it doesn't request
all revisions at once.

get_revisions and get_revision_trees can theoretically request all
revisions at once, but their only caller is iter_revisions, which makes
multiple calls to get_revisions with batches of increasing size.
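
Roughly, the batching pattern looks like this (a sketch from memory;
the starting batch size and growth factor are illustrative, not the
actual bzrlib numbers):

    def iter_revisions(repository, revision_ids):
        # Ask for a small batch first, then progressively larger
        # ones: early results stream out quickly, and later batches
        # amortise the per-request overhead.
        batch_size = 10  # illustrative starting size
        pending = list(revision_ids)
        while pending:
            batch, pending = pending[:batch_size], pending[batch_size:]
            for revision in repository.get_revisions(batch):
                yield revision
            batch_size *= 2  # requests grow, as described above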

But it looks like iter_lines_added_or_present_in_versions does request
all records at once.

Other callers of read_records usually request only the records needed
to produce a single revision, so they would not contribute to memory
bloat.  One caveat: I do not know how the fetchers operate, but I
believe they generally use read_records_iter_raw, not the method you're
talking about.

> Once we've allocated that memory it's never returned to the
> OS, which (somewhat artificially) inflates the memory usage of bzr.

In the log cases, I would expect the size of the text_maps to dwarf the
size of the record maps.

> The reason we hold the whole thing in memory is so that we can return
> the records in the order the caller asked for them. However AFAICT
> none of the callers _in the bzr tree_ care what the order is. So we're
> much better off just handing back records in the order we get them
> from readv(), and therefore we only hold one unzipped record in memory
> at any given time.

My preference, in terms of cleanliness, would be to have the Transport
take care of requesting the records in optimal order and returning them
in the requested order.
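
To make that concrete, here is a rough sketch of the kind of wrapper I
mean (not the real Transport API; it assumes offsets are unique
(offset, length) pairs and that readv() yields (offset, data) pairs):

    def readv_in_requested_order(transport, path, offsets):
        # Issue the underlying readv() in sorted, disk-optimal order,
        # buffer anything that arrives ahead of what the caller is
        # waiting for, and yield results in the order requested.
        wanted = list(offsets)
        buffered = {}
        cursor = 0
        for offset, data in transport.readv(path, sorted(wanted)):
            buffered[offset] = data
            while (cursor < len(wanted)
                   and wanted[cursor][0] in buffered):
                yield buffered.pop(wanted[cursor][0])
                cursor += 1

In the worst case that buffers everything, but typical knit access
patterns are already close to sorted, so the buffer stays small.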

Callers like _get_record_map must store all the records in a dict
anyway, so they would see no memory savings.
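
For example, something like this (the tuple shapes are assumed, not
copied from the real method):

    def _get_record_map(self, version_ids):
        # Every record ends up keyed in the dict regardless, so the
        # order records arrive in makes no difference to peak memory.
        record_map = {}
        for version_id, record, digest in \
                self._data.read_records_iter(version_ids):
            record_map[version_id] = (record, digest)
        return record_map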

> Given that the current API is supposed to be stable, I guess we need
> to keep this code, and add a new read_records_unsorted_iter() ?

It should be easy to split the existing method in two: one that
produces a generator, and one that sorts the generator's output.
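
Something along these lines (a sketch only; I'm assuming records are
(version_id, pos, size) tuples, and the attribute and method names are
guesses, not the real ones):

    def read_records_iter_unsorted(self, records):
        # Yield each record's bytes as soon as they come back from
        # readv(), so only one record is held at a time.
        by_pos = dict((pos, version_id)
                      for version_id, pos, size in records)
        offsets = [(pos, size) for version_id, pos, size in records]
        for pos, data in self._transport.readv(self._filename, offsets):
            yield by_pos[pos], data

    def read_records_iter(self, records):
        # The existing, order-preserving API becomes a thin wrapper
        # that buffers the unsorted stream and replays it in request
        # order -- the same memory profile as before.
        buffered = dict(self.read_records_iter_unsorted(records))
        for version_id, pos, size in records:
            yield version_id, buffered[version_id]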

> And I should write some tests too :)
> 
> This causes a nice reduction in memory usage for branch on trees with
> large inventories:
> http://michael.ellerman.id.au/files/samba-combined.png

Well, it does sound like a nice improvement, then.

Aaron