[RFC] KnitData.read_records_iter() returns records out of order

Michael Ellerman michael at ellerman.id.au
Mon Jun 26 07:27:39 BST 2006


On 6/26/06, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> Michael Ellerman wrote:
> > Although removing the knit data cache is nice, we still have code in
> > read_records_iter() that holds the entire contents of the knit in
> > memory.
>
> That's not the usual case.  It only holds the records we request.  For
> operations like retrieving files, or displaying logs, it doesn't request
> all revisions at once.

True, though it does hold the data for every revision you request at once.

> get_revisions and get_revision_trees can theoretically request all
> revisions at once, but their only caller is iter_revisions, which makes
> multiple requests to get_revisions, of increasing size.
>
> But it looks like iter_lines_added_or_present_in_versions does request
> all records at once.

It depends on which version_ids you ask for, but potentially yes. This
is the code path I'm looking at:

RepoFetcher.__fetch()
RepoFetcher._fetch_weave_texts()
Repository.fileids_altered_by_revision_ids()
Knit.iter_lines_added_or_present_in_versions()
_KnitData.read_records_iter()

> > The reason we hold the whole thing in memory is so that we can return
> > the records in the order the caller asked for them. However AFAICT
> > none of the callers _in the bzr tree_ care what the order is. So we're
> > much better off just handing back records in the order we get them
> > from readv(), and therefore we only hold one unzipped record in memory
> > at any given time.
>
> My preference, in terms of cleanliness, would be to have the Transport
> take care of requesting the records in optimal order and returning them
> in the requested order.

I disagree: although that might be clean, it means we end up doing
work we don't need to. I'd rather push the sorting requirement up the
stack, so that we delay sorting as long as possible, and some callers
won't require sorting at all.

So we document the fact that readv() may return records out of order,
and likewise on up the stack. We can write an OutOfOrderLocalTransport
that returns everything in random order, for testing.
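As a rough illustration of that testing idea, a wrapper could shuffle
the results of any readv-style callable before handing them back. (This
is a sketch, not bzrlib code: the class name, the seed parameter, and the
(offset, data) result shape are assumptions for the example.)

```python
import random


class OutOfOrderReadv:
    """Test double: wrap a readv-style callable and yield its
    (offset, data) results in a shuffled order, so tests can catch
    callers that silently depend on the request order."""

    def __init__(self, readv_func, seed=None):
        self._readv = readv_func
        # Seedable RNG so a failing shuffle order is reproducible.
        self._rng = random.Random(seed)

    def readv(self, relpath, offsets):
        # Collect all results, shuffle, then yield one at a time.
        results = list(self._readv(relpath, offsets))
        self._rng.shuffle(results)
        for item in results:
            yield item
```

A real OutOfOrderLocalTransport would presumably subclass the local
transport and override readv() the same way, but the shuffle-and-yield
core would look much like this.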

> Callers like _get_record_map must store all the records in a dict
> anyway, so they'll have no memory savings.

Sure, but that's a separate performance/memory tradeoff that we can
decide to make.

> > Given that the current API is supposed to be stable, I guess we need
> > to keep this code, and add a new read_records_unsorted_iter() ?
>
> Should be easy to split the existing method up into two, one that
> produces a generator, one that sorts the generator output.

Yep.

cheers