[RFC] KnitData.read_records_iter() returns records out of order

Mon Jun 26 06:14:14 BST 2006

Hi guys,

Although removing the knit data cache is nice, we still have code in
read_records_iter() that holds the entire contents of the knit in
memory. Once we've allocated that memory it's never returned to the
OS, which (somewhat artificially) inflates the memory usage of bzr.

The reason we hold the whole thing in memory is so that we can return
the records in the order the caller asked for them. However, AFAICT
none of the callers _in the bzr tree_ care what the order is. So we're
much better off just handing back records in the order we get them
from readv(), which means we only hold one unzipped record in memory
at any given time.
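
To make the trade-off concrete, here is a toy sketch in modern Python
(not the bzrlib code). fake_readv() is a made-up stand-in for the
transport's readv(), and the record tuples are invented; the only point
is that the ordered version has to fill a dict with every record before
it can yield anything, while the unordered version yields each record
as it arrives.

```python
import operator


def fake_readv(offsets):
    """Hypothetical stand-in for transport.readv(): yields (pos, data)
    in the order the offsets were requested."""
    for pos, size in offsets:
        yield pos, "data@%d" % pos


def read_in_caller_order(records):
    """Buffer every record, then yield in the order the caller asked."""
    needed = sorted(set(records), key=operator.itemgetter(1))
    response = fake_readv([(pos, size) for _, pos, size in needed])
    record_map = {}  # grows until it holds every record
    for (version_id, _, _), (_, data) in zip(needed, response):
        record_map[version_id] = data
    for version_id, _, _ in records:
        yield version_id, record_map[version_id]


def read_in_file_order(records):
    """Yield each record as soon as it is read; nothing is buffered."""
    needed = sorted(set(records), key=operator.itemgetter(1))
    response = fake_readv([(pos, size) for _, pos, size in needed])
    for (version_id, _, _), (_, data) in zip(needed, response):
        yield version_id, data


# (version_id, pos, size) tuples, deliberately not in file order
records = [("c", 20, 5), ("a", 0, 5), ("b", 10, 5)]
in_caller_order = [v for v, _ in read_in_caller_order(records)]
in_file_order = [v for v, _ in read_in_file_order(records)]
```

With these inputs the first generator follows the request order and the
second follows the on-disk positions.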

Given that the current API is supposed to be stable, I guess we need
to keep this code and add a new read_records_unsorted_iter()? And I
should write some tests too :)
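
One possible shape for that, purely as a sketch: the existing ordered
method could be kept as a thin wrapper over the new unordered one, so
only callers that really need ordering pay for the buffering. The class
below is a hypothetical stand-in, not bzrlib's KnitData, and it reads
from a plain dict instead of a transport.

```python
class KnitDataSketch:
    """Hypothetical stand-in for KnitData, backed by a dict of pos -> data."""

    def __init__(self, blobs):
        self._blobs = blobs

    def read_records_unsorted_iter(self, records):
        """Yield (version_id, data) in on-disk order, one record at a time."""
        for version_id, pos, size in sorted(set(records), key=lambda r: r[1]):
            yield version_id, self._blobs[pos]

    def read_records_iter(self, records):
        """Stable API: yield in the order the caller asked for.

        This still buffers everything, exactly as before."""
        record_map = dict(self.read_records_unsorted_iter(records))
        for version_id, pos, size in records:
            yield version_id, record_map[version_id]


knit = KnitDataSketch({0: "first", 10: "second", 20: "third"})
wanted = [("v3", 20, 5), ("v1", 0, 5), ("v2", 10, 5)]
ordered = list(knit.read_records_iter(wanted))
unordered = list(knit.read_records_unsorted_iter(wanted))
```

Callers that don't care about order would switch to the unsorted
iterator; everything else keeps working unchanged.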

This gives a nice reduction in memory usage for 'bzr branch' on trees
with large inventories:
http://michael.ellerman.id.au/files/samba-combined.png

cheers

=== modified file 'bzrlib/knit.py'

--- bzrlib/knit.py      2006-06-26 03:19:42 +0000
+++ bzrlib/knit.py      2006-06-26 03:24:34 +0000
@@ -1433,20 +1433,15 @@
         # Get unique records, sorted by position
         needed_records = sorted(set(records), key=operator.itemgetter(1))

-        # We take it that the transport optimizes the fetching as good
+        # We take it that the transport optimizes the fetching as well
         # as possible (ie, reads continuous ranges.)
         response = self._transport.readv(self._filename,
             [(pos, size) for version_id, pos, size in needed_records])

-        record_map = {}
-        for (record_id, pos, size), (pos, data) in \
+        for (version_id, pos, size), (pos, data) in \
             izip(iter(needed_records), response):
-            content, digest = self._parse_record(record_id, data)
-            record_map[record_id] = (digest, content)
-
-        for version_id, pos, size in records:
-            digest, content = record_map[version_id]
-            yield version_id, content, digest
+            content, digest = self._parse_record(version_id, data)
+            yield version_id, list(content), digest

     def read_records(self, records):
         """Read records into a dictionary."""