[MERGE] 22% Faster logs by optimizing get_texts

Sat Jun 17 21:07:55 BST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Hi all,
> 
> This patch continues my log performance work, by implementing get_texts
> so that it is a single readv operation.
> 
> Note that it was also necessary to sort the records before sending them
> to readv-- should readv do its own sorting?
> 
> Before the patch the test ran in 626 ms.  Now it runs in 510.
> 
> Also, I've done a little clean-up work.
> 
> Aaron
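
On the question of whether readv should do its own sorting, here is a minimal sketch of what that could look like. It assumes readv takes (start, length) pairs and must return results in the order requested; readv_sorted is a hypothetical helper written against a plain file object, not the actual Transport API.

```python
import io


def readv_sorted(fileobj, offsets):
    """Hypothetical helper: read (start, length) ranges in ascending offset
    order, but return the data in the order the caller asked for."""
    # Indices of the requests, ordered by their start offset.
    order = sorted(range(len(offsets)), key=lambda i: offsets[i][0])
    results = [None] * len(offsets)
    for i in order:
        start, length = offsets[i]
        fileobj.seek(start)
        results[i] = fileobj.read(length)
    return results


# Requests arrive out of order; the file is still walked front to back.
data = io.BytesIO(b"abcdefghij")
chunks = readv_sorted(data, [(6, 2), (0, 3), (3, 1)])
```

If sorting lived inside readv like this, callers such as get_texts would not need to sort their records first, at the cost of readv having to remember the original request order.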


...

+    def _get_component_positions(self, version_id):
+        needed_versions, basis_versions = \
+            self._get_component_versions(version_id)
+        assert len(basis_versions) == 0
+        positions = []
+        for method, comp_id in needed_versions:
+            data_pos, data_size = self._index.get_position(comp_id)
+            positions.append((method, comp_id, data_pos, data_size))
+        return positions
+

...

+        needed_versions, basis_versions = \
+            self._get_component_versions(version_id)

         components = {}
         if basis_versions:
+            assert True, "I am broken"
+            basis = self.basis_knit

Shouldn't the above be 'assert False, "I am broken"'? As written, 'assert True' can never fail.
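
To make the difference concrete, a minimal sketch (the function names are hypothetical and only mimic the shape of the patched branch): the first guard is a no-op, the second raises as soon as the branch is reached.

```python
def guard_with_assert_true(basis_versions):
    # Hypothetical stand-in for the patched branch: this assertion never fires.
    if basis_versions:
        assert True, "I am broken"
    return "continued"


def guard_with_assert_false(basis_versions):
    # Same branch with the suggested fix: reaching it raises AssertionError.
    if basis_versions:
        assert False, "I am broken"
    return "continued"
```

Note that either form disappears when Python runs with -O, so an explicit raise would be needed if the check must survive optimized runs.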


             records = []
             for comp_id in basis_versions:
                 data_pos, data_size = basis._index.get_data_position(comp_id)
@@ -603,7 +621,6 @@

         # digest here is the digest from the last applied component.
         if sha_strings(content.text()) != digest:
-            import pdb;pdb.set_trace()
             raise KnitCorrupt(self.filename, 'sha-1 does not match %s' % version_id)

         return content


...

I'm starting to wonder if we are hurting ourselves by working with
everything as lines rather than working on them as string blobs.

I know Weaves were very line based, but Knits don't have to be as line
based. I suppose difflib is also line based.
And while PatienceDiff is line based, that could be an implementation
detail rather than part of the public API.

+
+    def get_text(self, version_id):
+        """See VersionedFile.get_text"""
+        return self.get_texts([version_id])[0]
+
+    def get_texts(self, version_ids):
+        return [''.join(l) for l in self.get_line_list(version_ids)]
+
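
For reference, a self-contained sketch of the delegation pattern in this hunk. TinyStore is a hypothetical stand-in, not the Knit class, and its get_line_list is just a dictionary lookup; the point is only that the single-version call goes through the batched one, and that the blob is built by joining the line list at the very end.

```python
class TinyStore:
    """Hypothetical stand-in for a versioned file; not the Knit class."""

    def __init__(self, lines_by_version):
        self._lines = lines_by_version

    def get_line_list(self, version_ids):
        # One batched lookup covering every requested version.
        return [self._lines[v] for v in version_ids]

    def get_texts(self, version_ids):
        # Lines are joined into a single string blob only at this point.
        return [''.join(lines) for lines in self.get_line_list(version_ids)]

    def get_text(self, version_id):
        # The single-version case reuses the batched path.
        return self.get_texts([version_id])[0]


store = TinyStore({"a": ["x\n", "y\n"], "b": ["z\n"]})
```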


...

This is something I just read in a Python tutorial. There is a
compiled module, 'operator', and you can do:

    import operator
    needed_records.sort(key=operator.itemgetter(1))

Since itemgetter is implemented in C, it should be faster than a lambda.

         if len(needed_records):
+            needed_records.sort(key=lambda x:x[1])
             # We take it that the transport optimizes the fetching as
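
A small sketch showing that the two key functions give the same ordering. The record tuples here are made up for illustration (version id, position, size); the real tuples in the patch may carry different fields. Whether itemgetter is measurably faster in this code path would need timing.

```python
import operator

# Hypothetical records shaped like (version_id, position, size).
needed_records = [("rev-c", 300, 40), ("rev-a", 0, 120), ("rev-b", 120, 180)]

# Sort by the second field (index 1) in two equivalent ways.
by_lambda = sorted(needed_records, key=lambda x: x[1])
by_itemgetter = sorted(needed_records, key=operator.itemgetter(1))
```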


In general, I think it looks good. It is a relatively small change: it
requests records in groups instead of one at a time.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFElGEbJdeBCYSNAAMRAjYlAKCTxM6htiBmjirXtxZ8dMDPLe+jZwCeKbnf
f5CaL73C3FaFAt6rnuYXKIw=
=SBNu
-----END PGP SIGNATURE-----