Rev 447: Clean up a slow query that can be *much* faster. in http://bazaar.launchpad.net/~jameinel/loggerhead/fast_merged_into_revs

John Arbash Meinel john at arbash-meinel.com
Thu Jan 27 23:01:33 UTC 2011


At http://bazaar.launchpad.net/~jameinel/loggerhead/fast_merged_into_revs

------------------------------------------------------------
revno: 447
revision-id: john at arbash-meinel.com-20110127230120-sqwopu3i5ic6rfr1
parent: john at arbash-meinel.com-20110127163725-h7dopgv2ciwvgn9l
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: fast_merged_into_revs
timestamp: Thu 2011-01-27 17:01:20 -0600
message:
  Clean up a slow query that can be *much* faster.
  
  The get_revids_from() code was taking a very ugly code path to determine
  which mainline revision merged a given list of revisions. It was walking
  each mainline revision, finding the full set of merged revisions, and
  checking whether any of the revisions we cared about were present.
  Instead, go straight to the optimized db queries, which do a join to
  determine whether any of the interesting revisions are present in a
  given group of mainline ancestry (sketched below).
  
  Some numbers:
    Querying for the history of a file with only a couple of revs: the old
    (pre-history-db) code would iterate over the whole ancestry and take
    about 3s to find those 2 mainline revs.
    The history-db code, doing the same 'give me all the ancestry' walk,
    slowed down to the point that it took 30-80s to answer the same query.
    The optimized code path now drops that time to 187ms: up to ~400x
    faster than the bad case (80s), and ~16x faster than the
    non-history-db code (3s).
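
A rough sketch of the new lookup path (the helper name is hypothetical; the
Querier method names and return shapes are inferred from the history.py diff
below, and the shapes described in the comments are illustrative, not a
documented loggerhead API):

    # Sketch only: assumes Querier.get_mainline_where_merged() maps each
    # interesting revision id to the mainline revision id that merged it,
    # and get_dotted_revnos() maps revision ids to dotted-revno tuples.
    def mainline_revids_newest_first(querier, revid_list):
        # One join-backed query instead of walking every mainline rev.
        merged_in = querier.get_mainline_where_merged(revid_list)
        mainline_revids = set(merged_in.itervalues())
        # Second query: mainline revid -> dotted revno, e.g. (447,).
        revnos = querier.get_dotted_revnos(mainline_revids)
        # Emit mainline revisions newest-first by mainline revno.
        for _, rev_id in sorted(((revno[0], rev_id)
                                 for rev_id, revno in revnos.iteritems()),
                                reverse=True):
            yield rev_id

Two set-based queries replace a per-mainline-revision loop of merge-sorted
walks, which is where the 30-80s in the numbers above was going.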
-------------- next part --------------
=== modified file 'loggerhead/apps/branch.py'
--- a/loggerhead/apps/branch.py	2011-01-20 07:50:26 +0000
+++ b/loggerhead/apps/branch.py	2011-01-27 23:01:20 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2008, 2009, 2010 Canonical Ltd.
+# Copyright (C) 2008-2011 Canonical Ltd.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -159,11 +159,16 @@
         cls = self.controllers_dict.get(path)
         if cls is None:
             raise httpexceptions.HTTPNotFound()
+
         self.branch.lock_read()
         try:
             try:
-                c = cls(self, self.get_history)
-                return c(environ, start_response)
+                from bzrlib import commands
+                def do_it():
+                    c = cls(self, self.get_history)
+                    return c(environ, start_response)
+                #return commands.apply_lsprofiled('profile.txt', do_it)
+                return do_it()
             except:
                 environ['exc_info'] = sys.exc_info()
                 environ['branch'] = self

=== modified file 'loggerhead/history.py'
--- a/loggerhead/history.py	2010-05-15 07:49:53 +0000
+++ b/loggerhead/history.py	2011-01-27 23:01:20 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2006-2010 Canonical Ltd.
+# Copyright (C) 2006-2011 Canonical Ltd.
 #                     (Authored by Martin Albisetti <argentina at gmail.com>)
 # Copyright (C) 2006  Robey Pointer <robey at lag.net>
 # Copyright (C) 2006  Goffredo Baroncelli <kreijack at inwind.it>
@@ -30,9 +30,11 @@
 import datetime
 import logging
 import os
+import pprint
 import re
 import textwrap
 import threading
+import time
 
 from bzrlib import (
     graph,
@@ -219,10 +221,10 @@
         self._branch_tags = None
         self._inventory_cache = {}
         if cache_path is None:
-            cache_path = ':memory:'
+            self._cache_path = ':memory:'
         else:
-            cache_path = os.path.join(cache_path, 'historydb.sql')
-        self._querier = history_db.Querier(cache_path, branch)
+            self._cache_path = os.path.join(cache_path, 'historydb.sql')
+        self._querier = history_db.Querier(self._cache_path, branch)
         # sqlite is single-writer, so block concurrent updates.
         self._querier.set_importer_lock(history_db_importer_lock)
         # This may be premature, but for now if you need History, you almost
@@ -329,17 +331,23 @@
             return
         revid_set = set(revid_list)
 
-        while tip_revid is not None and revid_set:
-            parent_revid = self._get_lh_parent(tip_revid)
-            # TODO: Consider caching this, especially between HTTP requests
-            introduced = self._querier.iter_merge_sorted_revisions(
-                start_revision_id=tip_revid, stop_revision_id=parent_revid)
-            introduced_revs = set([i[0] for i in introduced
-                                   if i[0] in revid_set])
-            if introduced_revs:
-                revid_set.difference_update(introduced_revs)
-                yield tip_revid
-            tip_revid = parent_revid
+        # TODO: We need tests for when start_revid != self.last_revision
+        t = time.clock()
+        # TODO: Update history_db to have the exact query we want. Right now
+        #       we're passing over the history a couple of times, which is
+        #       pretty wasteful.
+        mainline_revs = self._querier.get_mainline_where_merged(revid_list)
+        mainline_revids = set(mainline_revs.itervalues())
+        mainline_revnos = self._querier.get_dotted_revnos(mainline_revids)
+        revno_ordered = sorted([(revno[0], rev_id) for rev_id, revno
+                                in mainline_revnos.iteritems()],
+                               reverse=True)
+        t = time.clock() - t
+        self.log.info('%.3fs get_mainline_where_merged returned %s'
+                      % (t, pprint.pformat(mainline_revs)))
+        for _, rev_id in revno_ordered:
+            yield rev_id
+        return
 
     def get_short_revision_history_by_fileid(self, file_id, tip_revid):
         # FIXME: would be awesome if we could get, for a folder, the list of


