Rev 447: Clean up a slow query that can be *much* faster. in http://bazaar.launchpad.net/~jameinel/loggerhead/fast_merged_into_revs
John Arbash Meinel
john at arbash-meinel.com
Thu Jan 27 23:01:33 UTC 2011
At http://bazaar.launchpad.net/~jameinel/loggerhead/fast_merged_into_revs
------------------------------------------------------------
revno: 447
revision-id: john at arbash-meinel.com-20110127230120-sqwopu3i5ic6rfr1
parent: john at arbash-meinel.com-20110127163725-h7dopgv2ciwvgn9l
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: fast_merged_into_revs
timestamp: Thu 2011-01-27 17:01:20 -0600
message:
Clean up a slow query that can be *much* faster.
The get_revids_from() code was taking a very inefficient code path to
determine which mainline revision merged each of a given list of
revisions. It walked every mainline revision, built the full set of
revisions merged by it, and checked whether any of the revisions we
cared about were present.
Instead, go straight to the optimized db queries, which use a join to
determine whether any of the interesting revisions are present in a
group of mainline ancestry.
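The contrast can be sketched in plain Python. This is not loggerhead code: the mainline data and both functions below are invented for illustration, with the dict lookup standing in for the joined db query that the patch's Querier.get_mainline_where_merged() performs.

```python
# Illustrative sketch (not loggerhead code): contrast the old per-revision
# walk with a direct "where merged" lookup. Mainline is newest-first; each
# entry lists the full set of revisions that mainline revision merged.
MAINLINE = [
    ('m3', {'m3', 'x', 'y'}),
    ('m2', {'m2'}),
    ('m1', {'m1', 'z'}),
]

def old_walk(interesting):
    """Old path: expand the merged set of every mainline rev, one by one."""
    remaining = set(interesting)
    for tip, merged in MAINLINE:  # one expensive expansion per mainline rev
        hit = merged & remaining
        if hit:
            remaining -= hit
            yield tip
        if not remaining:
            break

def new_lookup(interesting):
    """New path: one joined query answers 'which mainline rev merged X?'"""
    # Simulates the result shape of get_mainline_where_merged():
    # rev_id -> mainline rev that merged it.
    where_merged = {r: tip for tip, merged in MAINLINE for r in merged}
    return sorted({where_merged[r] for r in interesting}, reverse=True)

print(list(old_walk({'y', 'z'})))  # ['m3', 'm1']
print(new_lookup({'y', 'z'}))      # ['m3', 'm1']
```

Both paths answer the same question; the difference is that the old one does work proportional to the whole mainline, while the new one asks the database directly.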
Some numbers, querying for the history of a file touched by only a
couple of revisions:
In the old (pre-history-db) code, we iterated over the whole ancestry
and took about 3s to find those 2 mainline revs.
In the history-db code, the same 'give me all the ancestry' approach
slowed down to the point that it took 30-80s to answer the same query.
Doing it with the optimized code path now drops that time down to
187ms. That is up to ~400x faster than the bad case, and about 16x
faster than the non-history-db code.
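A quick sanity check of the speedup arithmetic, using the timings quoted above:

```python
# Timings quoted above, all in seconds.
old_pre_history_db = 3.0     # pre-history-db full-ancestry walk
old_history_db_worst = 80.0  # worst case of the slow history-db path
new = 0.187                  # optimized query path

print(round(old_history_db_worst / new))  # 428, i.e. "up to ~400x"
print(round(old_pre_history_db / new))    # 16, vs the pre-history-db code
```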
=== modified file 'loggerhead/apps/branch.py'
--- a/loggerhead/apps/branch.py 2011-01-20 07:50:26 +0000
+++ b/loggerhead/apps/branch.py 2011-01-27 23:01:20 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2008, 2009, 2010 Canonical Ltd.
+# Copyright (C) 2008-2011 Canonical Ltd.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -159,11 +159,16 @@
cls = self.controllers_dict.get(path)
if cls is None:
raise httpexceptions.HTTPNotFound()
+
self.branch.lock_read()
try:
try:
- c = cls(self, self.get_history)
- return c(environ, start_response)
+ from bzrlib import commands
+ def do_it():
+ c = cls(self, self.get_history)
+ return c(environ, start_response)
+ #return commands.apply_lsprofiled('profile.txt', do_it)
+ return do_it()
except:
environ['exc_info'] = sys.exc_info()
environ['branch'] = self
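The commented-out apply_lsprofiled call in the hunk above is a profiling hook left in for debugging. The same wrap-the-work-in-a-closure pattern can be sketched with the stdlib cProfile instead; the do_it body here is a stand-in, not loggerhead code:

```python
# Sketch of the same idea with stdlib cProfile instead of bzrlib's lsprof:
# wrap the real work in a closure so it can optionally be profiled.
import cProfile
import io
import pstats

def do_it():
    # Stand-in for: c = cls(self, self.get_history);
    #               return c(environ, start_response)
    return sum(range(1000))

profiler = cProfile.Profile()
result = profiler.runcall(do_it)  # run do_it under the profiler

# Print the five most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(result)  # 499500
```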
=== modified file 'loggerhead/history.py'
--- a/loggerhead/history.py 2010-05-15 07:49:53 +0000
+++ b/loggerhead/history.py 2011-01-27 23:01:20 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2006-2010 Canonical Ltd.
+# Copyright (C) 2006-2011 Canonical Ltd.
# (Authored by Martin Albisetti <argentina at gmail.com>)
# Copyright (C) 2006 Robey Pointer <robey at lag.net>
# Copyright (C) 2006 Goffredo Baroncelli <kreijack at inwind.it>
@@ -30,9 +30,11 @@
import datetime
import logging
import os
+import pprint
import re
import textwrap
import threading
+import time
from bzrlib import (
graph,
@@ -219,10 +221,10 @@
self._branch_tags = None
self._inventory_cache = {}
if cache_path is None:
- cache_path = ':memory:'
+ self._cache_path = ':memory:'
else:
- cache_path = os.path.join(cache_path, 'historydb.sql')
- self._querier = history_db.Querier(cache_path, branch)
+ self._cache_path = os.path.join(cache_path, 'historydb.sql')
+ self._querier = history_db.Querier(self._cache_path, branch)
# sqlite is single-writer, so block concurrent updates.
self._querier.set_importer_lock(history_db_importer_lock)
# This may be premature, but for now if you need History, you almost
@@ -329,17 +331,23 @@
return
revid_set = set(revid_list)
- while tip_revid is not None and revid_set:
- parent_revid = self._get_lh_parent(tip_revid)
- # TODO: Consider caching this, especially between HTTP requests
- introduced = self._querier.iter_merge_sorted_revisions(
- start_revision_id=tip_revid, stop_revision_id=parent_revid)
- introduced_revs = set([i[0] for i in introduced
- if i[0] in revid_set])
- if introduced_revs:
- revid_set.difference_update(introduced_revs)
- yield tip_revid
- tip_revid = parent_revid
+ # TODO: We need tests for when start_revid != self.last_revision
+ t = time.clock()
+ # TODO: Update history_db to have the exact query we want. Right now
+ # we're passing over the history a couple of times, which is
+ # pretty wasteful.
+ mainline_revs = self._querier.get_mainline_where_merged(revid_list)
+ mainline_revids = set(mainline_revs.itervalues())
+ mainline_revnos = self._querier.get_dotted_revnos(mainline_revids)
+ revno_ordered = sorted([(revno[0], rev_id) for rev_id, revno
+ in mainline_revnos.iteritems()],
+ reverse=True)
+ t = time.clock() - t
+ self.log.info('%.3fs get_mainline_where_merged returned %s'
+ % (t, pprint.pformat(mainline_revs)))
+ for _, rev_id in revno_ordered:
+ yield rev_id
+ return
def get_short_revision_history_by_fileid(self, file_id, tip_revid):
# FIXME: would be awesome if we could get, for a folder, the list of
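The revno-descending ordering used in the new get_revids_from() above can be exercised on its own. The rev ids and revnos below are invented, and the sketch uses Python 3 dict methods where the patch (Python 2) uses iteritems():

```python
# Standalone illustration of the ordering in the new get_revids_from():
# sort (revno, rev_id) pairs so newer mainline revisions come first.
# Simulates the result shape of get_dotted_revnos(): rev_id -> revno tuple.
mainline_revnos = {
    'rev-a': (100,),
    'rev-b': (250,),
    'rev-c': (7,),
}

revno_ordered = sorted(
    ((revno[0], rev_id) for rev_id, revno in mainline_revnos.items()),
    reverse=True)

print([rev_id for _, rev_id in revno_ordered])  # ['rev-b', 'rev-a', 'rev-c']
```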
More information about the bazaar-commits mailing list