Rev 71: Another dramatic improvement. Now *almost* as fast as the old code :). in http://bzr.arbash-meinel.com/plugins/history_db

John Arbash Meinel john at arbash-meinel.com
Fri Apr 9 18:02:17 BST 2010


At http://bzr.arbash-meinel.com/plugins/history_db

------------------------------------------------------------
revno: 71
revision-id: john at arbash-meinel.com-20100409170157-w9d7yz1x3iejmzwn
parent: john at arbash-meinel.com-20100409145225-tr9v4kg1wefvzt15
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: history_db
timestamp: Fri 2010-04-09 12:01:57 -0500
message:
  Another dramatic improvement. Now *almost* as fast as the old code :).
  
  Specifically, instead of stepping all the way back to X.Y.1, we step back until we
  find *any* X.?.1 revision. Once we've found one, we know that we have the latest.
  
  Now down to 4m3s, which is close to the ~3m30s that I think we used to have.
  Instead of bringing in 20M revisions, we now bring in 5M while walking mainline.
  We still have 650k mainline steps fro 'unknown' but now we are way down to
  only 113k because of sub-branches. *Huge* improvement (was 3M+).
  split_gdfo still filters more than split children.
  
  My guess is that we have a lot more branches that are merged multiple times, and
  so there are extra 'children' that are actually merged later than our current tip.
  However, because they are based on the current revision, their gdfo isn't actually
  high enough to be quickly filtered.
  
  Next possible steps-
    mainline rev cache, make it cheaper to step since we're doing it repeatedly
    step by more-than-one-at-a-time. Using the mainline info we already store.
    If we are *just* looking at the branch_count, we could use a new sql query
      SELECT * FROM dotted_revno WHERE ... AND revno LIKE X.?.1
    So we walk the mainline, and try to quickly load the most recent revno.
-------------- next part --------------
=== modified file 'history_db.py'
--- a/history_db.py	2010-04-08 21:29:31 +0000
+++ b/history_db.py	2010-04-09 17:01:57 +0000
@@ -913,6 +913,46 @@
             _MergeSortNode(db_id, merge_depth, left_parent, pending_parents,
                            is_first))
 
+    def _step_to_latest_branch(self, base_revno):
+        """Step the mainline until we've loaded the latest sub-branch.
+
+        This is used when we need to create a new child branch. We need to
+        ensure that we've loaded the most-recently-merged branch, so that we
+        can generate the correct branch counter.
+
+        For example, if you have a revision whose left-hand parent is 1.2.3,
+        you need to load mainline revisions until you find some revision like
+        (1.?.1). This will ensure that you have the most recent branch merged
+        to mainline that was branched from the revno=1 revision in mainline.
+        
+        Note that if we find (1,3,1) before finding (1,2,1) that is ok. As long
+        as we have found the first revision of any sub-branch, we know that
+        we've found the most recent (since we walk backwards).
+
+        :param base_revno: The revision that this branch is based on. 0 means
+            that this is a new-root branch.
+        :return: None
+        """
+        self._stats['step to latest'] += 1
+        while self._imported_mainline_id is not None:
+            if (base_revno,) in self._dotted_to_db_id:
+                # We have walked far enough to load the original revision,
+                # which means we've loaded all children.
+                self._stats['step to latest found base'] += 1
+                break
+            # Estimate what is the most recent branch, and see if we have read
+            # its first revision
+            branch_count = self._revno_to_branch_count.get(base_revno, 0)
+            root_of_branch_revno = (base_revno, branch_count, 1)
+            # Note: if branch_count == 0, that means we haven't seen any
+            #       other branches for this revision.
+            if root_of_branch_revno in self._dotted_to_db_id:
+                break
+            self._stats['step mainline to-latest'] += 1
+            if base_revno == 0:
+                self._stats['step mainline to-latest NULL'] += 1
+            self._step_mainline()
+
     def _pop_node(self):
         """Move the last node from the _depth_first_stack to _scheduled_stack.
 
@@ -935,14 +975,11 @@
                 # we need a new branch number. To get this correct, we have to
                 # make sure that the beginning of this branch has been loaded
                 if len(parent_revno) > 1:
-                    # TODO: We don't actually need parent_revno, we need
-                    #       (rev, self._revno_to_branch_count[rev], 1), which
-                    #       may be closer.
-                    branch_root = parent_revno[:2] + (1,)
-                    while (self._imported_mainline_id is not None
-                           and branch_root not in self._dotted_to_db_id):
-                        self._stats['step mainline sub-branch'] += 1
-                        self._step_mainline()
+                    # if parent_revno is a mainline, then
+                    # _ensure_lh_parent_info should have already loaded enough
+                    # data. So we only do this when the parent is a merged
+                    # revision.
+                    self._step_to_latest_branch(parent_revno[0])
                 base_revno = parent_revno[0]
                 branch_count = (
                     self._revno_to_branch_count.get(base_revno, 0) + 1)
@@ -969,16 +1006,7 @@
             #       the *first* (0,?,1) revision you find merged must be the
             #       last.
 
-            # As long as we haven't yet walked off the mainline...
-            while self._imported_mainline_id is not None:
-                # This is the oldest root that we have found so far
-                last_new_root = self._revno_to_branch_count.get(0, 0)
-                # See if we have found its 1 revision
-                first_rev_revno = (0, last_new_root, 1)
-                if first_rev_revno in self._dotted_to_db_id:
-                    break
-                self._stats['step mainline null-branch'] += 1
-                self._step_mainline()
+            self._step_to_latest_branch(0)
             branch_count = self._revno_to_branch_count.get(0, -1) + 1
             self._revno_to_branch_count[0] = branch_count
             if branch_count == 0: # This is the mainline

=== modified file 'test_importer.py'
--- a/test_importer.py	2010-04-08 20:00:49 +0000
+++ b/test_importer.py	2010-04-09 17:01:57 +0000
@@ -494,9 +494,9 @@
                           (self.M_id, (1, 2, 5), False, 1),
                           (self.O_id, (6,), False, 0),
                          ])
-        # We have to load G to get E, but we shouldn't have to load D_id, so
-        # that should be where we stop.
-        self.assertEqual(self.D_id, inc_merger._imported_mainline_id)
+        # We have to load I to get H, but at that point we've seen a (1,X,1)
+        # revision, so we know we've seen the newest merged branch.
+        self.assertEqual(self.G_id, inc_merger._imported_mainline_id)
 
     def test_handles_simple_child(self):
         ancestry = {'A': (), 'B': ('A',), 'C': ('B',), 'D': ('C',)}



More information about the bazaar-commits mailing list