Rev 3735: (jam) 'bzr log file' now shows more relevant info, in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Thu Sep 25 02:21:50 BST 2008


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 3735
revision-id: pqm at pqm.ubuntu.com-20080925012144-k71s2olv2fpy771x
parent: pqm at pqm.ubuntu.com-20080924081540-ecfvp6xq4x9gs81n
parent: john at arbash-meinel.com-20080922200907-0uv1skmq5jo1skvn
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Thu 2008-09-25 02:21:44 +0100
message:
  (jam) 'bzr log file' now shows more relevant info,
  	and does so in a more efficient manner.
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
  bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
  bzrlib/tests/test_log.py       testlog.py-20050728115707-1a514809d7d49309
    ------------------------------------------------------------
    revno: 3711.3.23
    revision-id: john at arbash-meinel.com-20080922200907-0uv1skmq5jo1skvn
    parent: john at arbash-meinel.com-20080922132935-klw5fk22s4s1imrx
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Mon 2008-09-22 15:09:07 -0500
    message:
      Documentation and cleanup.
      
      Improve the docstring for _filter_revisions_touching_file_id to indicate
      what revisions will and won't be returned.
      Other general suggestions from Ian.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
      bzrlib/tests/test_log.py       testlog.py-20050728115707-1a514809d7d49309
    ------------------------------------------------------------
    revno: 3711.3.22
    revision-id: john at arbash-meinel.com-20080922132935-klw5fk22s4s1imrx
    parent: john at arbash-meinel.com-20080921144837-wi61tf7gr4jfwl5d
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Mon 2008-09-22 08:29:35 -0500
    message:
      Remove a hangover 'assert'
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.21
    revision-id: john at arbash-meinel.com-20080921144837-wi61tf7gr4jfwl5d
    parent: john at arbash-meinel.com-20080921141555-r6npeijzl5ic1r6r
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Sun 2008-09-21 09:48:37 -0500
    message:
      Fix GraphIndex to properly generate _nodes_by_keys on demand.
    modified:
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
    ------------------------------------------------------------
    revno: 3711.3.20
    revision-id: john at arbash-meinel.com-20080921141555-r6npeijzl5ic1r6r
    parent: john at arbash-meinel.com-20080921140124-kvoi3sdig2owjukf
    parent: pqm at pqm.ubuntu.com-20080921012105-ote1u11mokjim9ir
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Sun 2008-09-21 09:15:55 -0500
    message:
      merge bzr.dev and clean up NEWS
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/builtins.py             builtins.py-20050830033751-fc01482b9ca23183
      bzrlib/directory_service.py    directory_service.py-20080305221044-vr2mkvlsk8jypa2y-1
      bzrlib/help_topics/__init__.py help_topics.py-20060920210027-rnim90q9e0bwxvy4-1
      bzrlib/revisionspec.py         revisionspec.py-20050907152633-17567659fd5c0ddb
      bzrlib/smart/medium.py         medium.py-20061103051856-rgu2huy59fkz902q-1
      bzrlib/smart/server.py         server.py-20061110062051-chzu10y32vx8gvur-1
      bzrlib/tests/blackbox/test_branch.py test_branch.py-20060524161337-noms9gmcwqqrfi8y-1
      bzrlib/tests/blackbox/test_remove_tree.py test_remove_tree.py-20061110192919-5j3xjciiaqbs2dvo-1
      bzrlib/tests/blackbox/test_switch.py test_switch.py-20071122111948-0c5en6uz92bwl76h-1
      bzrlib/tests/test_directory_service.py test_directory_servi-20080305221044-vr2mkvlsk8jypa2y-2
      bzrlib/tests/test_transport_implementations.py test_transport_implementations.py-20051227111451-f97c5c7d5c49fce7
      bzrlib/tests/tree_implementations/test_inv.py test_inv.py-20070312023226-0cdvk5uwhutis9vg-1
      bzrlib/tests/tree_implementations/test_tree.py test_tree.py-20061215160206-usu7lwcj8aq2n3br-1
      bzrlib/transform.py            transform.py-20060105172343-dd99e54394d91687
      bzrlib/transport/local.py      local_transport.py-20050711165921-9b1f142bfe480c24
      bzrlib/tree.py                 tree.py-20050309040759-9d5f2496be663e77
    ------------------------------------------------------------
    revno: 3711.3.19
    revision-id: john at arbash-meinel.com-20080921140124-kvoi3sdig2owjukf
    parent: john at arbash-meinel.com-20080921135843-wnwpal216u3zt2ni
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Sun 2008-09-21 09:01:24 -0500
    message:
      Add a TODO discussing how our index requests should evolve.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.18
    revision-id: john at arbash-meinel.com-20080921135843-wnwpal216u3zt2ni
    parent: john at arbash-meinel.com-20080919035127-okaoylx3tcrtn65w
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Sun 2008-09-21 08:58:43 -0500
    message:
      Update NEWS based on the current algorithm.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
    ------------------------------------------------------------
    revno: 3711.3.17
    revision-id: john at arbash-meinel.com-20080919035127-okaoylx3tcrtn65w
    parent: john at arbash-meinel.com-20080919035029-aibb9jq5waxjjs7c
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 22:51:27 -0500
    message:
      Fix the 'test_log' tests
    modified:
      bzrlib/tests/test_log.py       testlog.py-20050728115707-1a514809d7d49309
    ------------------------------------------------------------
    revno: 3711.3.16
    revision-id: john at arbash-meinel.com-20080919035029-aibb9jq5waxjjs7c
    parent: john at arbash-meinel.com-20080919034224-laq19nw522j3c6ge
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 22:50:29 -0500
    message:
      Doc update.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.15
    revision-id: john at arbash-meinel.com-20080919034224-laq19nw522j3c6ge
    parent: john at arbash-meinel.com-20080919032139-pdcuw9ryqatpqdn9
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 22:42:24 -0500
    message:
      Work around GraphIndex inefficiencies by requesting keys 1000 at a time.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
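    The batching described in that message can be sketched in a few lines of
    (modern, hypothetical) Python; `fake_get_parent_map` below is a toy
    stand-in for a real index lookup such as `texts.get_parent_map`:

    ```python
    def lookup_in_chunks(get_parent_map, keys, chunk_size=1000):
        """Query an index in fixed-size batches instead of all keys at once.

        get_parent_map is any callable taking a list of keys and returning
        a mapping for the keys it found (a stand-in here for an index
        lookup like texts.get_parent_map).
        """
        found = set()
        for start in range(0, len(keys), chunk_size):
            batch = keys[start:start + chunk_size]
            found.update(get_parent_map(batch))
        return found

    def fake_get_parent_map(keys):
        # Toy index: pretends only even-numbered keys are present.
        return dict((k, ()) for k in keys if k % 2 == 0)

    hits = lookup_in_chunks(fake_get_parent_map, list(range(10)), chunk_size=4)
    ```

    Batching keeps each request small enough that GraphIndex does not degrade,
    while still amortising per-call overhead across many keys.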
    ------------------------------------------------------------
    revno: 3711.3.14
    revision-id: john at arbash-meinel.com-20080919032139-pdcuw9ryqatpqdn9
    parent: john at arbash-meinel.com-20080919013023-31adhm4mt3obrjst
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 22:21:39 -0500
    message:
      Change the per-file log algorithm dramatically.
      
      Now we use the merge_sort information to decide when to show
      another entry. This changes what is actually logged: sometimes a
      branch merges the trunk, and in that case it brings in changes to
      a file which the feature branch did not have before. However, for
      people using 'bzr log file', I believe those are uninteresting
      nodes.
      
      Memory consumption is good (240MB, mostly from text_index._buffer_all),
      speed is good (28s, also mostly in _buffer_all).
      
      Changing the code to request one key at a time saves the
      buffer all, but at the expense of lots of bisects. In some
      cases it is faster and saves memory (23s, 88MB), but on
      bzr.dev it is slower.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.13
    revision-id: john at arbash-meinel.com-20080919013023-31adhm4mt3obrjst
    parent: john at arbash-meinel.com-20080918204343-0h7rbblut820u6i8
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 20:30:23 -0500
    message:
      Shave off another 5s by not building 'node_by_key'
    modified:
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
    ------------------------------------------------------------
    revno: 3711.3.12
    revision-id: john at arbash-meinel.com-20080918204343-0h7rbblut820u6i8
    parent: john at arbash-meinel.com-20080918200531-woiund7811i2kgyy
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 15:43:43 -0500
    message:
      Document the advantage of using a revision_id set, instead of a (file_id, revision_id) one.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.11
    revision-id: john at arbash-meinel.com-20080918200531-woiund7811i2kgyy
    parent: john at arbash-meinel.com-20080918200316-v76fi3cotgu3qfun
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 15:05:31 -0500
    message:
      Clean out the debugging code.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.10
    revision-id: john at arbash-meinel.com-20080918200316-v76fi3cotgu3qfun
    parent: john at arbash-meinel.com-20080918193947-0nbfx6f2yu3zvi3k
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 15:03:16 -0500
    message:
      Document some other alternatives, but in the end, just go with the fast but memory-consuming 'frozenset()'
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.9
    revision-id: john at arbash-meinel.com-20080918193947-0nbfx6f2yu3zvi3k
    parent: john at arbash-meinel.com-20080918193505-d1jihbz38y6ooty3
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 14:39:47 -0500
    message:
      Switch back to using a single dict.
      
      The algorithm is simpler, and though I remember differently, the
      numbers show that it performs identically.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.8
    revision-id: john at arbash-meinel.com-20080918193505-d1jihbz38y6ooty3
    parent: john at arbash-meinel.com-20080918191751-wk23agjzhkiw2wo5
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 14:35:05 -0500
    message:
      It seems the specific layout of dicts doesn't matter.
      Using 2 dicts, 1 dict and a list, 1 dict all had identical memory
      and CPU numbers. The only thing that matters is:
      
      a) We can't use mixed mode with a single dictionary, because
         it gets very confused and performs poorly.
      b) If you always use tuples, you cut memory from 410MB => 272MB, but
         performance jumps 42s => 50s
      c) 'mixed' mode is in between, with 400MB and 43s. However, it is
         pretty much just the 'frozenset' performance.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
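    The memory gap between tuples and sets noted in points b) and c) is easy
    to observe directly. Exact byte counts are CPython implementation details
    and vary by version, but the direction of the comparison holds:

    ```python
    import sys

    elems = ("rev-a", "rev-b", "rev-c")
    as_tuple = tuple(elems)
    as_frozenset = frozenset(elems)

    # A frozenset pre-allocates hash-table slack to keep collisions rare,
    # so it is consistently larger than a tuple of the same references.
    print(sys.getsizeof(as_tuple), sys.getsizeof(as_frozenset))
    ```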
    ------------------------------------------------------------
    revno: 3711.3.7
    revision-id: john at arbash-meinel.com-20080918191751-wk23agjzhkiw2wo5
    parent: john at arbash-meinel.com-20080918190406-r75lzwnmjjth94h6
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 14:17:51 -0500
    message:
      Instead of using a dictionary for cached ancestries,
      use a list. This also lets us use integer offsets in our mapping,
      rather than revision ids.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
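    A sketch of that change (names are hypothetical): give each revision a
    dense integer offset, then index a plain list instead of a dict keyed by
    revision-id strings, so cached ancestries hold small ints rather than ids:

    ```python
    # Revision ids in topological order (toy data).
    rev_ids = ["rev-1", "rev-2", "rev-3"]
    offset_of = dict((rev_id, i) for i, rev_id in enumerate(rev_ids))

    # Per-revision data lives in a list indexed by offset, and the cached
    # ancestry sets themselves store integers instead of id strings.
    ancestry = [None] * len(rev_ids)
    for rev_id in rev_ids:
        ancestry[offset_of[rev_id]] = {offset_of[rev_id]}
    ```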
    ------------------------------------------------------------
    revno: 3711.3.6
    revision-id: john at arbash-meinel.com-20080918190406-r75lzwnmjjth94h6
    parent: john at arbash-meinel.com-20080918182621-g2w4klj590g2zptb
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 14:04:06 -0500
    message:
      Changing back to using a single dict saves memory,
      at the cost of CPU.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.5
    revision-id: john at arbash-meinel.com-20080918182621-g2w4klj590g2zptb
    parent: john at arbash-meinel.com-20080918182100-n1gwuzs3ajy27kdc
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 13:26:21 -0500
    message:
      a bit more debug info
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.4
    revision-id: john at arbash-meinel.com-20080918182100-n1gwuzs3ajy27kdc
    parent: john at arbash-meinel.com-20080918164436-tlhfuf010iz9gf5e
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 13:21:00 -0500
    message:
      Significantly faster, but consuming more memory.
      
      We are back down to 42.6s, but we now consume 409MB.
      The current form keeps the value list separate from the
      revision pointers. That way we don't have to re-do the
      computation when we hit the same nodes repeatedly.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.3
    revision-id: john at arbash-meinel.com-20080918164436-tlhfuf010iz9gf5e
    parent: john at arbash-meinel.com-20080918162150-vi5xcsfwrgaql0it
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 11:44:36 -0500
    message:
      A bit of a Frankenstein now, but it drops memory dramatically.
      
      Now we build up new nodes into a list, and cache them as tuples.
      If we end up computing the frozenset() for a node, we then cache it
      as such.
      Memory consumption is now only 101MB, though processing time is 53s.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.2
    revision-id: john at arbash-meinel.com-20080918162150-vi5xcsfwrgaql0it
    parent: john at arbash-meinel.com-20080918161558-745a1vuxechgcq8h
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 11:21:50 -0500
    message:
      Trade off time versus memory consumption.
      
      [frozen]set() objects always waste a bit of memory, in order
      to avoid hash collisions. So if we cache the ancestry by
      using tuples instead of sets, we can save a bit of memory.
      For mysql, this is 272MB versus 380MB. However, it costs
      CPU time to regenerate all of these sets 56.7s versus 44.2s.
    modified:
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
    ------------------------------------------------------------
    revno: 3711.3.1
    revision-id: john at arbash-meinel.com-20080918161558-745a1vuxechgcq8h
    parent: pqm at pqm.ubuntu.com-20080917230446-p0wippqwikt511sp
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: lighter_log_file
    timestamp: Thu 2008-09-18 11:15:58 -0500
    message:
      Change the per-file ancestry code in log.py
      
      The old code would create a new set() object whenever there
      was more than one parent. However, many of those cases did
      not introduce any new per-file revisions, so the extra
      set was wasted memory.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
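    The allocation-avoiding idea in this first branch revision can be sketched
    as follows (function name and graph are hypothetical, but the trick is the
    one described): when a revision has exactly one parent and does not itself
    touch the file, alias the parent's set instead of building a new one.

    ```python
    def build_per_file_ancestry(sorted_revs, parent_map, modified):
        """Map each revision to the set of ancestors that touched the file.

        sorted_revs must list parents before children; parent_map maps each
        revision to a tuple of parents; modified is the set of revisions
        that changed the file.
        """
        ancestry = {}
        for rev in sorted_revs:
            parents = parent_map[rev]
            if rev not in modified and len(parents) == 1:
                # Nothing new to add: share (alias) the parent's set.
                ancestry[rev] = ancestry[parents[0]]
            else:
                combined = set()
                if rev in modified:
                    combined.add(rev)
                for parent in parents:
                    combined |= ancestry.get(parent, set())
                ancestry[rev] = combined
        return ancestry

    parent_map = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}
    anc = build_per_file_ancestry(["A", "B", "C", "D"], parent_map, {"C"})
    ```

    In this toy graph, B reuses A's set object outright, so long
    single-parent runs cost no extra memory.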
=== modified file 'NEWS'
--- a/NEWS	2008-09-24 06:52:03 +0000
+++ b/NEWS	2008-09-25 01:21:44 +0000
@@ -9,6 +9,12 @@
 
   CHANGES:
 
+    * ``bzr log file`` has been changed. It now uses a different method
+      for determining which revisions to show as merging the changes to
+      the file. It now only shows revisions which merged the change
+      towards your mainline. This simplifies the output, makes it faster,
+      and reduces memory consumption.  (John Arbash Meinel)
+
   FEATURES
 
   IMPROVEMENTS:

=== modified file 'bzrlib/index.py'
--- a/bzrlib/index.py	2008-09-03 22:30:56 +0000
+++ b/bzrlib/index.py	2008-09-21 14:48:37 +0000
@@ -366,7 +366,7 @@
         self._keys_by_offset = {}
         # ready-to-return key:value or key:value, node_ref_lists
         self._nodes = {}
-        self._nodes_by_key = {}
+        self._nodes_by_key = None
         trailers = 0
         pos = stream.tell()
         lines = stream.read().split('\n')
@@ -381,26 +381,30 @@
             else:
                 node_value = value
             self._nodes[key] = node_value
-            if self._key_length > 1:
-                # TODO: We may want to do this lazily, but if we are calling
-                #       _buffer_all, we are likely to be doing
-                #       iter_entries_prefix
-                key_dict = self._nodes_by_key
-                if self.node_ref_lists:
-                    key_value = key, node_value[0], node_value[1]
-                else:
-                    key_value = key, node_value
-                # For a key of (foo, bar, baz) create
-                # _nodes_by_key[foo][bar][baz] = key_value
-                for subkey in key[:-1]:
-                    key_dict = key_dict.setdefault(subkey, {})
-                key_dict[key[-1]] = key_value
         # cache the keys for quick set intersections
         self._keys = set(self._nodes)
         if trailers != 1:
             # there must be one line - the empty trailer line.
             raise errors.BadIndexData(self)
 
+    def _get_nodes_by_key(self):
+        if self._nodes_by_key is None:
+            nodes_by_key = {}
+            if self.node_ref_lists:
+                for key, (value, references) in self._nodes.iteritems():
+                    key_dict = nodes_by_key
+                    for subkey in key[:-1]:
+                        key_dict = key_dict.setdefault(subkey, {})
+                    key_dict[key[-1]] = key, value, references
+            else:
+                for key, value in self._nodes.iteritems():
+                    key_dict = nodes_by_key
+                    for subkey in key[:-1]:
+                        key_dict = key_dict.setdefault(subkey, {})
+                    key_dict[key[-1]] = key, value
+            self._nodes_by_key = nodes_by_key
+        return self._nodes_by_key
+
     def iter_all_entries(self):
         """Iterate over all keys within the index.
 
@@ -593,6 +597,7 @@
                 else:
                     yield self, key, self._nodes[key]
             return
+        nodes_by_key = self._get_nodes_by_key()
         for key in keys:
             # sanity check
             if key[0] is None:
@@ -600,7 +605,7 @@
             if len(key) != self._key_length:
                 raise errors.BadIndexKey(key)
             # find what it refers to:
-            key_dict = self._nodes_by_key
+            key_dict = nodes_by_key
             elements = list(key)
             # find the subdict whose contents should be returned.
             try:
@@ -973,7 +978,7 @@
                 raise errors.BadIndexData(self)
             # keys are tuples. Each element is a string that may occur many
             # times, so we intern them to save space. AB, RC, 200807
-            key = tuple(intern(element) for element in elements[:self._key_length])
+            key = tuple([intern(element) for element in elements[:self._key_length]])
             if first_key is None:
                 first_key = key
             absent, references, value = elements[-3:]

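The nested-dict construction in `_get_nodes_by_key` above (for a key
`(foo, bar, baz)`, build `d[foo][bar][baz]`) can be exercised standalone; this
sketch covers only the no-references case:

```python
def build_nodes_by_key(nodes):
    """Build the nested lookup structure used for prefix queries:
    key (a, b, c) with value v becomes nodes_by_key[a][b][c] = ((a, b, c), v).
    """
    nodes_by_key = {}
    for key, value in nodes.items():
        key_dict = nodes_by_key
        for subkey in key[:-1]:
            # Walk/create one nesting level per key element.
            key_dict = key_dict.setdefault(subkey, {})
        key_dict[key[-1]] = (key, value)
    return nodes_by_key

index = build_nodes_by_key({
    ("foo", "bar", "baz"): "v1",
    ("foo", "bar", "qux"): "v2",
})
```

Deferring this build until a prefix query actually needs it is what saved the
~5s reported in revno 3711.3.13.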
=== modified file 'bzrlib/log.py'
--- a/bzrlib/log.py	2008-09-11 17:13:46 +0000
+++ b/bzrlib/log.py	2008-09-22 20:09:07 +0000
@@ -268,8 +268,8 @@
     if specific_fileid:
         view_revisions = _filter_revisions_touching_file_id(branch,
                                                          specific_fileid,
-                                                         mainline_revs,
-                                                         view_revisions)
+                                                         view_revisions,
+                                                         direction)
 
     # rebase merge_depth - unless there are no revisions or 
     # either the first or last revision have merge_depth = 0.
@@ -536,75 +536,88 @@
     return view_revisions
 
 
-def _filter_revisions_touching_file_id(branch, file_id, mainline_revisions,
-                                       view_revs_iter):
-    """Return the list of revision ids which touch a given file id.
+def _filter_revisions_touching_file_id(branch, file_id, view_revisions,
+                                       direction):
+    r"""Return the list of revision ids which touch a given file id.
 
     The function filters view_revisions and returns a subset.
     This includes the revisions which directly change the file id,
     and the revisions which merge these changes. So if the
     revision graph is::
-        A
-        |\
-        B C
+        A-.
+        |\ \
+        B C E
+        |/ /
+        D |
+        |\|
+        | F
         |/
-        D
-
-    And 'C' changes a file, then both C and D will be returned.
-
-    This will also can be restricted based on a subset of the mainline.
-
+        G
+
+    And 'C' changes a file, then both C and D will be returned. F will not be
+    returned even though it brings the changes to C into the branch starting
+    with E. (Note that if we were using F as the tip instead of G, then we
+    would see C, D, F.)
+
+    This will also be restricted based on a subset of the mainline.
+
+    :param branch: The branch where we can get text revision information.
+    :param file_id: Filter out revisions that do not touch file_id.
+    :param view_revisions: A list of (revision_id, dotted_revno, merge_depth)
+        tuples. This is the list of revisions which will be filtered. It is
+        assumed that view_revisions is in merge_sort order (either forward or
+        reverse).
+    :param direction: The direction of view_revisions.  See also
+        reverse_by_depth, and get_view_revisions
     :return: A list of (revision_id, dotted_revno, merge_depth) tuples.
     """
-    # find all the revisions that change the specific file
-    # build the ancestry of each revision in the graph
-    # - only listing the ancestors that change the specific file.
-    graph = branch.repository.get_graph()
-    # This asks for all mainline revisions, which means we only have to spider
-    # sideways, rather than depth history. That said, its still size-of-history
-    # and should be addressed.
-    # mainline_revisions always includes an extra revision at the beginning, so
-    # don't request it.
-    parent_map = dict(((key, value) for key, value in
-        graph.iter_ancestry(mainline_revisions[1:]) if value is not None))
-    sorted_rev_list = tsort.topo_sort(parent_map.items())
-    text_keys = [(file_id, rev_id) for rev_id in sorted_rev_list]
-    modified_text_versions = branch.repository.texts.get_parent_map(text_keys)
-    ancestry = {}
-    for rev in sorted_rev_list:
-        text_key = (file_id, rev)
-        parents = parent_map[rev]
-        if text_key not in modified_text_versions and len(parents) == 1:
-            # We will not be adding anything new, so just use a reference to
-            # the parent ancestry.
-            rev_ancestry = ancestry[parents[0]]
+    # Lookup all possible text keys to determine which ones actually modified
+    # the file.
+    text_keys = [(file_id, rev_id) for rev_id, revno, depth in view_revisions]
+    # Looking up keys in batches of 1000 can cut the time in half, as well as
+    # memory consumption. GraphIndex *does* like to look for a few keys in
+    # parallel, it just doesn't like looking for *lots* of keys in parallel.
+    # TODO: This code needs to be re-evaluated periodically as we tune the
+    #       indexing layer. We might consider passing in hints as to the known
+    #       access pattern (sparse/clustered, high success rate/low success
+    #       rate). This particular access is clustered with a low success rate.
+    get_parent_map = branch.repository.texts.get_parent_map
+    modified_text_revisions = set()
+    chunk_size = 1000
+    for start in xrange(0, len(text_keys), chunk_size):
+        next_keys = text_keys[start:start + chunk_size]
+        # Only keep the revision_id portion of the key
+        modified_text_revisions.update(
+            [k[1] for k in get_parent_map(next_keys)])
+    del text_keys, next_keys
+
+    result = []
+    if direction == 'forward':
+        # TODO: The algorithm for finding 'merges' of file changes expects
+        #       'reverse' order (the default from 'merge_sort()'). Instead of
+        #       forcing this, we could just use the reverse_by_depth order.
+        view_revisions = reverse_by_depth(view_revisions)
+    # Track what revisions will merge the current revision, replace entries
+    # with 'None' when they have been added to result
+    current_merge_stack = [None]
+    for info in view_revisions:
+        rev_id, revno, depth = info
+        if depth == len(current_merge_stack):
+            current_merge_stack.append(info)
         else:
-            rev_ancestry = set()
-            if text_key in modified_text_versions:
-                rev_ancestry.add(rev)
-            for parent in parents:
-                if parent not in ancestry:
-                    # parent is a Ghost, which won't be present in
-                    # sorted_rev_list, but we may access it later, so create an
-                    # empty node for it
-                    ancestry[parent] = set()
-                rev_ancestry = rev_ancestry.union(ancestry[parent])
-        ancestry[rev] = rev_ancestry
-
-    def is_merging_rev(r):
-        parents = parent_map[r]
-        if len(parents) > 1:
-            leftparent = parents[0]
-            for rightparent in parents[1:]:
-                if not ancestry[leftparent].issuperset(
-                        ancestry[rightparent]):
-                    return True
-        return False
-
-    # filter from the view the revisions that did not change or merge 
-    # the specific file
-    return [(r, n, d) for r, n, d in view_revs_iter
-            if (file_id, r) in modified_text_versions or is_merging_rev(r)]
+            del current_merge_stack[depth + 1:]
+            current_merge_stack[-1] = info
+
+        if rev_id in modified_text_revisions:
+            # This needs to be logged, along with the extra revisions
+            for idx in xrange(len(current_merge_stack)):
+                node = current_merge_stack[idx]
+                if node is not None:
+                    result.append(node)
+                    current_merge_stack[idx] = None
+    if direction == 'forward':
+        result = reverse_by_depth(result)
+    return result
 
 
 def get_view_revisions(mainline_revs, rev_nos, branch, direction,

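The merge-stack filter added to log.py above can be run on toy data (revision
ids and dotted revnos here are illustrative). Input is a list of
(rev_id, dotted_revno, merge_depth) tuples in reverse merge_sort order; the
stack tracks, per depth, the revision that would merge the current one toward
the mainline:

```python
def filter_touching(view_revisions, modified_text_revisions):
    """Keep revisions that changed the file, plus the revisions that merged
    those changes toward the mainline (mirrors the patched
    _filter_revisions_touching_file_id, minus the direction handling)."""
    result = []
    # Entries are replaced with None once emitted, so each merging
    # revision is reported at most once.
    current_merge_stack = [None]
    for info in view_revisions:
        rev_id, revno, depth = info
        if depth == len(current_merge_stack):
            current_merge_stack.append(info)
        else:
            del current_merge_stack[depth + 1:]
            current_merge_stack[-1] = info
        if rev_id in modified_text_revisions:
            for idx, node in enumerate(current_merge_stack):
                if node is not None:
                    result.append(node)
                    current_merge_stack[idx] = None
    return result

# Tip D merged branch revision C, which changed the file; B did not touch it.
view = [("D", "3", 0), ("C", "1.1.1", 1), ("B", "2", 0), ("A", "1", 0)]
out = filter_touching(view, {"C"})
```

As the docstring promises, both C (the change) and D (the merge of that
change) survive, while B and A are dropped.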
=== modified file 'bzrlib/tests/test_log.py'
--- a/bzrlib/tests/test_log.py	2008-08-21 04:12:02 +0000
+++ b/bzrlib/tests/test_log.py	2008-09-22 20:09:07 +0000
@@ -986,10 +986,10 @@
         view_revs_iter = log.get_view_revisions(mainline, revnos, tree.branch,
                                                 'reverse', True)
         actual_revs = log._filter_revisions_touching_file_id(
-                            tree.branch, 
+                            tree.branch,
                             file_id,
-                            mainline,
-                            list(view_revs_iter))
+                            list(view_revs_iter),
+                            'reverse')
         self.assertEqual(revisions, [r for r, revno, depth in actual_revs])
 
     def test_file_id_f1(self):



