[MERGE] performance enhancements for log -v

Sun Jun 25 18:06:50 BST 2006

Aaron Bentley wrote:
> Hi all,
> 
> I've been working on improving performance for log -v, and I'm happy to
> say I've succeeded.  It's gone from 2013 ms to 770 ms, i.e. it's 2.6
> times faster.  Normal log operations are also faster;

Massively nice performance improvement, I'll try to give it a good review.


(Same thing with your fixes to 'log'. I think it does what I want, but I
want some time to play with both to get a feel for if the information is
understandable in the new forms, and which is actually more understandable.)

> 
> To accomplish this, I've refactored the implementation of
> KnitVersionedFile.  This actually brought back memories of reworking the
> Arch builder, since both systems build up revisions using stored
> full-texts, cached texts and deltas.

:)

> 
> The main change was to retrieve the the inventories in batches, but
> after I had that, I kept working on eliminating list construction, and
> wound up knocking out a lot of the private methods.  The way a group of
> texts is generated now is:

Are you retrieving inventories, or the complete set of involved texts in
batches?

> 
> 1. A dict called component_data is produced that holds describes data
> about how to build all component versions.  This component_data doesn't
> have the actual texts, but it knows where they're located in the knit


I assume this is just in a per-knit sense, right? So you tell the knit
'give me these revisions', and the knit builds up a map of what it needs
to get, where it is, and then goes and starts building up your request,
caching any intermediates as it goes.

> 
> 2. All the contents for all component versions are retrieved at once
> using the component_data positions.  This allows us to use readv
> efficiently.  A dict called record_map is produced based on
> component_data, replacing position data with contents and digest.
> 
> 3. Each requested revision is built.  Intermediate KnitContent objects
> are held in a dict, and reused for building subsequent revisions, where
> possible.
> 
> There's also a dict of *final* KnitContent objects, because texts
> lacking final eols are handled very late in the process.  KnitComponents
> without final eols cannot be reused to build later revisions from.

This seems like something that should be fixed. Not necessarily in this
pass, but any time we have the statement "X (which should be used)
cannot be used" seems like something that needs fixing.

> 
> Additionally, since texts are generated for SHA1 verification, these are
> kept in a dict and returned.  Most operations will use this map.
> 

This sounds a lot like we are caching all of the contents that we
extract from a given knit. Does this clash with your "don't cache unless
we need to", and when is this text map cleaned out.

It would seem to me that this map should only exist for the length of
the actual top-level call (it shouldn't be a state member of the knit,
it should be passed around between the functions that use it, and then
garbage collected when you return the texts). At least from what you
seemed to see with kernel-sized trees taking up too much ram.


> I don't think we can improve log performance much more without first
> addressing knit record parsing.  According to lsprof, if get_texts takes
> 1.41 seconds, 0.85 of that is spent parsing records.
> 
> BEFORE CHANGES
> ...nchmarks.bench_log.LogBenchmark.test_cmd_log       OK    53ms/ 9447ms
> ...s.bench_log.LogBenchmark.test_cmd_log_subprocess   OK   598ms/10540ms
> ...nchmarks.bench_log.LogBenchmark.test_log           OK   410ms/10365ms
> ...hmarks.bench_log.LogBenchmark.test_log_screenful   OK    48ms/ 9148ms
> ...s.bench_log.LogBenchmark.test_log_screenful_line   OK    36ms/10362ms
> ....bench_log.LogBenchmark.test_log_screenful_short   OK    21ms/ 9096ms
> ...nchmarks.bench_log.LogBenchmark.test_log_verbose   OK  2013ms/11003ms
> ...nchmarks.bench_log.LogBenchmark.test_merge_log     OK   111ms/24973ms
> 
> 
> 
> AFTER CHANGES
> ...nchmarks.bench_log.LogBenchmark.test_cmd_log       OK    47ms/ 9215ms
> ......ench_log.LogBenchmark.test_cmd_log_subprocess   OK   322ms/ 9445ms
> ...nchmarks.bench_log.LogBenchmark.test_log           OK   403ms/ 9795ms
> ...hmarks.bench_log.LogBenchmark.test_log_screenful   OK    49ms/ 9185ms
> ...s.bench_log.LogBenchmark.test_log_screenful_line   OK    33ms/ 9458ms
> ....bench_log.LogBenchmark.test_log_screenful_short   OK    22ms/ 9232ms
> ...nchmarks.bench_log.LogBenchmark.test_log_verbose   OK   770ms/10521ms
> ...nchmarks.bench_log.LogBenchmark.test_merge_log     OK   110ms/24873ms
> 
> Aaron



...

 class _KnitFactory(object):
     """Base factory for creating content objects."""
@@ -442,8 +445,7 @@

     def get_sha1(self, version_id):
         """See VersionedFile.get_sha1()."""
-        components = self._get_components(version_id)
-        return components[-1][-1][-1]
+        return self._get_record_map([version_id])[version_id][2]

^- This doesn't seem to be very readable. I realize the old one wasn't
either, but maybe we could do:

map = self._get_record_map([version_id])
record_info = map[version_id]
return record_info[2] # sha is the 2nd index in the info map


     @staticmethod
     def get_suffixes():
@@ -512,86 +514,34 @@
             diff_hunks.append((op[1], op[2], op[4]-op[3],
new_content._lines[op[3]:op[4]]))
         return diff_hunks

-    def _get_component_versions(self, version_id):
-        basis = self.basis_knit
-        needed_versions = []
-        basis_versions = []
-        cursor = version_id
-
-        while 1:
-            picked_knit = self
-            if basis and basis._index.has_version(cursor):
-                picked_knit = basis
-                basis_versions.append(cursor)
-            method = picked_knit._index.get_method(cursor)
-            needed_versions.append((method, cursor))
-            if method == 'fulltext':
-                break
-            cursor = picked_knit.get_parents(cursor)[0]
-        return needed_versions, basis_versions
-
-    def _get_component_positions(self, version_id):
-        needed_versions, basis_versions = \
-            self._get_component_versions(version_id)
-        assert len(basis_versions) == 0
-        positions = []
-        for method, comp_id in needed_versions:
-            data_pos, data_size = self._index.get_position(comp_id)
-            positions.append((method, comp_id, data_pos, data_size))
-        return positions
-
-    def _get_components(self, version_id):
-        """Return a list of (version_id, method, data) tuples that
-        makes up version specified by version_id of the knit.
-
-        The components should be applied in the order of the returned
-        list.
-
-        The basis knit will be used to the largest extent possible
-        since it is assumed that accesses to it is faster.
+    def _get_components_positions(self, version_ids):
+        """Produce a map of position data for the components of versions.
+
+        This data is indended to be used for retrieving the knit records.
+
+        A dict of version_id to (method, data_pos, data_size, next) is
+        returned.
+        method is the way referenced data should be applied.
+        data_pos is the position of the data in the knit.
+        data_size is the size of the data in the knit.
+        next is the build-parent of the version, or None for fulltexts.
         """
-        #profile notes:
-        # 4168 calls in 14912, 2289 internal
-        # 4168 in 9711 to read_records
-        # 52554 in 1250 to get_parents
-        # 170166 in 865 to list.append
-
-        # needed_revisions holds a list of (method, version_id) of
-        # versions that is needed to be fetched to construct the final
-        # version of the file.
-        #
-        # basis_revisions is a list of versions that needs to be
-        # fetched but exists in the basis knit.
-
-        needed_versions, basis_versions = \
-            self._get_component_versions(version_id)
-
-        components = {}
-        if basis_versions:
-            assert False, "I am broken"
-            basis = self.basis_knit
-            records = []
-            for comp_id in basis_versions:
-                data_pos, data_size =
basis._index.get_data_position(comp_id)
-                records.append((comp_id, data_pos, data_size))
-            components.update(basis._data.read_records(records))
-
-        records = []
-        for comp_id in [vid for method, vid in needed_versions
-                        if vid not in basis_versions]:
-            data_pos, data_size = self._index.get_position(comp_id)
-            records.append((comp_id, data_pos, data_size))
-        components.update(self._data.read_records(records))
-
-        # get_data_records returns a mapping with the version id as
-        # index and the value as data.  The order the components need
-        # to be applied is held by needed_versions (reversed).
-        out = []
-        for method, comp_id in reversed(needed_versions):
-            out.append((comp_id, method, components[comp_id]))
-
-        return out
-
+        component_data = {}
+        for version_id in version_ids:
+            cursor = version_id
+
+            while cursor is not None and cursor not in component_data:
+                picked_knit = self
+                method = picked_knit._index.get_method(cursor)
+                if method == 'fulltext':
+                    next = None
+                else:
+                    next = picked_knit.get_parents(cursor)[0]
+                data_pos, data_size = self._index.get_position(cursor)
+                component_data[cursor] = (method, data_pos, data_size,
next)
+                cursor = next
+        return component_data
+


Do I understand correctly that you basically folded up the 2 requests
into a single function that now takes an array of versions, rather than
just one? You seem to still be carrying around the 'basis-knit' cruft,
even though you never reference the basis knit that I can tell.
(Otherwise why are you using 'picked knit').
In general, I think I like this function, and I like that it takes an
array. I'm hoping we don't need to deprecate the old functions, just
because they were marked private. But if we do, couldn't they just thunk
into this?


     def _get_content(self, version_id, parent_texts={}):
         """Returns a content object that makes up the specified
         version."""
@@ -605,26 +555,7 @@
         if self.basis_knit and version_id in self.basis_knit:
             return self.basis_knit._get_content(version_id)

-        content = None
-        components = self._get_components(version_id)
-        for component_id, method, (data, digest) in components:
-            version_idx = self._index.lookup(component_id)
-            if method == 'fulltext':
-                assert content is None
-                content = self.factory.parse_fulltext(data, version_idx)
-            elif method == 'line-delta':
-                delta = self.factory.parse_line_delta(data, version_idx)
-                content._lines = self._apply_delta(content._lines, delta)
-
-        if 'no-eol' in self._index.get_options(version_id):
-            line = content._lines[-1][1].rstrip('\n')
-            content._lines[-1] = (content._lines[-1][0], line)
-
-        # digest here is the digest from the last applied component.
-        if sha_strings(content.text()) != digest:
-            raise KnitCorrupt(self.filename, 'sha-1 does not match %s'
% version_id)
-
-        return content
+        return self._get_content_maps([version_id])[1][version_id]

I must not understand the map that you are returning. Because I would
have sworn that this should have been:
return self._get_content_maps([version_id])[version_id][1]

If I'm right, then I think you should apply the same fix for
content = self._get_content_map([version_id])[version_id]
return content[1]
Or something to make it clearer what _get_content_map is returning.

It might also be better to have a convenience function for returning the
content for a single entry, since we seem to want to use that in several
places, and I think you've shown it is fairly easy to get it wrong. (or
at least hard to read when you have get_foo([a])[b][c]



     def _check_versions_present(self, version_ids):
         """Check that all specified versions are present."""
@@ -743,21 +674,26 @@
         """See VersionedFile.get_lines()."""
         return self.get_line_list([version_id])[0]

-    def _get_version_components(self, position_map):
-        records = []
-        for version_id, positions in position_map.iteritems():
-            for method, comp_id, position, size in positions:
-                records.append((comp_id, position, size))
-        record_map = self._data.read_records(records)
-
-        component_map = {}
-        for version_id, positions in position_map.iteritems():
-            components = []
-            for method, comp_id, position, size in positions:
-                data, digest = record_map[comp_id]
-                components.append((comp_id, method, data, digest))
-            component_map[version_id] = components
-        return component_map
+    def _get_record_map(self, version_ids):
+        """Produce a dictionary of knit records.
+
+        The keys are version_ids, the values are tuples of (method,
content,
+        digest, next).
+        method is the way the content should be applied.
+        content is a KnitContent object.
+        digest is the SHA1 digest of this version id after all steps
are done
+        next is the build-parent of the version, i.e. the leftmost
ancestor.
+        If the method is fulltext, next will be None.
+        """
+        position_map = self._get_components_positions(version_ids)
+        records = [(c, p, s) for c, (m, p, s, n) in
position_map.iteritems()]
+        record_map = {}
+        for component_id, content, digest in\
+            self._data.read_records_iter(records):
+            method, position, size, next = position_map[component_id]
+            record_map[component_id] = method, content, digest, next
+
+        return record_map

Either use longer variable names, or at least put a comment right above
it, as to what the different variables mean:
# c: content, p: position, s: size/length, m:???, n:???
Or is s sha? (Hence the problem with single letter variables).
I'm okay with short ones inside the list comprehension, since it makes
it nice and succinct, but we need to clarify what the objects are.


     def get_text(self, version_id):
         """See VersionedFile.get_text"""
@@ -768,38 +704,65 @@

     def get_line_list(self, version_ids):
         """Return the texts of listed versions as a list of strings."""
-        position_map = {}
+        text_map, content_map = self._get_content_maps(version_ids)
+        return [text_map[v] for v in version_ids]
+
+    def _get_content_maps(self, version_ids):
+        """Produce maps of text and KnitContents
+
+        :return: (text_map, content_map) where text_map contains the
texts for
+        the requested versions and content_map contains the KnitContents.
+        Both dicts are take version_ids as their keys.
+        """

^-- Grammer bug 'dicts are take', 'dicts take'?

         for version_id in version_ids:
             if not self.has_version(version_id):
                 raise RevisionNotPresent(version_id, self.filename)
-            position_map[version_id] = \
-                self._get_component_positions(version_id)
-
-        version_components =
self._get_version_components(position_map).items()
+        record_map = self._get_record_map(version_ids)

         text_map = {}
-        for version_id, components in version_components:
+        content_map = {}
+        final_content = {}
+        for version_id in version_ids:
+            components = []
+            cursor = version_id
+            while cursor is not None:
+                method, data, digest, next = record_map[cursor]
+                components.append((cursor, method, data, digest))
+                if cursor in content_map:
+                    break
+                cursor = next
+


^- looks good


             content = None
             for component_id, method, data, digest in reversed(components):
-                version_idx = self._index.lookup(component_id)
-                if method == 'fulltext':
-                    assert content is None
-                    content = self.factory.parse_fulltext(data,
version_idx)
-                elif method == 'line-delta':
-                    delta = self.factory.parse_line_delta(data,
version_idx)
-                    content._lines = self._apply_delta(content._lines,
delta)
+                if component_id in content_map:
+                    content = content_map[component_id]
+                else:
+                    version_idx = self._index.lookup(component_id)
+                    if method == 'fulltext':
+                        assert content is None
+                        content = self.factory.parse_fulltext(data,
version_idx)
+                    elif method == 'line-delta':
+                        delta = self.factory.parse_line_delta(data[:],
+                                                              version_idx)
+                        content = content.copy()
+                        content._lines = self._apply_delta(content._lines,
+                                                           delta)
+                    content_map[component_id] = content

             if 'no-eol' in self._index.get_options(version_id):
+                content = content.copy()
                 line = content._lines[-1][1].rstrip('\n')
                 content._lines[-1] = (content._lines[-1][0], line)
+            final_content[version_id] = content

             # digest here is the digest from the last applied component.
-            if sha_strings(content.text()) != digest:
+            text = content.text()
+            if sha_strings(text) != digest:
                 raise KnitCorrupt(self.filename,
                                   'sha-1 does not match %s' % version_id)

-            text_map[version_id] = content.text()
-        return [text_map[v] for v in version_ids]
+            text_map[version_id] = text
+        return text_map, final_content

^- Looking at the above lines, it would seem that knit deltas are
actually built up assuming that every line has a final newline, and then
they strip the last newline. Is that correct?
Since the knits are actually 'line-deltas' rather than being actual raw
deltas, maybe this is reasonable. I don't really like keeping 2 copies
of records, but as long as it is the same reference for everything other
than entries without eol, its probably okay.



     def iter_lines_added_or_present_in_versions(self, version_ids=None):
         """See VersionedFile.iter_lines_added_or_present_in_versions()."""
@@ -1444,13 +1407,15 @@
         # 4168   calls to readv              in 1411
         # 4168   calls to parse_record       in 2880

-        needed_records = []
+        needed_records = set()
         for version_id, pos, size in records:
             if version_id not in self._records:
-                needed_records.append((version_id, pos, size))
+                needed_records.add((version_id, pos, size))
+
+        # turn our set into a list, sorted by file position
+        needed_records = sorted(needed_records, key=operator.itemgetter(1))

^- This is where you figure out what you are going to need to 'readv'
right? Why are you using a set() instead of a list?, I do believe
list.append() is faster than 'set.add()'. Though I guess you may be
adding the same version_id more than once (depending on what is in
'records' and '_records'.

I am also wondering if it would be better to add them as:
needed_records.add((pos, size, version_id)) and then just let them sort
naturally.
(I wouldn't mind sorting on pos, size rather than just pos. Though we
shouldn't have them overlapping).


         if len(needed_records):
-            needed_records.sort(key=operator.itemgetter(1))
             # We take it that the transport optimizes the fetching as good
             # as possible (ie, reads continuous ranges.)
             response = self._transport.readv(self._filename,


=== modified file 'bzrlib/log.py'

--- bzrlib/log.py	2006-06-23 15:27:25 +0000
+++ bzrlib/log.py	2006-06-25 04:57:00 +0000
@@ -54,10 +54,8 @@
 from itertools import izip
 import re

-from bzrlib.delta import compare_trees
 import bzrlib.errors as errors
 from bzrlib.trace import mutter
-from bzrlib.tree import EmptyTree
 from bzrlib.tsort import merge_sort


@@ -115,17 +113,6 @@
     return rh


-def _get_revision_delta(branch, revno):
-    """Return the delta for a mainline revision.
-
-    This is used to show summaries in verbose logs, and also for finding
-    revisions which touch a given file."""
-    # XXX: What are we supposed to do when showing a summary for something
-    # other than a mainline revision.  The delta to it's first parent, or
-    # (more useful) the delta to a nominated other revision.
-    return branch.get_revision_delta(revno)
-
-
 def show_log(branch,
              lf,
              specific_fileid=None,
@@ -225,19 +212,24 @@

     def iter_revisions():
         revision_ids = [r for r, n, d in view_revisions]
+        zeros = set(r for r, n, d in view_revisions if d == 0)
         num = 9
+        cur_deltas = {}

^- again, either avoid single letters, or define them in a comment, it
doesn't really matter when there are only one [r for r in revisions],
but it is more confusing when there are multiple variables


         while revision_ids:
             revisions = branch.repository.get_revisions(revision_ids[:num])
+            if verbose or specific_fileid:
+                delta_revisions = [r for r in revisions if
+                                   r.revision_id in zeros]
+                deltas =
branch.repository.get_revision_deltas(delta_revisions)
+                cur_deltas = dict(zip((d.revision_id for d in
+                                       delta_revisions), deltas))

^- here 'd.revision_id' seems the wrong variable name, since
'delta_revisions' is actually a list of 'Revision' objects. Better to
use r.revision_id for r in delta_revisions. Or even better 'for rev in
delta_revisions'. I was trying to figure out how you were iterating over
the dictionary.


             for revision in revisions:
-                yield revision
+                yield revision, cur_deltas.get(revision.revision_id)
             revision_ids  = revision_ids[num:]
             num = int(num * 1.5)

^- Can you put a comment that cur_deltas.get() is returning None in the
case that we are not Verbose? Either that, or change it to a if/else
block and use 'yeild, None' in the second block, but I think I just
prefer the comment.


-        revisions = branch.repository.get_revisions()
-        for revision in revisions:
-            yield revision
     # now we just print all the revisions
-    for ((rev_id, revno, merge_depth), rev) in \
+    for ((rev_id, revno, merge_depth), (rev, delta)) in \
          izip(view_revisions, iter_revisions()):

         if searchRE:
@@ -246,8 +238,6 @@

         if merge_depth == 0:
             # a mainline revision.
-            if verbose or specific_fileid:
-                delta = _get_revision_delta(branch, rev_nos[rev_id])

             if specific_fileid:
                 if not delta.touches_file_id(specific_fileid):

=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py	2006-06-20 05:32:16 +0000
+++ bzrlib/repository.py	2006-06-25 04:57:00 +0000
@@ -331,19 +331,36 @@
         self._check_revision_parents(r, inv)
         return r

+    @needs_read_lock
+    def get_revision_deltas(self, revisions):
+        """Produce a generator of revision deltas.
+
+        Note that the input is a sequence of REVISIONS, not revision_ids.
+        Trees will be held in memory until the generator exits.
+        Each delta is relative to the revision's lefthand predecessor.
+        """
+        required_trees = set()
+        for revision in revisions:
+            required_trees.add(revision.revision_id)
+            required_trees.update(revision.parent_ids[:1])
+        trees = dict((t.get_revision_id(), t) for
+                     t in self.revision_trees(required_trees))
+        for revision in revisions:
+            if not revision.parent_ids:
+                old_tree = EmptyTree()
+            else:
+                old_tree = trees[revision.parent_ids[0]]
+            yield delta.compare_trees(old_tree,
trees[revision.revision_id])

^-- Can you make a comment that :param revisions: A list of Revision
objects. I would have otherwise thought that they were revision ids.
I'm also concerned that 'get_revision_delta' takes a revision_id, and
'get_revision_deltas()' takes Revision objects.
Maybe 'get_deltas_for_revisions()' is better, to make it clear that the
api difference is more than just taking a list versus a single item.

+
+    @needs_read_lock
     def get_revision_delta(self, revision_id):
         """Return the delta for one revision.

         The delta is relative to the left-hand predecessor of the
         revision.
         """
-        revision = self.get_revision(revision_id)
-        new_tree = self.revision_tree(revision_id)
-        if not revision.parent_ids:
-            old_tree = EmptyTree()
-        else:
-            old_tree = self.revision_tree(revision.parent_ids[0])
-        return delta.compare_trees(old_tree, new_tree)
+        r = self.get_revision(revision_id)
+        return list(self.get_revision_deltas([r]))[0]

     def _check_revision_parents(self, revision, inventory):
         """Private to Repository and Fetch.
@@ -550,6 +567,18 @@
             return RevisionTree(self, inv, revision_id)

     @needs_read_lock
+    def revision_trees(self, revision_ids):
+        """Return Tree for a revision on this branch.
+
+        `revision_id` may not be None or 'null:'"""
+        assert None not in revision_ids
+        assert NULL_REVISION not in revision_ids
+        texts = self.get_inventory_weave().get_texts(revision_ids)
+        for text, revision_id in zip(texts, revision_ids):
+            inv = self.deserialise_inventory(revision_id, text)
+            yield RevisionTree(self, inv, revision_id)
+
+    @needs_read_lock
     def get_ancestry(self, revision_id):
         """Return a list of revision-ids integrated by a revision.


^- Ultimately, we should think about changing our data storage, so that
rather than deserializing the entire inventory, just to run
'compare_trees' on the result, we should be able to come up with some
sort of Delta object, that can be run directly on the stored
information. I'm not sure how the api would work, and it may involve a
storage format change. But I think from an algorithmic level, it cuts
out *lots* of operations of extracting all the old inventories, just to
build them up again.
And since our data storage right now is actually just a delta against
the leftmost ancestor, it seems like it wouldn't require a data format
change.
I suppose it depends how much we can trust that the delta in the knit is
correct. But if it is correct, we could just add a 'get_deltas' to
VersionedFile (which would handle the case when the text is a full-text
versus a previous ancestor). And then instead of calling 'compare_trees'
which has to iterate the full tree inventory, we could have
bzrlib.delta.delta_from_delta(), which could figure out the
added/new/removed/renamed just from the delta hunks.

This doesn't block what you've done, but I see it as a *massive*
improvement for kernel-sized trees, since it changes O(num_files) to
O(num_changes). And for all I know, the compare_trees is actually
O(num_files^2), I think it is at least O(n log(n)). Or at least O(2n),
since it has to iterate over both the previous and post inventories,
rather than just looking at the delta that we've already computed.



So, I think you've done some great work, and with a couple cleanups you
have a +1. And I think as a group we should think about how we can get
rid of reading the entire inventory, to build up a RevisionTree, just to
call compare_trees() and generate a delta against something we already
have a delta.

However, I think introducing 'get_revision_deltas()' (whatever it ends
up being called), gives us a location to do the optimization (even if it
requires new interfaces on VersionedFile).
So I'm very +1 on your changes.

John
=:->


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060625/77433b17/attachment.pgp