[MERGE] faster log - hooray!

Wed Jan 28 15:43:26 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> bzr's log performance has long been a sore point, particularly
> on large projects. The attached patch changes the internal
> architecture of log so that it is:
> 
> * better at displaying incremental results sooner
> * faster at showing a single revision
> * faster at filtering by a revision range.
> 
> Benchmark results are attached for the emacs-merges trunk.
> See my earlier emails on this topic for an explanation of the
> various benchmarks run. I'll run some more benchmarks tonight
> on a MySQL branch and an OpenOffice.org branch to confirm the
> benchmark results are more broadly applicable than just for Emacs.
> 

I like the results overall. I feel like there are parts that probably
add a bit too much complexity for the actual gain, and I think it could
still use some code cleanup passes. But ultimately, I'd rather have it
land and get real-world use, than wait much longer.

BB:tweak

The only test that seems to have regressed is this one:

LogEarly:merges-file                          20.7      24.3

Do you have any idea why it is slower? Especially that all other tests
are faster. (It may be a blip, it may be something)

Also, is this with or without your revno-cache plugin?

- -        if revisionspec_list[1].get_branch() != revisionspec_list[0
- -                ].get_branch():
+        # Lazy imports prevent exact class matching so re-import here
+        from bzrlib.revisionspec import RevisionInfo, RevisionSpec
+        start_spec = revisionspec_list[0]
+        end_spec = revisionspec_list[1]
+        if end_spec.get_branch() != start_spec.get_branch():

^- Could you explain this a bit better? And if it really is a problem,
*please* remove a lazy_import of a class and change it to a module, and
access them via:

from bzrlib import revisionspec
revisionspec.RevisionInfo

+        rev1 = start_spec.in_history(branch)
+        # Avoid loading all of history when we know a missing
+        # end of range means the last revision ...
+        if end_spec.__class__ == RevisionSpec:
+            last_revno, last_revision_id = branch.last_revision_info()
+            rev2 = RevisionInfo(branch, last_revno, last_revision_id)
+        else:
+            rev2 = end_spec.in_history(branch)

^- This really doesn't seem like the right way to go about it. I'm
guessing you aren't using isinstance() because you need to know that it
is not a subclass, bu the original RevisionSpec object.

More importantly, though, why not change whatever RevisionSpec code so
that it returns branch.last_revision_info() properly?

Otherwise, the correct test is actually:

if end_spec.spec is None:
  last_revno, last_revision_id = ...

I guess you can't just have RevisionSpec itself know, because it doesn't
know whether it is the start end of the range, or the tail end of the
range. However, you *can* do "end_spec.spec is None" rather than
inspecting __class__. (Which probably makes the 'lazy_import' stuff
moot, as you won't be looking at classes directly anymore.)

...

+def _create_log_revision_iterator(branch, start_revision, end_revision,
+    direction, specific_fileid, search, generate_merge_revisions,
+    allow_single_merge_revision, generate_delta, limited_output=False):

I'm a bit concerned about this function, as it seems to be adding a
*lot* of special casing. This isn't saying it isn't necessary, it just
is a flag of "are we sure we want to do this?"

+    use_deltas_for_matching = specific_fileid and (
+            generate_delta or start_rev_id or end_rev_id)
+    delayed_graph_generation = not specific_fileid and (
+            start_rev_id or end_rev_id or limited_output)
+    generate_merges = generate_merge_revisions or (specific_fileid and
+        not use_deltas_for_matching)
+    view_revisions = _calc_view_revisions(branch, start_rev_id, end_rev_id,
+        direction, generate_merges, allow_single_merge_revision,
+        delayed_graph_generation=delayed_graph_generation)
+    search_deltas_for_fileids = None
+    if use_deltas_for_matching:
+        search_deltas_for_fileids = set([specific_fileid])

^- I'm pretty sure that specific_fileid can be None, and if so, this
seems like it should be:

if use_deltas_for_matching:
  if specific_fileid is None:
    search_deltas_for_fileids = set()
  else:
    search_deltas_for_fileids = set([specific_fileid])

I realize that we won't find a fileid that matches 'None', but I'm
thinking it is better to search for the empty set.

There is enough depth of complexity here, that it seems like we would
really like to have a lot more helper functions, and there are enough of
them, that it seems like it would make sense to create a class
definition, and turn them into 'self.XXX'.

For example in:
+def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
+    generate_merge_revisions, allow_single_merge_revision,
+    delayed_graph_generation=False):

You have inner blocks like:

+    generate_single_revision = (end_rev_id and start_rev_id ==
end_rev_id and
+        (not generate_merge_revisions or not _has_merges(branch,
end_rev_id)))
+    if generate_single_revision:
+        if end_rev_id == br_rev_id:
+            # It's the tip
+            return [(br_rev_id, br_revno, 0)]
+        else:
+            revno = branch.revision_id_to_dotted_revno(end_rev_id)
+            if len(revno) > 1 and not allow_single_merge_revision:
+                # It's a merge revision and the log formatter is
+                # completely brain dead. This "feature" of allowing
+                # log formatters incapable of displaying dotted revnos
+                # ought to be deprecated IMNSHO. IGC 20091022
+                raise errors.BzrCommandError('Selected log formatter only'
+                    ' supports mainline revisions.')
+            revno_str = '.'.join(str(n) for n in revno)
+            return [(end_rev_id, revno_str, 0)]

^- IMO, the stuff that is inside the 'if generate_single_revision' would
be great inside another function to make _calc_view_revisions a little
bit less difficult to get your head around.

That said, I'd rather have this land and get the benefit, than have to
wait for it to be polished to perfection. This is mostly just a
suggestion of what we should probably do to get this code a bit more
manageable.

...

+def _has_merges(branch, rev_id):
+    """Does a revision have multiple parents or not?"""
+    return len(branch.repository.get_revision(rev_id).parent_ids) > 1
+
+

^- This, IMO, is a bad way of doing it. As you have to process the XML
for a revision, which you then throw away. Is there a reason you
couldn't do:

parents = branch.repository.get_parent_map([rev_id]).get(rev_id, [])
return len(parents) > 1

That only has to go to the index layer, rather than hitting the index
layer to find the revision text, and then extracting and parsing the
XML, etc.

Especially for "delayed_graph_generation", it seems useful.

+    if not generate_merge_revisions:
+        result = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+        # If a start limit was given and it's not obviously an
+        # ancestor of the end limit, check it before outputting anything
+        if start_rev_id and not (_is_obvious_ancestor(branch, start_rev_id,
+            end_rev_id)):
+            try:
+                result = list(result)
+            except _StartNotLinearAncestor:
+                raise errors.BzrCommandError('Start revision not found in'
+                    ' left-hand history of end revision.')
+        if direction == 'forward':
+            result = reversed(list(result))
+        return result

^- I don't see the big win of using "_is_obvious_ancestor" here. It
seems like it has to do all the work of walking big graphs (given that
you end up calling things like "revision_id_to_dotted_revno" inside
_is_obvious_ancestor.)

I believe you are trying to iterate more, rather than always calling
list(), but I think if you have called _is_obvious_ancestor() you've
done all the necessary iterating that you would have done just calling
list(result).

Have you tried removing this check and seeing if it really impacts
performance? (All things being equal, reducing complexity is good.)

Also, the fact that _is_obvious_ancestor seems to have a side-effect of
computing merge_sorted and dotted revnos, *without* an obvious way of
getting just a subset. Which means that when we are able to do just a
subset, it seems like we might have a hard time working around the
_is_obvious_ancestor work. I could be wrong, though.

Also, for *me* I always use --forward, which means that the whole check
is moot. So I'd like to see the check updated to:

+        if (direction == 'forward'
	     or (start_rev_id
                 and not (_is_obvious_ancestor(branch, start_rev_id,
+                                              end_rev_id))):
+            try:
+                result = list(result)
+            except _StartNotLinearAncestor:
+                raise errors.BzrCommandError('Start revision not found in'
+                    ' left-hand history of end revision.')
+        if direction == 'forward':
+            result = reversed(result)
+        return result

Do you have a Usertest for my personal "use it many times every day"
version:

  bzr alias log='log --short --forward -r-10..-1'

Being selfish, that is the command I care the most about being fast. I
guess revision_id_to_dotted_revnos() uses the mainline check first, and
thus all of the code paths it has to take turn out to be fairly fast. It
doesn't really seem the right way to go about it, IMO.

...

+    else:
+        # We're following a development line starting at a merged revision.
+        # We need to adjust depths down by the initial depth until we find
+        # a depth less than it. Then we use that depth as the adjustment.
+        # If and when we reach the mainline, depth adjustment ends.
+        depth_adjustment = None
+        for (rev_id, merge_depth, revno, end_of_merge
+             ) in view_revisions:
+            if depth_adjustment is None:
+                depth_adjustment = merge_depth
+            if depth_adjustment:
+                if merge_depth < depth_adjustment:
+                    depth_adjustment = merge_depth
+                merge_depth -= depth_adjustment
+            yield rev_id, '.'.join(map(str, revno)), merge_depth

^- I think this actually needs to be a 2-pass arrangement. Where you
compute the overall depth_adjustment, and then adjust everything by that
value. Otherwise I think if you did:

bzr log --long -r 1..1.2.3

It would start logging 1.2.3 with a depth_adjustment of N, but then
finish logging mainline with a depth-adjustment of 0. Or was that the
specific goal? (logging it as though 1.2.3 was the mainline, but using
the numbering from the current branch tip?)

I thought the original depth_adjustment work was just to handle that
doing "bzr log -r 1.2.1..1.2.3" ended up indented everywhere, which
seemed silly.

...

+def calculate_view_revisions(branch, start_revision, end_revision,
direction,
+        specific_fileid, generate_merge_revisions,
allow_single_merge_revision):
+    """Calculate the revisions to view.
+
+    :return: An iterator of (revision_id, dotted_revno, merge_depth)
tuples OR
+             a list of the same tuples.
+    """
+    # This method is no longer called by the main code path.
+    # It is retained for API compatibility and may be deprecated
+    # soon. IGC 20090116

^- If we aren't using it anymore, I'd rather deprecate it now rather
than waiting.

...

     def test_merges_nonsupporting_formatter(self):
+        # This "feature" of log formatters is madness. If a log
+        # formatter cannot display a dotted-revno, it ought to ignore it.
+        # Otherwise, a linear sequence is always expected to be handled
now.
+        raise KnownFailure('log formatters must support linear
sequences now')

^- I don't understand this change.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmAfR4ACgkQJdeBCYSNAAOyPgCeMhnwzROPdGrV2+VQB/QBfm2q
LkMAn1/vCJbY/5wAB75UGVOvBxAjOucu
=EO0M
-----END PGP SIGNATURE-----