[MERGE] faster log - hooray!

Wed Feb 4 12:33:54 GMT 2009

John Arbash Meinel wrote:
> Ian Clatworthy wrote:
>> bzr's log performance has long been a sore point, particularly
>> on large projects. The attached patch changes the internal
>> architecture of log so that ...

> I like the results overall. I feel like there are parts that probably
> add a bit too much complexity for the actual gain, and I think it could
> still use some code cleanup passes. But ultimately, I'd rather have it
> land and get real-world use, than wait much longer.

I fully agree.

> The only test that seems to have regressed is this one:
> 
> LogEarly:merges-file                          20.7      24.3
> 
> Do you have any idea why it is slower? Especially that all other tests
> are faster. (It may be a blip, it may be something)

It's slower because the delta-matching algorithm (in this patch)
was simplistic and could easily be slower than per-file graph,
even for small ranges. I backed out the rule for switching it on
so that it was *only* used if we were generating a delta anyway.
(After that, some numbers changed up and others down, but the landed
patch was faster for every test on Emacs and looked better overall
on MySQL and OOo.)

My pending patch for loging multiple files and directories includes
a much faster implementation (~ 4x better) for delta matching, but
is still conservative about when it gets switched on.

> Also, is this with or without your revno-cache plugin?

*Without*. The revnocache plugin takes large amounts off most of
the remaining high numbers because, with this patch landed and
the revgraphcache strategy used by the plugin, the fully sorted
merge history is cached in a file.

> -        if revisionspec_list[1].get_branch() != revisionspec_list[0
> -                ].get_branch():
> +        # Lazy imports prevent exact class matching so re-import here
> +        from bzrlib.revisionspec import RevisionInfo, RevisionSpec
> +        start_spec = revisionspec_list[0]
> +        end_spec = revisionspec_list[1]
> +        if end_spec.get_branch() != start_spec.get_branch():
> 
> ^- Could you explain this a bit better? And if it really is a problem,
> *please* remove a lazy_import of a class and change it to a module, and
> access them via:
> 
> from bzrlib import revisionspec
> revisionspec.RevisionInfo

I didn't dig deep when the class matching failed. I simply jumped to
the conclusion that the wrapper/lazy-delegate class was getting in
the way, made the change to explicit importing and it all worked. :-(

> +        rev1 = start_spec.in_history(branch)
> +        # Avoid loading all of history when we know a missing
> +        # end of range means the last revision ...
> +        if end_spec.__class__ == RevisionSpec:
> +            last_revno, last_revision_id = branch.last_revision_info()
> +            rev2 = RevisionInfo(branch, last_revno, last_revision_id)
> +        else:
> +            rev2 = end_spec.in_history(branch)
> 
> ^- This really doesn't seem like the right way to go about it. I'm
> guessing you aren't using isinstance() because you need to know that it
> is not a subclass, bu the original RevisionSpec object.

Right.

> More importantly, though, why not change whatever RevisionSpec code so
> that it returns branch.last_revision_info() properly?
> 
> Otherwise, the correct test is actually:
> 
> if end_spec.spec is None:
>   last_revno, last_revision_id = ...
> 
> I guess you can't just have RevisionSpec itself know, because it doesn't
> know whether it is the start end of the range, or the tail end of the
> range. However, you *can* do "end_spec.spec is None" rather than
> inspecting __class__. (Which probably makes the 'lazy_import' stuff
> moot, as you won't be looking at classes directly anymore.)

I tweaked this as suggested and the import issue disappeared just like you
said. Thanks! Hmmm - my usertest results looked fine but I noticed that
log -r-100..-1 is still quicker than log -r-100.. so I need to dig deeper
to understand why.

> +def _create_log_revision_iterator(branch, start_revision, end_revision,
> +    direction, specific_fileid, search, generate_merge_revisions,
> +    allow_single_merge_revision, generate_delta, limited_output=False):
> 
> I'm a bit concerned about this function, as it seems to be adding a
> *lot* of special casing. This isn't saying it isn't necessary, it just
> is a flag of "are we sure we want to do this?"

In it's current form, I hope the shelf life of this is a week or two max.

> There is enough depth of complexity here, that it seems like we would
> really like to have a lot more helper functions, and there are enough of
> them, that it seems like it would make sense to create a class
> definition, and turn them into 'self.XXX'.

Agreed. My next patch makes quite a bit of progress in that direction.

> For example in:
> +def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
> +    generate_merge_revisions, allow_single_merge_revision,
> +    delayed_graph_generation=False):
> 
> You have inner blocks like:
> 
> +    generate_single_revision = (end_rev_id and start_rev_id ==
> end_rev_id and
> +        (not generate_merge_revisions or not _has_merges(branch,
> end_rev_id)))
> +    if generate_single_revision:
> +        if end_rev_id == br_rev_id:
> +            # It's the tip
> +            return [(br_rev_id, br_revno, 0)]
> +        else:
> +            revno = branch.revision_id_to_dotted_revno(end_rev_id)
> +            if len(revno) > 1 and not allow_single_merge_revision:
> +                # It's a merge revision and the log formatter is
> +                # completely brain dead. This "feature" of allowing
> +                # log formatters incapable of displaying dotted revnos
> +                # ought to be deprecated IMNSHO. IGC 20091022
> +                raise errors.BzrCommandError('Selected log formatter only'
> +                    ' supports mainline revisions.')
> +            revno_str = '.'.join(str(n) for n in revno)
> +            return [(end_rev_id, revno_str, 0)]
> 
> 
> ^- IMO, the stuff that is inside the 'if generate_single_revision' would
> be great inside another function to make _calc_view_revisions a little
> bit less difficult to get your head around.

Fixed in the pending logging multiple files/directories patch. See
https://code.edge.launchpad.net/~ian-clatworthy/bzr/log-dir for
progress if you're interested.

> +def _has_merges(branch, rev_id):
> +    """Does a revision have multiple parents or not?"""
> +    return len(branch.repository.get_revision(rev_id).parent_ids) > 1
> +
> +
> 
> ^- This, IMO, is a bad way of doing it. As you have to process the XML
> for a revision, which you then throw away. Is there a reason you
> couldn't do:
> 
> parents = branch.repository.get_parent_map([rev_id]).get(rev_id, [])
> return len(parents) > 1
> 
> That only has to go to the index layer, rather than hitting the index
> layer to find the revision text, and then extracting and parsing the
> XML, etc.
> 
> Especially for "delayed_graph_generation", it seems useful.

Ah. Thanks for that insight. Fixed in the landed code.

> +    if not generate_merge_revisions:
> +        result = _linear_view_revisions(branch, start_rev_id, end_rev_id)
> +        # If a start limit was given and it's not obviously an
> +        # ancestor of the end limit, check it before outputting anything
> +        if start_rev_id and not (_is_obvious_ancestor(branch, start_rev_id,
> +            end_rev_id)):
> +            try:
> +                result = list(result)
> +            except _StartNotLinearAncestor:
> +                raise errors.BzrCommandError('Start revision not found in'
> +                    ' left-hand history of end revision.')
> +        if direction == 'forward':
> +            result = reversed(list(result))
> +        return result
> 
> ^- I don't see the big win of using "_is_obvious_ancestor" here. It
> seems like it has to do all the work of walking big graphs (given that
> you end up calling things like "revision_id_to_dotted_revno" inside
> _is_obvious_ancestor.)
> 
> I believe you are trying to iterate more, rather than always calling
> list(), but I think if you have called _is_obvious_ancestor() you've
> done all the necessary iterating that you would have done just calling
> list(result).
> 
> Have you tried removing this check and seeing if it really impacts
> performance? (All things being equal, reducing complexity is good.)
> 
> Also, the fact that _is_obvious_ancestor seems to have a side-effect of
> computing merge_sorted and dotted revnos, *without* an obvious way of
> getting just a subset. Which means that when we are able to do just a
> subset, it seems like we might have a hard time working around the
> _is_obvious_ancestor work. I could be wrong, though.