[MERGE] annotation policies

Tue Jun 17 23:32:20 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
| John Arbash Meinel wrote:
|
|> This is just an extension of the earlier patch, which now implements a
|> couple
|> more types, which might show what I was thinking about earlier.
|
| Here's a partial review. Only tweaks and comments so far. In general
| though, the code is a pleasure to read. Thanks in particular for taking
| the time to clean up and clarify the method parameter names as you've
| refactored the code from one module to another.
|
|> @@ -173,174 +172,7 @@
|>          If None, then any ancestry disputes will be resolved with
|>          new_revision_id
|>      """
|
| Hmm. Is this docstring comment re None still correct now?

It now depends on the policy. Ultimately, I think I want to deprecate this
function, as the callers should be using a policy.

|
|> +class AnnotationPolicy(object):
|> +    """This class represents a specific method for performing annotations.
|> +
|> +    Mostly, this class determines how to resolve ambiguous lines. (Lines which
|> +    both sides claim different last-modified revisions.)
|> +    """
|> +
|> +    def __init__(self, heads_provider):
|> +        """Create a new AnnotationPolicy
|

...

|
| I must say that I'm not sold on the AnnotationPolicy class name. The
| class itself is arguably an "Annotator" that feels like it ought to
| be parameterised with a policy class. In other words, my gut feel from
| the code I've read so far is that the 'policy' bit is really a tiny bit,
| not the top level thing. Having said that, it looks like the no-merge
| policy benefits from your subclassing approach and I'm not sure that
| it's necessarily worth refactoring things more than you have. So I'm
| OK if you'd really prefer to keep things as you've got them but if you're
| sitting on the fence at all re this issue, consider my grumble enough to
| push you onto one side of it. :-)

There are a couple more levels that need to be worked out for it to be a
complete Annotator. Specifically, there is some conflict with Knits, in that you
want to be able to extract the parent texts in a "optimal" order off disk, and
grab the known matching blocks at the same time.

'no-merge' is, indeed, the big change which would prefer to have more control
over what happens during annotate. In fact, it would like to change what
full-texts are extracted as well. As it only uses left-hand parents, you don't
need to extract any of the other parents.

So... I can get rid of the name "policy", and have a standard one which takes a
small policy class (only needs a function for handling ambiguous lines). But I
did want the flexibility to do stuff like 'no-merge'.

Further, I've considered having a flag to change the line-matching algorithm.
Specifically, you could do stuff where you ignore whitespace changes, or you
could use a (-whitespace, patiencediff, +whitespace), which does interesting
things when you indent code.

|
|> +        :param heads_provider: An object which provids a .heads() call to
|> +            resolve if any revision ids are children of others.
|> +            Should also provide a 'cache(keys, head)' function.
|> +            Can be None.
|
| On the top line, s/provids/provides/.
| I wonder if it's worth be more explicit about what a value of None means?
|

I'm tempted to remove the "can be None". The idea was that not all policies
provide use a heads() so why should callers be required to pass one. And some
locations may not have it easily available.

|> +        else:
|> +            return self._do_annotate_many_parents(
|> +                annotated_parents_lines, left_matching_blocks,
|> +                plain_child_lines, child_revision_id)
|> +        return lines
|
| You don't need this final line - all parts of the if/elif/else use
| return.
|

Thanks, they used to all assign to lines and return at the end.

|> +    def _process_unmatched_lines(self, output_lines, plain_child_lines,
|> +                                annotated_child_lines, start_child, end_child,
|> +                                annotated_right_lines, start_right, end_right,
|> +                                revision_id):
|> +        """Find lines in plain right_lines that match plain_child_lines
|
| Again, need a trailing '.'. on the docstring top line.
|
|> +                left = annotated_child_lines[start_child+child_idx+offset]
|> +                right = annotated_right_lines[start_right+right_idx+offset]
|
| Spaces around the +'s please. I realise this is simply moved code but let's
| clean it up while we're at it.
|
|> +        # TODO: Would it be cleaner to just pass in the revision ids, rather
|> +        #       than (revision_id, text)?
|
| As the code currently stands, I think your approach is fine. *If* you do go with
| separate policy classes that aren't subclasses, then I'd pass just the
| revision-ids in and out, not tuples. I guess unpacking and repacking the tuple
| *might* have a performance impact but I wonder if that would be minor in the
| overall scheme of things? From the performance figures you posted, it looks
| like the policy choice dominates so I'd lean towards making the policy
| subclasses as clean as possible myself. Of course, if your benchmarking
| indicates otherwise, ignore that suggestion. :-)
|

I can really think of ones that would get this far, and actually care about the
text. Specifically, there is no context at this point aside from a single line.
So I don't see how you could write a policy that would choose based on the
content of that single line. (Since it *matches*).

Policy dominates in the long-term. The more times you change a lines annotation,
the more times you have to resolve that (with a heads check, etc.) I've tried
hard to structure it so that *unchanged* lines get processed quickly, as that is
always the place that will be the bottleneck. (In a 1000 line file, maybe you
change 10 lines each time, or 1% of the lines are interesting.) The original
code used to do a flat iteration over all lines, which seems fast because it can
go linearly. Except it has to process the 99% of lines that didn't change.

I'll go ahead with the clearer api which just passes revision ids.

|> +            # Both claim different origins
|> +            # We know that revision_id is the head for
|> +            # left and right, so cache it
|> +            self._heads_provider.cache(
|> +                (revision_id, left[0]),
|> +                (revision_id,))
|> +            self._heads_provider.cache(
|> +                (revision_id, right[0]),
|> +                (revision_id,))
|> +            return (revision_id, left[1])
|
| I'm hoping this will become clearer to me as I grok the code more but why
| do you only update the heads_provider cache for this policy and not others?

The heads_provider we have will automatically cache anything that is requested
of it. So if you ask for "heads(A, B)" it will cache that. This policy is the
only one that is creating a new head, by merging the two existing heads.

Specifically if you have:

~   A B
~   |/
~   C

'right' will always return B, 'left' will always return A. 'merge_node' will
return C. Future calls should actually resolve this case rather quickly
(heads(A, parent_of_A) should always be fast).

I'm actually thinking to remove this cache, as it removes a requirement on the
heads_provider. It provided a fairly minor gain anyway.

|
|> +class NoMergeAnnotationPolicy(annotation_policy.AnnotationPolicy):
|> +    """Assign all lines based on the primary parent"""
|> +
|> +    def _do_annotate_two_parents(self, first_parent_lines,
|> +                                 first_matching_blocks,
|> +                                 second_parent_lines,
|> +                                 plain_child_lines,
|> +                                 child_revision_id):
|> +        """Annotate versus two parents,
|
| s/,/./ on the last line.
|
|> === modified file 'bzrlib/tests/test_annotate.py'
|> --- bzrlib/tests/test_annotate.py	2008-02-18 22:19:41 +0000
|> +++ bzrlib/tests/test_annotate.py	2008-06-11 18:53:36 +0000
|> @@ -23,6 +23,8 @@
|>      annotate,
|>      conflicts,
|>      errors,
|> +    graph,
|> +    revision,
|
| These last 2 new imports aren't needed as best I can tell.
|

nope. The were before I moved to a different test file, so I could more easily
adapt the tests.

|> @@ -522,3 +414,5 @@
|>          blocks = [(0, 1, 1), (1, 2, 0)]
|>          self.annotateEqual([('rev2', 'a\n'), ('rev1', 'a\n')], [parent],
|>                             new_text, 'rev2', blocks)
|> +
|> +
|
| These new trailing newlines aren't a problem but don't add anything either.
|
| I still need to review test_annotation_policy.py and read through your
| emails on the trade-offs of the various algorithms before commenting on
| my preferred default. I'm hoping to do that tomorrow.
|
| Ian C.
|

The problem I haven't really worked out is that of the permutations.
Specifically, you have an algorithm for resolving ambiguous lines, you have an
algorithm for resolving the bulk of the lines. You have different storage
architectures that have different abilities. Some may prefer extracting texts in
a given order, others already have some matching_blocks that they can give you,
etc. You have the possibility of different line-matching algorithms, which may
or may not match what he store can give you, etc.
At some point, I think we also would like to add caching if it is possible to do
in a reasonable way. How does caching work with different parameters you can use
to generate the values?

If we have a cache, would we get rid of policies, and just try to find the
"best" algorithm? (In my mind, a special whitespace patience diff, which can
mark lines with multiple possible sources.)

Some more interesting data points...

policy		annotate	annotate --show-ids
merge-node	10.278s		6.797s
right-head	10.131s		5.972s
simple-right	10.120s		3.966s
no-merge	10.016s		2.570s
no-merge2	 9.941s		0.974s

Without --show-ids we end up calling revision_id_to_revno_map, which has to
extract the whole revision graph, etc. And I currently have 19 active packs. So
the pack index performance is extra slow.

'no-merge2' is changing the code that extracts full texts to understand that we
don't want merged nodes.

So interestingly, not extracting the non-mainline texts has a very small effect
when we are getting the revision_id map (.1s). But when we are using --show-ids,
it is 1.6s different.

Which is mostly just weird to me, but I'll live with it.

Interesting that 'right-head' matches 'no-merge' timing when we aren't using
- --show-ids. It certainly makes me feel like there is something about loading the
revision graph going on here.

Looking at lsprof, the time to "get_build_graph" actually dominates the time to
annotate the records. For '--policy=no-merge --show-ids':

~   1 1.9482      0.0008   bzrlib.builtins:3524(run)
~  +1 1.4533      0.0170   +bzrlib.annotate:39(annotate_file)
~  +1 0.3216      0.0000   +<<string>>:1(revision_tree_read_locked)

~   1 1.4533      0.0170   bzrlib.annotate:39(annotate_file)
...
~  +1 0.4807      0.0383   +bzrlib.knit:3255(_annotate_records)
~  +1 0.8981      0.0181   +bzrlib.knit:3175(_get_build_graph)
...
+580 0.1551      0.0213   +bzrlib.knit:3146(_add_annotation)
+581 0.2413      0.0060   +bzrlib.knit:2803(read_records_iter)
+564 0.0311      0.0037   +bzrlib.knit:427(parse_record)
...
+580 0.1099      0.0042   +bzrlib.annotation_policy:51(reannotate)

So of 2s clock time, we only spend 100ms actually checking 580 different
fulltexts to see how they line up. Of the 580 texts, we have to 'diff' 16 times
(presumably where we have full texts rather than deltas.)

I really don't understand why '--no-show-ids' is so much slower, and why it
basically makes all of the algorithms take the same amount of time. I could
understand if it was a static "10s" slower for each.

Then again, it makes sense if all the time is spent in 'heads()' calls. If we
have "--show-ids" then the full revision graph isn't loaded. So heads() calls
now have to delve into ancestry. Which requires reading more data from the
indexes. Though if that were strictly true, I would expect to see more of a
difference between 'right-heads' and 'simple-right', since the latter doesn't do
any heads calls. Maybe that is traded off by having more conflicting lines,
because we don't get absolute resolution. (When two sides disagree, you have to
keep resolving the ambiguity.)

I'm also a little surprised that in 2000 commits by the PQM, we have only 580
commits that modify NEWS. I would have thought it would be higher.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhYO3QACgkQJdeBCYSNAAP2jACgrUt9JvyyNEB3ofRfyH0bPLjl
JscAn3H1zdfxB8cij0eaue0o46UAumTB
=OiJW
-----END PGP SIGNATURE-----