[MERGE] Branch.iter_merge_sorted_revisions

Thu Jan 22 22:10:57 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> I now think this patch is ready to land. This version
> simplifies the semantics of 'forward' to be appropriate
> to this API layer, i.e. forward is simply the opposite
> to reverse, not some fancy UI-specific ordering that may
> change over time.
> 
> Vincent has raised the issue of 'reverse' vs 'forward'
> and would 'backward' vs 'forward' be better direction
> names? I agree with his thinking but I've left it as
> 'reverse' to be consistent with other API names, e.g.
> iter_reverse_revision_history. Sometimes consistency
> is more important than strict semantics. :-)

I think 'reverse' is better than 'backward', the bigger question (IMO)
is whether it should be 'forward'. :)

> 
> There is also the issue of whether a last-revision
> parameter ought to be supported or not. That's clearly
> important when sorting an arbitrary graph but I think
> it's unnecessary in the context of a branch method.
> If we do add it, my feeling it that it ought to mean
> "sort the full graph then skip to it before reporting
> results" as I've explained in the comments. Either way,
> a don't think that parameter is required in this
> initial version of the API.

I think it is strictly necessary to avoid having an api that is always
O(all_history). I would generally always make it a "revision_id" because
that is how 95% of the internals work. In fact, I would consider it
should actually be given a start and an end revision id
(inclusive/exclusive?)

I suppose if it is an iterator, in theory you can stop iterating when
you've reached the stop_revision that you care about. Though I think an
implementation would certainly care about knowing you are going to stop
after 10 steps, rather than 2000.

My concern with merging it is always the deprecation/incompatibility
dances. Not to mention helping push people towards efficient apis,
rather than starting with something that won't scale well.

> 
> Once this API is in place, we can avoid calculating the
> full graph multiple times as can happen quite
> easily now (e.g. once to build the revid->revno map
> and again in the core logic of commands like log and
> missing). It will also allow plugins like revnocache
> to provide a cached version of the revision graph,
> greatly speeding up log (and other commands) on large
> projects.
> 
> Ian C.
> 

I agree that having a place to cache the result of 'merge_sort' is
useful. At one point I believe I profiled through trying to remove the
double calls, but as you noticed, passing the graph in RevisionSpec*
into the log.py code isn't really straightforward. Going through Branch
seems reasonable to me.

- -        revision_id_to_revno = dict((rev_id, revno)
- -                                    for seq_num, rev_id, depth, revno,
end_of_merge
- -                                     in merge_sorted_revisions)
- -        return revision_id_to_revno
+        if direction == 'reverse':
+            return iter(merge_sorted_revisions)
+        if direction == 'forward':
+            return iter(reversed(merge_sorted_revisions))
+        else:
+            raise ValueError('invalid direction %r' % direction)

^- 'reversed' already returns an iterator, so I don't think you gain
anything with 'iter(reversed())'

(then again listreverseiterator.__iter__() just returns self, so it
doesn't really cost much, either)

I'll also note that you aren't caching the result of
"iter_merge_sorted_revisions()", which means that multiple calls will
run merge_sort multiple times.

You *are* still caching 'revision_id_to_revno_map', because that is what
it did.

Speaking from experience now, almost all code that would call
get_revision_id_to_revno_map maintains the result for as long as we
would need it anyway.

So my actual recommendation would be to cache the result of
iter_merge_sorted_revisions and *stop* caching revision_id_to_revno_map.

Even better would be to just have get_revision_id_to_revno_map return a
proxy around the existing dictionary. Something like:

class _MergeSortedToRevno(object):

  def __init__(self, merge_sorted):
    self._merge_sorted = merge_sorted
    # alternatively
    #   dict((rev_id, (seq_num, depth, revno, eom))
    #        for ... in merge_sorted)

  def __getitem__(self, key):
    return self._merge_sorted[key][2]

The cache was never meant to support __setitem__ anyway.

So my final comments:

1) I think we need to be thinking sooner rather than later about
handling partial ancestries. For example, my lazy_revno branch is really
close to being able to answer iter_merge_sorted_revisions() without
having to do the whole graph.

2) As written now, this doesn't actually win us anything, because it
doesn't actually cache the merge_sorted results.

3) I would like to see us not include 'seq_num' as part of the generated
output. (Explicitly stripping it.) I don't think any code actually makes
use of it, and it is ill-defined when we are evaluating only partial
history.

4) I think the api is ok (aside from being able to supply a subset to
return).

BB:resubmit

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkl47vEACgkQJdeBCYSNAANFRwCeIhAzmCYDsRk+O9qezgOUQ04V
LBIAnR9j1btesu20+qQlH6p3OormMlDP
=gP2c
-----END PGP SIGNATURE-----