[MERGE] Branch.iter_merge_sorted_revisions
John Arbash Meinel
john at arbash-meinel.com
Thu Jan 22 22:10:57 GMT 2009
Ian Clatworthy wrote:
> I now think this patch is ready to land. This version
> simplifies the semantics of 'forward' to be appropriate
> to this API layer, i.e. forward is simply the opposite
> to reverse, not some fancy UI-specific ordering that may
> change over time.
>
> Vincent has raised the issue of 'reverse' vs 'forward'
> and would 'backward' vs 'forward' be better direction
> names? I agree with his thinking but I've left it as
> 'reverse' to be consistent with other API names, e.g.
> iter_reverse_revision_history. Sometimes consistency
> is more important than strict semantics. :-)
I think 'reverse' is better than 'backward', the bigger question (IMO)
is whether it should be 'forward'. :)
>
> There is also the issue of whether a last-revision
> parameter ought to be supported or not. That's clearly
> important when sorting an arbitrary graph but I think
> it's unnecessary in the context of a branch method.
> If we do add it, my feeling is that it ought to mean
> "sort the full graph then skip to it before reporting
> results" as I've explained in the comments. Either way,
> I don't think that parameter is required in this
> initial version of the API.
I think it is strictly necessary to avoid having an API that is always
O(all_history). I would generally make it a "revision_id" because
that is how 95% of the internals work. In fact, I think it should
actually be given a start and an end revision id
(inclusive/exclusive?).
I suppose if it is an iterator, in theory you can stop iterating when
you've reached the stop_revision that you care about. Though I think an
implementation would certainly care about knowing you are going to stop
after 10 steps, rather than 2000.
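To make that concrete, here is a hypothetical sketch (the names are illustrative, not the real bzrlib API) of why passing the stop point into the API matters: when the implementation knows where to stop, it can avoid producing the rest of the merge-sorted history, rather than the caller merely abandoning the iterator partway through.

```python
# Hypothetical helper: yield merge-sorted entries, stopping (inclusively)
# once stop_revision_id has been yielded.
def iter_until(merge_sorted, stop_revision_id=None):
    """merge_sorted is a list of (seq_num, rev_id, depth, revno,
    end_of_merge) tuples, as produced by merge_sort."""
    for entry in merge_sorted:
        yield entry
        if stop_revision_id is not None and entry[1] == stop_revision_id:
            return

# With the stop point known up front, only two entries are produced here,
# no matter how long the rest of the history is:
history = [(1, 'rev-c', 0, (3,), False),
           (2, 'rev-b', 0, (2,), False),
           (3, 'rev-a', 0, (1,), True)]
wanted = list(iter_until(history, stop_revision_id='rev-b'))
```

A consumer can get the same truncation by breaking out of the loop itself, but by then the implementation has already paid for merge-sorting everything.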
My concern with merging it is always the deprecation/incompatibility
dances. Not to mention helping push people towards efficient APIs,
rather than starting with something that won't scale well.
>
> Once this API is in place, we can avoid calculating the
> full graph multiple times as can happen quite
> easily now (e.g. once to build the revid->revno map
> and again in the core logic of commands like log and
> missing). It will also allow plugins like revnocache
> to provide a cached version of the revision graph,
> greatly speeding up log (and other commands) on large
> projects.
>
> Ian C.
>
I agree that having a place to cache the result of 'merge_sort' is
useful. At one point I believe I profiled through trying to remove the
double calls, but as you noticed, passing the graph in RevisionSpec*
into the log.py code isn't really straightforward. Going through Branch
seems reasonable to me.
-        revision_id_to_revno = dict((rev_id, revno)
-            for seq_num, rev_id, depth, revno, end_of_merge
-            in merge_sorted_revisions)
-        return revision_id_to_revno
+        if direction == 'reverse':
+            return iter(merge_sorted_revisions)
+        if direction == 'forward':
+            return iter(reversed(merge_sorted_revisions))
+        else:
+            raise ValueError('invalid direction %r' % direction)
^- 'reversed' already returns an iterator, so I don't think you gain
anything with 'iter(reversed())'
(then again listreverseiterator.__iter__() just returns self, so it
doesn't really cost much, either)
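A quick demonstration of that point, in case it helps: `reversed()` on a list already returns an iterator, so wrapping it in `iter()` is a no-op (the reverse iterator's `__iter__` simply returns itself).

```python
# reversed() already satisfies the iterator protocol; iter() just
# returns the same object back.
revs = ['rev-a', 'rev-b', 'rev-c']
r = reversed(revs)
assert iter(r) is r                          # no new object created
assert list(r) == ['rev-c', 'rev-b', 'rev-a']
```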
I'll also note that you aren't caching the result of
"iter_merge_sorted_revisions()", which means that multiple calls will
run merge_sort multiple times.
You *are* still caching 'revision_id_to_revno_map', because that is what
the old code did.
Speaking from experience now, almost all code that would call
get_revision_id_to_revno_map maintains the result for as long as we
would need it anyway.
So my actual recommendation would be to cache the result of
iter_merge_sorted_revisions and *stop* caching revision_id_to_revno_map.
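A hypothetical sketch of that recommendation (class and method names are illustrative; `_do_merge_sort` stands in for the real merge_sort over the revision graph): cache the merge_sort result once on the branch object and serve both directions from that single cache.

```python
class CachingBranch(object):
    """Illustrative branch-like object that caches merge_sort output."""

    def __init__(self):
        self._merge_sorted_revisions_cache = None

    def _do_merge_sort(self):
        # Placeholder for the real merge_sort; returns
        # (seq_num, rev_id, depth, revno, end_of_merge) tuples.
        return [(1, 'rev-c', 0, (3,), False),
                (2, 'rev-b', 0, (2,), False),
                (3, 'rev-a', 0, (1,), True)]

    def iter_merge_sorted_revisions(self, direction='reverse'):
        # merge_sort runs at most once per branch object.
        if self._merge_sorted_revisions_cache is None:
            self._merge_sorted_revisions_cache = self._do_merge_sort()
        if direction == 'reverse':
            return iter(self._merge_sorted_revisions_cache)
        elif direction == 'forward':
            return reversed(self._merge_sorted_revisions_cache)
        raise ValueError('invalid direction %r' % (direction,))
```

With this shape, repeated calls (from log, missing, revno lookups, etc.) all share one sorted list instead of re-sorting the graph each time.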
Even better would be to just have get_revision_id_to_revno_map return a
proxy around the existing dictionary. Something like:
class _MergeSortedToRevno(object):

    def __init__(self, merge_sorted):
        self._merge_sorted = merge_sorted
        # alternatively:
        # dict((rev_id, (seq_num, depth, revno, eom))
        #      for ... in merge_sorted)

    def __getitem__(self, key):
        return self._merge_sorted[key][2]
The cache was never meant to support __setitem__ anyway.
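For concreteness, here is a runnable version of that proxy, filling in the "alternative" layout from the comment: a dict mapping rev_id -> (seq_num, depth, revno, end_of_merge), so index 2 picks out the revno. The sample data is made up for illustration.

```python
# Illustrative cached structure: rev_id -> (seq_num, depth, revno,
# end_of_merge).
merge_sorted = {
    'rev-a': (3, 0, (1,), True),
    'rev-b': (2, 0, (2,), False),
    'rev-c': (1, 0, (3,), False),
}

class _MergeSortedToRevno(object):

    def __init__(self, merge_sorted):
        self._merge_sorted = merge_sorted

    def __getitem__(self, key):
        # Read-only lookup; deliberately no __setitem__, since the old
        # cache never needed one.
        return self._merge_sorted[key][2]

revno_map = _MergeSortedToRevno(merge_sorted)
```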
So my final comments:
1) I think we need to be thinking sooner rather than later about
handling partial ancestries. For example, my lazy_revno branch is really
close to being able to answer iter_merge_sorted_revisions() without
having to do the whole graph.
2) As written now, this doesn't actually win us anything, because it
doesn't actually cache the merge_sorted results.
3) I would like to see us not include 'seq_num' as part of the generated
output. (Explicitly stripping it.) I don't think any code actually makes
use of it, and it is ill-defined when we are evaluating only partial
history.
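A hypothetical sketch of stripping seq_num, as suggested in point 3: drop it before yielding, so the public tuples are (rev_id, depth, revno, end_of_merge), which remain well-defined even when only part of the history has been merge-sorted.

```python
# Drop seq_num from merge_sort output before exposing it to callers.
def strip_seq_num(merge_sorted):
    for seq_num, rev_id, depth, revno, end_of_merge in merge_sorted:
        yield rev_id, depth, revno, end_of_merge

sample = [(1, 'rev-c', 0, (3,), False),
          (2, 'rev-b', 0, (2,), False)]
stripped = list(strip_seq_num(sample))
```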
4) I think the API is OK (aside from being able to supply a subset to
return).
BB:resubmit
John
=:->