[MERGE] Branch.iter_merge_sorted_revisions

Ian Clatworthy ian.clatworthy at internode.on.net
Fri Jan 23 20:02:38 GMT 2009


John Arbash Meinel wrote:
> Ian Clatworthy wrote:

>> There is also the issue of whether a last-revision
>> parameter ought to be supported or not. That's clearly
>> important when sorting an arbitrary graph but I think
>> it's unnecessary in the context of a branch method.
>> If we do add it, my feeling is that it ought to mean
>> "sort the full graph then skip to it before reporting
>> results" as I've explained in the comments. Either way,
>> I don't think that parameter is required in this
>> initial version of the API.
> 
> I think it is strictly necessary to avoid having an api that is always
> O(all_history). I would generally always make it a "revision_id" because
> that is how 95% of the internals work. In fact, I would consider it
> should actually be given a start and an end revision id
> (inclusive/exclusive?)

Yeah, that makes sense. The moment 'direction' is supported as
a parameter (and it is now), an efficient implementation needs
to know the two endpoints. My preference is to make the limits
inclusive.
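
To make that concrete, something along these lines is roughly the
shape I have in mind. It's a sketch only: the parameter names, the
inclusive semantics and the tuple layout are my assumptions here, not
settled API.

def iter_merge_sorted_revisions(self, start_revision_id=None,
                                stop_revision_id=None,
                                direction='reverse'):
    """Walk the merge-sorted revisions between two inclusive endpoints.

    :param start_revision_id: revision to start at (inclusive), or None
        for the branch tip.
    :param stop_revision_id: revision to stop after (inclusive), or None
        for the start of history.
    :param direction: 'reverse' (newest first) or 'forward'.
    :return: an iterator of (revision_id, depth, revno, end_of_merge)
        tuples.
    """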

> I suppose if it is an iterator, in theory you can stop iterating when
> you've reached the stop_revision that you care about. Though I think an
> implementation would certainly care about knowing you are going to stop
> after 10 steps, rather than 2000.

Right, but 'forward' can be much faster if this method knows the
stop_revision rather than leaving iteration termination to the layer above.
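
Just to illustrate the difference, caller-side termination ends up
looking roughly like this (a sketch only; it assumes the first tuple
element is the revision id and that a direction keyword exists):

def revisions_up_to(branch, stop_revision_id):
    # Caller-side termination: the branch still merge-sorts the whole
    # graph and we merely stop consuming the iterator, whereas passing
    # stop_revision_id into the method lets it avoid that work.
    for rev_info in branch.iter_merge_sorted_revisions(direction='forward'):
        yield rev_info
        if rev_info[0] == stop_revision_id:
            break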

> My concern with merging it is always the deprecation/incompatibility
> dances. Not to mention helping push people towards efficient apis,
> rather than starting with something that won't scale well.

Sure.

> I'll also note that you aren't caching the result of
> "iter_merge_sorted_revisions()", which means that multiple calls will
> run merge_sort multiple times.
> 
> You *are* still caching 'revision_id_to_revno_map', because that is what
> it did.
> 
> Speaking from experience now, almost all code that would call
> get_revision_id_to_revno_map maintains the result for as long as we
> would need it anyway.
> 
> So my actual recommendation would be to cache the result of
> iter_merge_sorted_revisions and *stop* caching revision_id_to_revno_map.

Right. Are you OK with that caching change coming as a separate patch?
I think getting the API stable and landed is a useful step in its own
right, because then log/missing/etc. can begin using the new API.
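
Something like the following is what I'd expect that follow-up patch
to look like. Again, only a sketch: the cache attribute name is a
placeholder and _do_merge_sort() stands in for whatever produces the
raw merge_sort output.

def _merge_sorted_revisions(self):
    # Compute the merge-sorted list at most once per branch object.
    # Writes would need to clear the cache attribute wherever the
    # branch already clears its other cached state.
    if self._merge_sorted_revisions_cache is None:
        self._merge_sorted_revisions_cache = list(self._do_merge_sort())
    return self._merge_sorted_revisions_cache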

> Even better would be to just have get_revision_id_to_revno_map return a
> proxy around the existing dictionary. Something like:
> 
> class _MergeSortedToRevno(object):
> 
>   def __init__(self, merge_sorted):
>     self._merge_sorted = merge_sorted
>     # alternatively
>     #   dict((rev_id, (seq_num, depth, revno, eom))
>     #        for ... in merge_sorted)
> 
>   def __getitem__(self, key):
>     return self._merge_sorted[key][2]
> 
> The cache was never meant to support __setitem__ anyway.

Right.
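
Fleshing that out into something runnable, purely as an illustration
(the dict layout rev_id -> (seq_num, depth, revno, eom) is taken from
your comment above, and the extra read-only methods are just guesses
at what callers might want):

class _MergeSortedToRevno(object):
    """Read-only view of a merge-sorted dict as revision_id -> revno."""

    def __init__(self, merge_sorted):
        # merge_sorted: dict of rev_id -> (seq_num, depth, revno, eom)
        self._merge_sorted = merge_sorted

    def __getitem__(self, revision_id):
        # Expose only the revno element; there is deliberately no
        # __setitem__, since the cache was never meant to be mutated.
        return self._merge_sorted[revision_id][2]

    def __contains__(self, revision_id):
        return revision_id in self._merge_sorted

    def get(self, revision_id, default=None):
        try:
            return self[revision_id]
        except KeyError:
            return default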

> So my final comments:
> 
> 1) I think we need to be thinking sooner rather than later about
> handling partial ancestries.

Shall do next.

> 2) As written now, this doesn't actually win us anything, because it
> doesn't actually cache the merge_sorted results.

Maybe a separate patch?

> 3) I would like to see us not include 'seq_num' as part of the generated
> output. (Explicitly stripping it.) I don't think any code actually makes
> use of it, and it is ill-defined when we are evaluating only partial
> history.

I agree. If anything, it makes chaining of caches harder. The layer
above can always use enumerate() if it needs a sequence number.
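
e.g. something like this, assuming the tuples come back as
(revision_id, depth, revno, end_of_merge) once seq_num is stripped:

def numbered_merge_sorted(branch):
    # Regenerate a sequence number locally rather than relying on one
    # baked into the API output.
    for seq_num, rev_info in enumerate(branch.iter_merge_sorted_revisions()):
        revision_id, depth, revno, end_of_merge = rev_info
        yield (seq_num, revision_id, depth, revno, end_of_merge)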

Ian C.


