[RFD] Gdfo indices (was Re: [Merge] lp:~jameinel/bzr/1.18-faster-iter-ancestry into lp:bzr)

Vincent Ladeuil v.ladeuil+lp at free.fr
Sat Aug 15 11:26:49 BST 2009


I redirect the discussion to the list as I think it's worth
getting out of the merge proposal thread.

>>>>> John A Meinel writes:

    > ...
    >> (gdfo, revid) guarantees (whatever revid scheme is used, they are
    >> arbitrary anyway) that children will appear before parents (gdfo
    >> being sorted biggest first that is).
    >> 
    >> From there it follows that, reading from the tip, you can safely
    >> read pages only *once*, filtering only the relevant revids and
    >> their parents, and identifying the ghosts in that single pass.
    >> 
    >> All intervening revids will appear progressively as you scan
    >> smaller gdfos and you will never have to look back.

    > True, though

    > 1) It doesn't solve the 'clutter' issue, as long as the histories are
    > approximately the same length.

Exactly, that's true in both approaches, which is why I mentioned
it as a (small) "defect" in your approach. In fact, it's not tied
to any approach but rather to the fact that a repository can
contain disjoint revision graphs when several projects share it.

    > 2) It might do something for 'between' packs, and you could
    > do something like figure out the range of possible gdfo's
    > any given index has. (So if you ask for gdfo 1000 you know
    > it isn't in an index with only values >2000.)

That would be yet another level of optimization. On the other
hand, *most* of the time, the connected revids will group into
the same pack file anyway (in the sense that, in terms of graph
levels, you shouldn't need to go back to a previous index when
loading the relevant graph from a given index).
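To make John's point concrete, here is a minimal sketch of that
filtering (the PackIndex class and attribute names are made up for
illustration, not bzr's real index API): if each pack index records
the (min, max) gdfo range it contains, a lookup can skip any index
whose range cannot contain the queried gdfo, e.g. gdfo 1000 against
an index holding only values > 2000.

```python
class PackIndex:
    """Hypothetical pack index that remembers its gdfo range."""

    def __init__(self, name, min_gdfo, max_gdfo):
        self.name = name
        self.min_gdfo = min_gdfo
        self.max_gdfo = max_gdfo


def candidate_indices(indices, gdfo):
    """Return only the indices whose (min, max) gdfo range covers gdfo."""
    return [idx for idx in indices if idx.min_gdfo <= gdfo <= idx.max_gdfo]
```

So a query for gdfo 1000 against indices covering (1, 1500) and
(2000, 3000) would only ever touch the first one.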

In some ghost-filing scenarios, there may be cases where the graph
goes back and forth (several times) between indices, and that may
be a reason to trigger a repack when filing ghosts, but I don't
think we need (nor want) to require that...

    > However, the biggest difficulty is you would have to expose
    > "(gdfo, revision_id)" up to the Branch.last_revision_info,
    > or continue to have a revision_id => gdfo map somewhere.

Correct, that's still the sore point.

On the other hand, once we have that (revid => gdfo) map, we need
to use it only once for any graph operation, as long as the (gdfo,
revid) index gives us parents as (gdfo, revid) too[1]!

So adding the gdfo to the actual revision index (as a property)
shouldn't be a problem.
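A sketch of why the map is needed only once (all names here are
hypothetical, not bzr's KnownGraph API): if the index returns
parents already as (gdfo, revid) pairs, a whole ancestry walk only
consults the revid => gdfo map for the starting tips, and because a
child's gdfo is strictly greater than any of its parents', popping
the largest gdfo first guarantees children appear before parents
without ever looking back.

```python
import heapq

def iter_ancestry(index, tip_gdfos):
    """Walk ancestors of the given {revid: gdfo} tips, children first.

    `index` maps (gdfo, revid) -> list of parent (gdfo, revid) pairs.
    """
    # heapq is a min-heap, so negate gdfo to pop the biggest first.
    heap = [(-gdfo, revid) for revid, gdfo in tip_gdfos.items()]
    heapq.heapify(heap)
    seen = set()
    while heap:
        neg_gdfo, revid = heapq.heappop(heap)
        if revid in seen:
            continue
        seen.add(revid)
        parents = index.get((-neg_gdfo, revid), [])
        yield revid, [p_revid for _, p_revid in parents]
        for p_gdfo, p_revid in parents:
            # Parents arrive with their gdfo attached: no map lookup.
            if p_revid not in seen:
                heapq.heappush(heap, (-p_gdfo, p_revid))
```

Note that the map lookup happens exactly once, to seed `tip_gdfos`;
everything after that rides on the (gdfo, revid) pairs stored in
the index itself.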

With that, the ghost filing problem becomes clearer (at least to
me):

- only the indices need to be updated. I.e. gdfo should appear
  only there, since it really is a mutable property once you take
  ghosts into account, and it's a revision graph property.

- filing a ghost requires that its children's gdfos be updated.
  The revids that need updating have a gdfo in a (min, max) range;
  the min should be easy to get, since the ghost will connect
  somewhere in the existing graph, giving the min value. The max
  gdfo is harder, but in the worst case... we fall back to a
  full-history operation (constrained by the min value, though).
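The propagation step above can be sketched roughly like this
(every name is invented for illustration; real bzr data structures
differ): filing the ghost raises its own gdfo to max(parents) + 1,
which is the min of the affected range, and the increase is then
pushed down to descendants, stopping as soon as a descendant's
gdfo is already large enough.

```python
def file_ghost(ghost, ghost_parents, parents, children, gdfo):
    """File a ghost and propagate the gdfo increase to descendants.

    `parents`/`children` map revid -> list of revids (the graph and
    its inverse); `gdfo` maps revid -> current gdfo.
    """
    parents[ghost] = list(ghost_parents)
    for p in ghost_parents:
        children.setdefault(p, []).append(ghost)
    # The ghost connects into the existing graph here; its new gdfo
    # is the min of the range of revids that may need updating.
    new_gdfo = max((gdfo[p] for p in ghost_parents), default=0) + 1
    if new_gdfo <= gdfo.get(ghost, 0):
        return  # nothing grew, descendants are unaffected
    gdfo[ghost] = new_gdfo
    # Only descendants whose gdfo is now too small need fixing.
    pending = list(children.get(ghost, []))
    while pending:
        rev = pending.pop()
        wanted = max(gdfo[p] for p in parents[rev]) + 1
        if wanted > gdfo[rev]:
            gdfo[rev] = wanted
            pending.extend(children.get(rev, []))
```

The early stop (`wanted > gdfo[rev]`) is what keeps the update
bounded to the (min, max) range instead of a full-history pass in
the common case.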

How does that sound?

    > It is certainly possible, and if we found significant
    > benefits, it would be worth doing.

Agreed there too. In the meantime, your heuristic is really nice
(not sure I made that point clear enough) because it provides us
with a really close approximation.

    > I'll also mention that this sort of api is still
    > beneficial, as it would let you read all the local (gdfo,
    > revision_id) pairs from a single page before you go off to
    > start searching on the next page.

Right.

[1]: Only thought about it yesterday... but that was a 'smack the
head' moment :)
