[RFD] Gdfo indices (was Re: [Merge] lp:~jameinel/bzr/1.18-faster-iter-ancestry into lp:bzr)
Vincent Ladeuil
v.ladeuil+lp at free.fr
Sat Aug 15 11:26:49 BST 2009
I redirect the discussion to the list as I think it's worth
getting out of the merge proposal thread.
>>>>> John A Meinel writes:
> ...
>> (gdfo, revid) guarantees (whatever revid scheme is used, they are
>> arbitrary anyway) that children will appear before parents (gdfo
>> being sorted biggest first that is).
>>
>> From there it comes that reading from the tip, you can, safely,
>> read pages only *once*, filtering only the relevant revids and
>> their parents and identify the ghosts with that single pass.
>>
>> All intervening revids will appear progressively as you scan
>> smaller gdfos and you will never have to look back.
> True, though
> 1) It doesn't solve the 'clutter' issue, as long as the histories are
> approximately the same length.
Exactly, that's true in both approaches; that's why I mentioned
it as a "defect" (even if small) in your approach. In fact, it's
not tied to any approach but really to the fact that a repository
can contain disjoint revision graphs when several projects share
it.
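As a sketch of the ordering property quoted at the top (not bzr's
actual index code; the tiny graph and helper below are
hypothetical), this shows why sorting keys by gdfo, biggest
first, puts every child before all of its parents: a child's
gdfo is strictly greater than any parent's.

```python
# Hypothetical tiny revision graph: revid -> list of parent revids.
graph = {
    'rev-a': [],                 # origin
    'rev-b': ['rev-a'],
    'rev-c': ['rev-a'],
    'rev-d': ['rev-b', 'rev-c'],
}

def compute_gdfo(graph):
    """Greatest distance from origin: 1 + max over all parents."""
    gdfo = {}
    def visit(revid):
        if revid not in gdfo:
            parents = graph[revid]
            gdfo[revid] = 1 + max((visit(p) for p in parents), default=0)
        return gdfo[revid]
    for revid in graph:
        visit(revid)
    return gdfo

gdfo = compute_gdfo(graph)
# Sort biggest-gdfo-first; every child now precedes its parents,
# whatever the (arbitrary) revid scheme is.
ordered = sorted(graph, key=lambda r: (-gdfo[r], r))
assert all(ordered.index(r) < ordered.index(p)
           for r in graph for p in graph[r])
```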
> 2) It might do something for 'between' packs, and you could
> do something like figure out the range of possible gdfo's
> any given index has. (So if you ask for gdfo 1000 you know
> it isn't in an index with only values >2000.)
That would be yet another level of optimization. On the other
hand, *most* of the time, the connected revids will group into
the same pack file anyway (in the sense that, in terms of graph
levels, you shouldn't need to come back to a previous index when
loading the relevant graph from a given index).
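The per-index optimization John describes could be sketched like
this; the dict layout and function name are illustrative, not
bzr's actual API. Each index records the (min, max) gdfo range of
the keys it holds, so a lookup can skip indices whose range
cannot contain the requested gdfo.

```python
# Hypothetical per-pack-index metadata: each entry records the
# range of gdfo values its keys span.
indexes = [
    {'name': 'pack-1', 'min_gdfo': 2000, 'max_gdfo': 5000},
    {'name': 'pack-2', 'min_gdfo': 1,    'max_gdfo': 1500},
]

def candidate_indexes(indexes, gdfo):
    """Return only the indices whose gdfo range could hold `gdfo`."""
    return [ix for ix in indexes
            if ix['min_gdfo'] <= gdfo <= ix['max_gdfo']]

# Asking for gdfo 1000 skips pack-1, which only holds values > 2000.
names = [ix['name'] for ix in candidate_indexes(indexes, 1000)]
assert names == ['pack-2']
```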
In some ghost filing scenarios, there may be cases where the
graph goes back and forth (several times) between indices; that
may be a reason to trigger a repack when filing ghosts, but I
don't think we need (nor want) to require that...
> However, the biggest difficulty is you would have to expose
> "(gdfo, revision_id)" up to the Branch.last_revision_info,
> or continue to have a revision_id => gdfo map somewhere.
Correct, that's still the sore point.
On the other hand, once we have that (revid => gdfo) map, we only
need to use it once for any graph operation, as long as the
(gdfo, revid) index gives us parents as (gdfo, revid) too[1]!
So adding the gdfo in the actual revision index (as a property)
shouldn't be a problem.
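A minimal sketch of that traversal, with a hypothetical in-memory
index (none of these names are bzr's): the revid => gdfo map is
consulted exactly once, at the tip, and everything afterwards
stays in (gdfo, revid) space because the index returns parents as
(gdfo, revid) keys too.

```python
import heapq

# The map we'd like to consult only once per graph operation.
revid_to_gdfo = {'rev-d': 3, 'rev-b': 2, 'rev-c': 2, 'rev-a': 1}

# Index keyed by (gdfo, revid); parents stored as (gdfo, revid) too.
index = {
    (3, 'rev-d'): [(2, 'rev-b'), (2, 'rev-c')],
    (2, 'rev-b'): [(1, 'rev-a')],
    (2, 'rev-c'): [(1, 'rev-a')],
    (1, 'rev-a'): [],
}

def iter_ancestry(tip_revid):
    """Yield (gdfo, revid) keys from the tip down, never looking back."""
    start = (revid_to_gdfo[tip_revid], tip_revid)  # the only map lookup
    heap, seen = [(-start[0], start[1])], {start}
    while heap:
        neg_gdfo, revid = heapq.heappop(heap)  # biggest gdfo first
        yield (-neg_gdfo, revid)
        for parent in index[(-neg_gdfo, revid)]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-parent[0], parent[1]))

ancestry = list(iter_ancestry('rev-d'))
```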
With that, the ghost filing problem becomes clearer (at least
to me):
- only the indices need to be updated, i.e. gdfo should appear
  only there, since it really is a mutable property once you take
  ghosts into account, and it's a revision graph property.
- filing a ghost requires that its children's gdfos be updated.
  The revids that need to be updated have a gdfo in a (min, max)
  range. The min should be easy to get: the ghost will connect
  somewhere in the existing graph, giving the min value. The max
  gdfo is harder, but in the worst case... we'll fall back to a
  full history operation (constrained by the min value though).
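The update step could be sketched as follows; `file_ghost` and
the in-memory tables are hypothetical, but they show the min
bound in action: only descendants whose recorded gdfo is now too
small get touched, and the walk stops as soon as a gdfo no longer
grows.

```python
# Hypothetical state: 'ghost' is being filed and its real
# ancestry gives it gdfo 5; 'rev-x' was recorded with gdfo 1
# while 'ghost' was still missing.
gdfo = {'ghost': 5, 'rev-x': 1, 'rev-y': 2}
children = {'ghost': ['rev-x'], 'rev-x': ['rev-y'], 'rev-y': []}

def file_ghost(filed):
    """Propagate gdfo increases from a filed ghost to its descendants."""
    pending = [filed]
    while pending:
        revid = pending.pop()
        for child in children[revid]:
            new_gdfo = gdfo[revid] + 1
            if new_gdfo > gdfo[child]:   # min bound: only ever raise gdfo
                gdfo[child] = new_gdfo
                pending.append(child)    # keep walking toward the max bound

file_ghost('ghost')
assert gdfo == {'ghost': 5, 'rev-x': 6, 'rev-y': 7}
```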
How does that sound?
> It is certainly possible, and if we found significant
> benefits, it would be worth doing.
Agreed there too. In the meantime, your heuristic is really nice
(not sure I made that point clear enough) because it provides us
with a really close approximation.
> I'll also mention that this sort of api is still
> beneficial, as it would let you read all the local (gdfo,
> revision_id) pairs from a single page before you go off to
> start searching on the next page.
Right.
[1]: I only thought about it yesterday... but that was a 'smack
my head' moment :)