[RFD] Gdfo indices (was Re: [Merge] lp:~jameinel/bzr/1.18-faster-iter-ancestry into lp:bzr)

Mon Aug 24 19:24:44 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> 
>     > True, though
> 
>     > 1) It doesn't solve the 'clutter' issue, as long as the histories are
>     > approximately the same length.
> 
> Exactly, that's true in both approaches, that why I mentioned it
> as a "defect" (even if small) in your approach. In fact, it's not
> tied to any approach but really to the fact that a repository can
> contain disjoint revision graphs when several projects share it.

So if you wanted to avoid this sort of thing, you could sort the graphs
by, say, initial revision id. You could even teach 'bzr pack' to just
group things by ancestries into separate pack files. Which would give
you most of the win, without needing any disk format changes.

...
>     > However, the biggest difficulty is you would have to expose
>     > "(gdfo, revision_id)" up to the Branch.last_revision_info,
>     > or continue to have a revision_id => gdfo map somewhere.
> 
> Correct, that's still the sore point.
> 
> On the other hand, once we have that (revid => gdfo) map, we need
> to use it only once for any graph operation as long as the (gdfo,
> revid) index give us parents as (gdfo, revid) too[1] ! 
> 
> So adding the gdfo in the actual revision index (as a property)
> shouldn't be a problem.

As a property?

The main downside is that you now need to maintain 2 indices. One to get
you started, and then one get continue operating on (gdfo, revid) pairs.

I'm unconvinced that it will be much better, rather than just adding
gdfo to the existing revision_id => everything else layout. It might,
but we'd need to see real benefit to be worth while.

As an alternative we could store 'gdfo' as part of the Branch's
last_revision. And change all of our internal apis to look something like:

gdfo, revision_id, revno = branch.last_revision_info()
repo.revision_tree((gdfo, revision_id))

You can make things a little bit cleaner if you just made 'gdfo' the
first part of a revision *key*. Though again indices assume that keys
are strings. And while ('1', 'foo') may sort better, it still sorts
right next to ('1000', 'bar') :)

> 
> With that, the ghost filing problem becomes more clear (at least
> to me):
> 
> - only the indices need to be updated. I.e. gdfo should appear
>   only there since, it really is a mutable property when you take
>   ghosts into account and it's a revision graph property.
> 
> - filing a ghost requires that its children gdfos be updated. The
>   revids that need to be updated have a gdfo in a (min, max)
>   range, the min should be easy to get: the ghosts will connect
>   somewhere in the existing graph giving the min value. The max
>   gdfo is harder, but in the worst case... we'll fall back to a
>   full history operation (constrained by the min value though).
> 
> How does that sound ?

I agree that it could be stored purely in the index. And that we could
update 'lazily'. Basically, go to all children, update them, and
propagate that.

Then again, finding children in our indexes is an O(N) issue, because
you have to read every revision to find out if it is actually a child.
And I can tell you, *numbering* the gdfo is at least an order of
magnitude faster than reading them all in.

On OOo and 200+k revisions, it only takes 1.0s to *merge_sort* which is
effectively 3 passes over the data. So you are talking... 0.3s to
compute gdfo for all nodes. The main cost we are trying to avoid is the
10+s to read all the nodes.

So the other alternative is to extend our existing structure to add
child pointers to the indexes. For example, when you add a node pointing
at a parent, you also add that parent with a node pointing at the known
list of children.

The main problem is that this is still a 'read something from every
index' because you don't know which indexes would have a child for your
particular revision. It is, though, a "read 1 location" in the index,
rather than reading the whole index.

I just wouldn't make it part of the default return values for an index.
(eg. When doing get_parent_map() don't return child pointers unless
specifically asked.)

Some other bits...

1) Instead of just (gdfo, revision_id), you could sort based on
(first_parent, gdfo, revision_id). Where 'first_parent' is the first
revision that descends from the origin which is your path to gdfo. For
example:

 A B
 | |
 C D
 | |
 E |
 | |
 F |
 |/
 G

In this case you have:
(A, 1, A)
(B, 1, B)
(A, 2, C)
(B, 2, D)
(A, 3, E)
(A, 4, F)
(A, 5, G)

You could do all sorts of things to make that initial value 'cheaper'.
If it is meant to be a hint rather than a strong value, you could take
8-bits of a hash of revision id (or 16, 32, etc).

You still have the problem of needing a way to enter *into* this graph.
As at the moment, we use a simple 'revision_id'.

Oh, and when I said you could just save the full info in
Branch.last_revision() that only works well for new format bzr branches,
and doesn't work well for:
 1) ghosts differing from source to target, since the values would be
different.
 2) foreign repository formats and older bzr formats, which then have to
recompute these values.

I also don't know the general benefit for all operations. I think for
*graph walking* operations, it is very good. As such, though, I'm most
interested in adding another index as a 'cache' of this information.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqS2uwACgkQJdeBCYSNAAMUNwCaAwCxOsV0dCT+NRxHA0LXSWWu
zjMAn0jFqOiIxDqHgWRhPSdeond0Nm7g
=rX0q
-----END PGP SIGNATURE-----