[MERGE] introduce is_ancestor, avoid using get_ancestry

Fri Jul 27 17:24:06 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Anyway, it would be nice to have a tiny bit of benchmarking on this.

Why do you want that?  This isn't about making anything faster, it's
about avoiding an API that's necessarily slow.

> I would
> also argue that we might consider deprecating get_ancestry.

I would love to, but it has other users that we need to correct first.

> The only valid user
> I know of is "bzr ancestry", which could potentially do it in a different way.

Yes.  It seems reasonable to support Graph.iter_ancestors, so when
that's implemented, we could do:

graph.iter_topo(graph.iter_ancestors(revision))

> I'm trying to understand your 'is_ancestor' code. At a minimum, it needs a
> docstring. And certainly it could use a bit of commenting to help understand
> what you are trying to do.
> 
> Some questions:
> 
> 1) How does the overall search work?

So there are two possible outcomes: True and False, but there are three
possible relationships:

a) candidate_ancestor is an ancestor of candidate_descendant
b) candidate_ancestor is an descendant of candidate_descendant
c) candidate_ancestor is an sibling of candidate_descendant

To check for a, we walk from candidate_descendant, looking for
candidate_ancestor.

To check for b, we walk from candidate_ancestor, looking for
candidate_descendant.

To make a and b more efficient, we can stop any searches that hit common
ancestors.

If we exhaust our searches, but neither a or b is true, then c is true.

In order to find c efficiently, we must avoid searching from
candidate_descendant or candidate_ancestor into common ancestors.  But
if we don't search common ancestors at all, we won't know if we hit
common ancestors.  So we have a walker for common ancestors.  Note that
its searches are not required to terminate in order to determine c to be
true.

> The things that concern me are things like:
> 
>   for node in common:
>     c_ancestors = walker.find_seen_ancestors(node)
>     walker.stop_searching_any(c_ancestors)
> 
> Which seems like it has to do another search for every common ancestor that is
> found. (Which might be moderately cheap, I'm not really sure).

We must stop searches into common ancestors, but that includes stopping
searches of descendants of new common ancestors that we find.  If we
only stop searches of the common ancestors that we find, then searches
into their descendants will contine, which defeats the purpose.

There is no efficient way of storing information about the ancestors of
all nodes in a graph.  If you try to store a map listing all ancestors
of node A, and all ancestors of all ancestors of node A, you get
exponential scaling of time and space.

The best format I'm aware of for storing such information is a graph.
And we've already got one.

And yes, it's cheap to traverse bits of the graph that we've already
seen-- The current implementation is backed by the KnitIndex.  The new
implementation will be backed by the GraphIndex's caches.  And if we
find we need even more caching, we can insert that at the Graph level or
lower.

> 2) Why do you reset the 'common' set every pass through the loop?

It's a set of newly-found common nodes.  Not clearing it would make the
set intersections more expensive.  But I can call it new_common instead.

> I'm a little concerned about the different walkers 'missing' the common entries
> (one finds it just after the other one, etc). But that is probably all caught
> by the 'find_seen_ancestors' check.

Yes, I don't think there's a hole there.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGqhwm0F+nu1YWqI0RAmC9AJ9yBgVroaIoTWU9k/j9gUyXkT74pwCeM2EM
YZtFxTx081b/oljd6hj3wxc=
=RTKy
-----END PGP SIGNATURE-----