[MERGE] Graph.heads()

Robert Collins robertc at robertcollins.net
Wed Aug 22 23:25:27 BST 2007


On Wed, 2007-08-22 at 12:19 -0500, John Arbash Meinel wrote:
> Robert Collins wrote:
> > Robert Collins has voted resubmit.
> > Status is now: Resubmit
> > Comment:
> > In packs we have a larger graph space.
> > Rather than one graph per file, we have one graph for the repository,
> > with pairs (file_id, revision_id) as keys.
> > 
> > It would be nice to cut a couple of layers out and work directly in that
> > space. If not, then at least the repository should be able to do that
> > itself and translate only the result - so I'd really prefer to see this
> > functionality on repository.
> 
> Well, a few bits to consider...
> 
> Are you saying that the actual .heads() call should not be on Graph?

No. I'm saying it should not be on a single-key graph layered on a
multi-key graph. There's a lot of unnecessary translation there.

> I would *like* to have the file graph call be:
> 
> g = Repository.get_file_graph(file_id)
> heads = g.heads(revision_ids)

I am happy with this as long as the returned graph uses (file_id,
revision_id) node ids for pack repositories. What that means concretely
is probably a subclass of Graph where heads(revision_ids) is roughly

  return self._heads([(self._file_id, r) for r in revision_ids])

for pack repositories.
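
To make that concrete, here is a minimal sketch (the class name, the
_file_id attribute, and the shape of the delegated heads() call are all
hypothetical, not existing API):

  class _PerFileGraph(object):
      # Stands in for the Graph subclass described above: a
      # single-file view over the repository-wide multi-key graph.

      def __init__(self, file_id, repo_graph):
          self._file_id = file_id
          self._repo_graph = repo_graph  # keyed by (file_id, revision_id)

      def heads(self, revision_ids):
          # Translate into the compound key space, query the one
          # repository-wide graph, then strip the file_id back off.
          keys = [(self._file_id, rev_id) for rev_id in revision_ids]
          return set(rev_id for _, rev_id in self._repo_graph.heads(keys))

The translation then lives in one thin adapter at the repository
boundary instead of a whole single-key graph layered on the multi-key
one.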

> Alternatively, we could put the whole thing on Repository with
> 
> heads = Repository.get_file_graph_heads(file_id, revision_ids)

I think there would be a smell to ending up with a lot of
get_file_graph_heads(file_id, ...)-style methods, because it's
obviously repetitive.

> However, all of this is moot, because it can't actually be used by
> Inventory.find_previous_heads. Because that is passed a
> "versioned_file_store" and not a repository.

So change the API there :). I have a new iterator I'm building for Ian
which does not work in terms of InventoryEntries at all, so that
dirstate can do funky fast logic. This pushes the list of last_modified
revisions up the stack to commit.py.

> Now, I suppose we could push it up further in the stack. Such that these
> lines in CommitBuilder:
> 
>   previous_entries = ie.find_previous_heads(
>       parent_invs,
>       self.repository.weave_store,
>       self.repository.get_transaction())
>   # we are creating a new revision for ie in the history store
>   # and inventory.
>   ie.snapshot(self._new_revision_id, path, previous_entries, tree, self)
> 
> 
> Change to something like:
> candidates = {}
> for inv in parent_invs:
>     if ie.file_id in inv:
>       ... # Reproduce all the sanity checking in find_previous_heads
> 
>       candidates[ie.revision] = ie
> heads = self.repository.get_file_graph_heads(ie.file_id,
> 				             candidates.keys())
> previous_entries = [candidates[rev] for rev in heads]
> ie.snapshot(self._new_revision_id, path, previous_entries, tree, self)

Something like that, sure.
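
Spelled out, and hedged: assuming a hypothetical
Repository.get_file_graph_heads(file_id, revision_ids) method, and
noting that the candidates should be the *parent* inventories' entries
(not the new ie), it would look roughly like:

  candidates = {}
  for inv in parent_invs:
      if ie.file_id in inv:
          parent_ie = inv[ie.file_id]
          # ... reproduce the sanity checking from find_previous_heads
          candidates[parent_ie.revision] = parent_ie
  heads = self.repository.get_file_graph_heads(ie.file_id,
                                               candidates.keys())
  previous_entries = [candidates[rev] for rev in heads]
  ie.snapshot(self._new_revision_id, path, previous_entries, tree, self)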

> Now, it would make the most sense to have a new function on ie that acts
> *just like* InventoryEntry.find_previous_heads, but that takes a
> Repository, rather than taking a versioned file store.

I think it's better to have CommitBuilder talk to the repository,
rather than Inventory, for this. Because CommitBuilder is made by the
Repository, the repository can choose a CommitBuilder that does what it
needs.
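
A sketch of what that buys (all names below are hypothetical, not
existing API):

  class PackRepository(Repository):

      def get_commit_builder(self, *args, **kwargs):
          # Hand out a builder that knows this repository's key space.
          return PackCommitBuilder(self, *args, **kwargs)

  class PackCommitBuilder(CommitBuilder):

      def find_file_heads(self, file_id, revision_ids):
          # Query the single repository-wide graph with compound keys;
          # no per-file Graph object is ever constructed.
          keys = [(file_id, rev_id) for rev_id in revision_ids]
          return set(rev_id
                     for _, rev_id in self.repository._graph.heads(keys))

commit.py just asks whatever builder it was given; Inventory never
needs to know which kind of repository is underneath.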

Another thing to consider is that querying the per-file graph is not
needed. The per-file nodes are a subset of the revision graph nodes,
and they have the same topological order. I suspect a little theory
work will prove that a per-file graph node N is an ancestor of A iff
revision graph node N is an ancestor of A in the revision graph. If so
(and please do check it), querying the revision graph is enough to
answer the question 'what are the per-file heads'. And that gives us
one graph, which we can cache in memory and add data to as queries
come in that we have not yet read enough to answer.
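
If that equivalence holds, the whole per-file query collapses to
something like the following (hedged: get_graph() and heads() stand in
for whatever accessors we settle on, and the equivalence itself still
needs proving):

  # One revision graph for the whole repository, cacheable across
  # every file touched by the commit.
  graph = repository.get_graph()
  # candidates maps per-file last-modified revisions gathered from the
  # parent inventories; by the conjecture, their heads in the revision
  # graph equal their heads in the per-file graph.
  previous_heads = graph.heads(candidates.keys())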

> Also, I think Robert really wants to rewrite the whole code so he can do
> things in "batches". I'm not really sure how to make that all work,
> though. I think it is better to have an intelligently cached index,
> rather than trying to figure out how to pre-compute all the files that
> you are going to care about, and then make requests for all of those in
> a streaming fashion.

It would be nice to do this, but I'm also not sure of the right way to
structure or do it. It would make commits on lightweight, remote
checkouts much faster, by at least RTT*files-we-make-this-query-on. If
my suspicion about being able to use the revision graph as a proxy to
answer 'has a file graph merge occurred' is correct, then we don't need
to try to batch these calls; just cache the revision graph.
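
One possible shape for that cache (a sketch with hypothetical names,
in the parents-provider style that Graph queries through):

  class CachingParentsProvider(object):
      # Remember parent lists across heads() queries so repeated
      # graph walks never re-read the same node.

      def __init__(self, real_provider):
          self._real = real_provider
          self._cache = {}

      def get_parents(self, revision_ids):
          missing = [r for r in revision_ids if r not in self._cache]
          if missing:
              # One batched read for everything not seen yet.
              for rev_id, parents in zip(missing,
                                         self._real.get_parents(missing)):
                  self._cache[rev_id] = parents
          return [self._cache[r] for r in revision_ids]

A remote commit would then pay one round trip per batch of unseen
revisions rather than one per file.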

> 
> I would really like to get something like this merged, so it would be
> nice to understand what you think it would take to do so.
> 
> I'd also like to comment that since Graph already has the
> functionality for computing heads (it needs it for the least common
> ancestor code), it would be really good to at least expose it.

I'm not against exposing it; I'm against entrenching the idea that
every file has a separate graph with single-element keys, because that
makes it harder to optimise, and tends to fragment things best done as
a single query.

e.g. I'd prefer

  g = repository.get_file_graph()
  heads = g.heads([(file_id, revision_id1), (file_id, revision_id2)])

(assuming my idea above about not needing this at all turns out to be
wrong).

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.