[MERGE] Graph.heads()

Mon Aug 27 23:04:25 BST 2007

...[conjecture about file graph relationship to revision graph]
> I haven't been able to come up with a counterexample yet. I'm not positive that
> it isn't possible, though.
> 
> The way I thought of it, was that given 2 nodes, either they are both heads, or
> one of them supersedes the other. And in a subset graph, you are missing nodes,
> such that you don't have linkages you might otherwise have. (Causing a node to
> not be considered a head when it really is.)
> 
> But there is also this property:
> The last-modified value for a file is either the revision id of the node it is
> introduced, or the value of a parent. If it is the value of a parent, then all
> other parents must not have modified the file.
> 
> So in the per-file graph all potentially interesting nodes have to be present.
> 
> For example, in the revision graph:
> 
>   A
>   |\
>   B C
>   |\|
>   D E
>   |/
>   F
> 
> 
> If a given file in 'E' could have a last-modified of C, and in 'D' it has a
> last-modified of 'B'. Then merging E and D would say that you have 2 heads
> (B,C). Which seems incorrect because revision_id 'E' supersedes 'B'. But:
> 
> a) Under our current model, if you have a last-modified of B and C, then E is
> required to have a last-modified of E.
> b) Even if you considered E as "bzr revert file" and we left the text marked as
> C, and didn't update the record, it would mean that you need to resolve the B,C
> conflict in F.
> 
> 
> So I haven't done a rigorous proof, but after a few hours of trying different
> combinations, I wasn't able to come up with a counterexample.
> 
> For even more assurance, we could actually run tests on some of our
> repositories. Looking at the values given for each file graph, and comparing it
> to what you would get for the revision graph as a whole.
> 
> That said, a per-file graph (having less nodes) could be significantly faster
> to crawl through, than the whole tree graph.
> 
> One could imagine a file being introduced in revision 1, and then not modified
> until revision 10,000 (where it was modified in 2 branches simultaneously).
> The per-file graph would have 3 nodes, while you would have to search through
> the entire repository graph to encounter those three nodes.
> 
> More important, I guess, is the general/average case. And determining the cost
> of a round trip to read the individual file case, versus traversing the average
> whole-tree graph.

I have been thinking about this too. Quick definition to be sure: heads
of a list is arrived at by removing any element of the list that can be
reached in the same graph from another element of the list.

There are two cases where heads on a list of revision nodes might give a
different answer than heads on a list of file graph nodes. One is when a
node is reachable from another node in the revision graph, but not in
the file graph (as you pointed out above). The other is when a node is
reachable from another node in the file graph, but not from the revision
graph. The former takes a list [A, B] that we currently would use as [A,
B] and gives [A] or [B]. The latter case takes a list [A, B] that we
currently would use as [A] or [B] and gives [A, B]. 

Given a revision graph R and any file graph F for an arbitrary file:
        1. We only introduce a new node N in F when introducing the same
        new node N in R.
        2. By (1) the set of nodes in F is a subset of the nodes in R.
        3. The parents of the new node N in F are always the heads of
        the nodes of F last introduced by each parent of the new node N
        in R. (Note that taking the heads does not reduce the reachable
        nodes within the graph, it just gives the minimum list for doing
        so)
        4. By (1, 3) the ancestry for any node N in F is a subset of the
        ancestry for N in R. So for any pair of nodes (A, B) where A is
        reachable from B in F, A is reachable from B in R.
        5. We always introduce a new node N in F when introducing a new
        node N in R if the heads in F of the last introduced nodes in F
        for each parent of N in R has a length greater than 1.
        6. By (3, 5) any new node N in F has as ancestors all nodes in F
        that are in the ancestry of N in R.

If we assume the existence of a pair of nodes (A, B) where A is
reachable from B in R, but A is not reachable from B in F:
        7. To support this there must exist a node N in R which provides
        the reachability from B in R but does not provide it to F.
        8. By (6) A would be reachable from B in F.

Reductio ad absurdum - there cannot be a pair of node (A,B) where A is
reachable from B in R, and A is not reachable from B in F. So for any
pair of nodes (A, B) where A is reachable from B in R, A is reachable
from B in F.

Which gives use the two matching ancestry assertions we needed - its
totally fine to use heads() calls in the revision graph as proxies for
heads() calls in the file graph.

Anyone care to check I haven't skipped a step ?

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070828/a01ceaec/attachment-0001.pgp