Rev 40: Some discussion on how we could do updates to the graph. in http://bzr.arbash-meinel.com/plugins/history_db
John Arbash Meinel
john at arbash-meinel.com
Mon Apr 5 17:02:01 BST 2010
At http://bzr.arbash-meinel.com/plugins/history_db
------------------------------------------------------------
revno: 40
revision-id: john at arbash-meinel.com-20100405160148-1frpoq89fj77dhbw
parent: john at arbash-meinel.com-20100405135447-zdtpch683yd0wmes
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: history_db
timestamp: Mon 2010-04-05 11:01:48 -0500
message:
Some discussion on how we could do updates to the graph.
For now it is mostly a discussion of how to determine what ancestry needs
to be merged, and a small amount of detail on the difficulties of determining
dotted revnos with a limited view of ancestry.
-------------- next part --------------
=== added file '.bzrignore'
--- a/.bzrignore 1970-01-01 00:00:00 +0000
+++ b/.bzrignore 2010-04-05 16:01:48 +0000
@@ -0,0 +1,1 @@
+./history_denormalization.html
=== modified file 'history_denormalization.txt'
--- a/history_denormalization.txt 2010-04-02 22:17:01 +0000
+++ b/history_denormalization.txt 2010-04-05 16:01:48 +0000
@@ -233,6 +233,7 @@
a) Paths along the graph do not always follow particularly 'linearly'.
Consider
+
i) Branch 1 is a single patch against rev 1000.
ii) The mainline is developed for 3 months, with lots of changes landing.
iii) Branch 1 is updated with a second patch, and lands on mainline rev
@@ -253,6 +254,236 @@
X is in the ancestry of tip T, then the dotted revno for X will be the same
for all descendents of T.
+
+Partial ``merge_sort`` Computation
+==================================
+
+This section tries to discuss the issues and design of how to compute dotted
+revnos and the overal merge-sorted graph, given a desire to avoid walking all
+history.
+
+The current merge-sorted design is achieved by walking the ancestry graph. A
+'pending' stack is created. Parents are put on the stack, with the left-most
+parent at the top of the stack. Once all parents have been put into pending,
+the top node can is ready to be numbered and can be removed. For example::
+
+ A
+ |\
+ B C
+ |/|
+ D E
+ | |
+ | F
+ |/
+ G
+
+G is first pushed onto the stack. We then push F and D onto the stack (GFD).
+We then go to D, and push C and B (GFDCB). And then go to B and push A. When
+we get to A (GFDCBA), there are no more parents, so A is ready to be numbered
+(0, 0, 1), and it is popped off and marked completed. After that we get to B,
+which is also complete (0, 0, 2). C is also ready. When we pushed it, we noted
+that it was a right-hand parent, so its ``merge_depth`` gets incremented by 1
+vs the original depth of 0. Thus it needs a dotted revno, and gets (1, 1, 1).
+D is then ready to be marked (0, 0, 3). When we get to F, we have to push its
+parent E onto the stack (GFE), when we switch to E, we see its parents are
+already numbered, and it gets (1, 1, 2). F is then ready and gets (1, 1, 3). G
+is then the last entry and gets (0, 0, 4). As a simplification, all numbers
+starting with (0, 0, ?) get truncated down to just (?,).
+
+Difficulties
+------------
+
+The main difficulties I see are determining:
+ 1) First child
+ 2) ``_revno_to_branch_count``
+
+The former is used to determine whether we do a simple last-digit update
+(1,1,1) => (1,1,2), or whether we have to allocate a new branch number. Both
+values are restricted to the current ancestry (so it cannot be simply cached),
+also, complex ancestries make it difficult to track. Consider::
+
+ A
+ |\
+ | \
+ | \
+ | B
+ | /|
+ | C E
+ |/ X
+ D / G
+ |/ /|
+ F H I
+ | |/
+ | J
+ |/
+ K
+
+In this case, the numbers are::
+
+ A 1
+ B 1.1.1
+ C 1.1.2
+ D 2
+ E 1.2.1
+ F 3
+ G 1.1.3
+ H 1.1.4
+ I 1.3.1
+ J 1.1.5
+ K 4
+
+The tricks are that G is still the first child with C as a left-hand parent,
+so it continues the simple sequence. E lands into mainline before I, so I gets
+bumped to the next available branch number, which ends up being 3 (not a
+simple increment from its parent G).
+
+It should be possible to generate a 'bounds' for the ancestry we need to look
+at, though. Specifically, we know that C landed in D and I landed in K. Thus
+we only need to look at revisions that landed in D, F, K. If it landed before
+D, then C's number would be higher. If it lands after K, then it doesn't
+matter for I. We do need to consider all the revs in D, though, because of
+stuff like::
+
+ A
+ |\
+ | \
+ | \
+ | B
+ | /|
+ | C D
+ | |X
+ | E G
+ |/ /|
+ F H I
+ | |/
+ | J
+ |/
+ K
+
+In the above, D causes a bump to the branch number that I needs to be aware
+of.
+
+Avoiding All Ancestry
+---------------------
+
+The other trick is that when we see a new merge, how *much* of the ancestry do
+we need to bring in. It should be possible to use a couple tricks to limit the
+range.
+
+ 1) gdfo. In the graph directly above, the gdfo numberings are::
+
+ 1 A
+ 2 B
+ 3 C D
+ 4 E G
+ 5 F H I
+ 6 J
+ 7 K
+
+ This doesn't help a huge amount, because I think we still need to walk the
+ mainline back to A. Because while walking the left-hand ancestry of K, we
+ get back to G with gdfo of 4. Because gdfo(G) < gdfo(F) it could have been
+ merged there. We don't have to read the ancestry of A, though.
+
+ 2) 'known children'. There is another complication when you have a new 'root'
+ ancestry. Consider::
+
+ :
+ A
+ |
+ B C
+ | |
+ | D
+ |/
+ E
+
+ C is a new line of development (or a ghost), which ends up with a
+ gdfo(C)=1. As such, we would potentially have to walk the whole ancestry
+ priory to A, until we get back to the root revision. Because C *could
+ have* been merged into any of them.
+
+ One possibility is to notice that the only known child of C is D, which is a
+ revision we are already including in our ancestry search. Note that this
+ probably should also cover slightly more complex cases like::
+
+ :
+ A
+ |
+ B C
+ | |\
+ | D E
+ | |/
+ | F
+ |/
+ G
+
+ Where the descendants of C are not trivial, but they are 'contained'. Note
+ that probing for known children should be a simple index lookup (``SELECT
+ child FROM parent WHERE parent=?``). However, that can give answers
+ outside of the scope of the ancestry we currently care about. eg, in the
+ above example, if E did not merge into F, but instead merged into a
+ descendant (H) of G. It would not affect the numbering of C,D,F. Then
+ again, if we walked the children of E and saw that the gdfo of H was >
+ gdfo(G) then we could also prune that as uninteresting. Actually, as long
+ as gdfo(H) > the mainline ancestry we have searched so far, we should be
+ done.
+
+ One possible way to implement the search would be to start at G. In step
+ 1, we can find B and F. We should know that B is on the mainline, so we
+ focus on F. Our current bounds for 'interesting' children is < gdfo(B). We
+ lookup known children of F, and find only G, so the bounds set by B is
+ sufficient. We then walk back to D & E. The bounds set by B is still
+ sufficient, since there are no interesting outbound links for either
+ revision. (gdfo(F) < gdfo(B), however F is already marked as not having an
+ children with gdfo(child) < gdfo(B).) As such, both D and E should get
+ marked that their children ancestry is satisfied, similarly walking to C
+ finishes the result.
+
+ 3) Combining the two, lets look again at::
+
+ A
+ |\
+ | \
+ | \
+ | B
+ | /|
+ | C D
+ | |X
+ | E G
+ |/ /|\
+ F H I :
+ | |/
+ | J
+ |/
+ K
+
+ Starting at K, we grab F & J. gdfo(F) = 5. gdfo(K) = 7, gdfo(J) = 6. Since
+ gdfo(J) = 6, we know that it cannot be merged into F (nor can any of its
+ descendents). So J gets marked as "interesting" (it is an ancestor of K),
+ but "complete" (it could not have been merged into a mainline revision
+ that we have not fetched).
+
+ Next we step to H & I, both with gdfo = 5. These also get marked
+ "interesting & complete". We then walk to G gdfo=4. We check the known
+ children of G, and see that they are all marked either complete, or have
+ gdfo >= 5 (the hypothetical children off of ':' all have to have gdfo >=
+ 5.) The only other possible answer here would be that F was an immediate
+ descendent of G. G then gets marked "interesting & complete". We then
+ walk to C gdfo=3. We then see that C has a child E that is not interesting
+ & complete, and has a gdfo < 5. We could probably take a couple different
+ tacks.
+
+ a) Walk children of E until we see that it either reaches gdfo >= 5, or
+ is, in fact, and ancestor of F (as is the case here). Or
+ b) Step F's dotted_revno information, loading all of its ancestors, and
+ setting the new gdfo threshold to A (1). At the same time, we can
+ mark all of the ancestors as 'not interesting' because we know they
+ are ancestors of F.
+
+ I don't have a great feeling for (a) vs (b). On the one hand, we will need
+ to page in the dotted_revno info anyway, because we'll use it to generate
+ the revnos for G, etc.
+
..
vim: ft=rst tw=78 ai
More information about the bazaar-commits
mailing list