history-db, caching dotted revnos, incremental merge_sort

John Arbash Meinel john at arbash-meinel.com
Fri Apr 9 23:02:50 BST 2010


Some updates about the bzr-history-db plugin I've been working on:
  lp:~jameinel/+junk/bzr-history-db

I finally got 'incremental merge sort' working, and performing fairly
well. I'm pretty sure I've gotten it down to a minimal amount of
ancestry being read in order to correctly compute merge_sort and the
dotted revnos. At a minimum, I know that it is finally faster to do an
incremental merge sort to expand bzr.dev than to compute KG.merge_sort
from scratch each time (now 2m16s rather than 3m30s, though my initial
'correct' implementation was all the way up at 11min).
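
For reference, the from-scratch computation is roughly the following (a
minimal sketch against the bzrlib 2.x KnownGraph API; locking shown,
error handling omitted):

  from bzrlib.branch import Branch

  b = Branch.open('.')  # any branch with the full ancestry available
  b.lock_read()
  try:
    tip = b.last_revision()
    kg = b.repository.get_known_graph_ancestry([tip])
    # One node per revision, newest first; each node carries .key,
    # .merge_depth, .revno (the dotted revno as a tuple), and
    # .end_of_merge.
    ms = kg.merge_sort(tip)
  finally:
    b.unlock()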

Anyway, this also gave me some interesting insights into merge_sort and
dotted_revno behavior, which I'll try to share.

1) For bzr.dev, finding the interesting ancestry seems to be more
   expensive than computing dotted_revnos, at least measured by "how
   many times do I have to walk the mainline + merged info". The
   breakdown of times to step the mainline pointer is:

   701,174  total times
   649,184  times because I couldn't tell if the current search tips
            could have been merged earlier
    32,509  times to ensure that I have a LH ancestor to number from
     7,597  times to get the latest NULL root branch count (0,22,1),
            etc.
     7,171  times to make sure I have the most recent branch count (to
            ensure that I get 1.5.1 rather than 1.2.1 when generating a
            new branch.) [see also (3)]
     4,713  as the first step because I'm importing 4.7k branches (I
            know I'll always need to step at least once, so I start with
            that)

   So yes, there is some overhead to compute dotted revnos. But 93% of
   the mainline steps are just to determine that the revisions haven't
   been merged yet.

   Some of that is due to some very long-lived integration branches. 2.0
   and 2.1 each span about 200 trunk revisions. Not to mention bzr.ab,
   where Aaron branched off of bzr.dev@1551 and the most recent merge
   was at bzr.dev@4096 (yeah, spanning 2.5k revisions).

   We may be able to do better here, I'm not sure what we can assume.


   The breakdown for MySQL followed a similar pattern:
   1,865,862 total
   1,835,125 'unknown' if merged
      17,095 initial steps for 17k branches
      13,591 ensure LH
          87 find the latest branch counter

2) The #1 way search tips get filtered is that I've walked the mainline
   far enough back that the gdfo tells me the tip could not have been
   merged.

   bzr.dev    mysql   cause
   730,354  492,422   gdfo(search_tip) >= gdfo(mainline)
    30,358   71,601   found dotted_revno(search_tip) while walking
                      mainline
    35,720   34,231   all known children are already in the
                      'interesting' set.
     6,783    5,597   found dotted_revno(child) while walking mainline
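
   In code, the gdfo cutoff is something like the following (a
   much-simplified sketch, not the plugin's actual loop; 'mainline'
   yields mainline revision ids newest-first, and 'gdfo' is assumed to
   be a precomputed dict mapping revision id -> generation number):

     def find_new_tips(mainline, gdfo, search_tips):
       # A revision can only be an ancestor of revisions with a
       # strictly larger gdfo, and mainline gdfo shrinks as we step
       # back.  So once gdfo(tip) >= gdfo(mainline rev), the tip cannot
       # have been merged there or anywhere earlier: it must be new.
       maybe_merged = set(search_tips)
       new_tips = set()
       for rev in mainline:  # newest first
         for tip in list(maybe_merged):
           if gdfo[tip] >= gdfo[rev]:
             maybe_merged.discard(tip)
             new_tips.add(tip)
         if not maybe_merged:
           break
       return new_tips

   (The 'already merged' side, finding a dotted_revno for a tip while
   walking, is what rows two and four above count.)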

3) I found a very nice trick for computing a new branch counter.
   Specifically, consider (time going down):
     A         1
     |\        |\
     | \       | \
     |  \      |  \
     |   B     |   1.1.1 ------.
     |  /|\    |   |     \      \
     | C E |   |   1.1.2   1.2.1 |
     |/ / /    | /       /      /
     D F H     2   1.2.2   1.3.1
     |/ X      | /      \ /
     G / J     3  .------' 1.2.3
     |/ /|\    | /        /    \
     I / K L   4        1.2.4    1.4.1
     |/  |/    |         |     /
     N   M     5        1.2.5
     |  /
     | /
     |/
     O

  If you start by assuming that N is already imported, you can determine
  that revisions K L M are new pretty quickly. (One step along the
  mainline, and J is obviously already merged; that is the
  dotted_revno(search_tip) case.)

  However, when it comes to numbering L, you can see that you need a new
  branch number, because L is not the first child of J that was merged
  into mainline. But since you've only loaded J, the 'most recently
  merged' descendant of 1 that you know about is just 1.2.X. We need to
  include more ancestry to know whether it is safe to use 1.3.1 (it
  isn't, since H already has it).

  My original thought was that you should load until you find 1.2.1,
  because once you found that revision, you could be sure that any
  descendant of 1 that was merged after the 1.2.* series would have also
  been loaded.

  Using that on bzr ended up causing me to step the mainline 3.2 million
  times (vs. the ~7,000 times you see above).

  What I realized is that I don't need 1.2.1, I need 1.X.1. Basically,
  once I've found the first revision of any branch that descended from 1
  and was merged into mainline, I know that I've found the most recently
  merged branch. (Consider a hypothetical 1.(X+1).1: because it has a
  larger branch counter than X, it must have been merged after 1.X.1,
  and since we load ancestry 'backwards', we would already have
  encountered it.)

  As you can see, that dropped the cost from 3.2M mainline steps down to
  a mere 7k. The same trick holds for NULL_REVISION based merges: I
  don't need to walk the whole ancestry to find all the branches that
  start at NULL, I just have to find the most recent one and start
  numbering one greater than that.
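
  In code, the stop condition is something like this (a sketch with
  hypothetical helpers: 'history' yields the already-imported ancestry
  newest-first, and dotted_revno(rev) returns the dotted revno tuple of
  an imported revision, or None):

    def next_branch_counter(base_revno, history, dotted_revno):
      # Walk the imported ancestry newest-first.  The first revision
      # numbered (base_revno, X, 1) that we see belongs to the most
      # recently merged branch rooted at base_revno: any branch with a
      # higher counter was merged later, so we would have already
      # walked past it.
      for rev in history:
        revno = dotted_revno(rev)
        if (revno is not None and len(revno) == 3
            and revno[0] == base_revno and revno[2] == 1):
          return revno[1] + 1  # e.g. saw 1.3.1, so the new branch is 1.4
      return 1  # nothing merged from base_revno yet, so this is 1.1

  The NULL case is the same with base_revno == 0.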

4) If the only change we made was to add gdfo information to our bzrlib
   cache, that would be a decent addition for the purpose of computing
   'what was merged into X that wasn't merged into Y'. I would guess you
   still load a rather large amount of your revision graph every time,
   but it would seem that that much of the graph has to be loaded
   anyway. (known_children looks pretty good on paper, but it doesn't
   actually scale to log of the whole ancestry. It might do better on
   'log -r -10..-1', where you are less likely to have
   known-but-not-yet-merged children.)
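
   As a rough sketch of how gdfo could drive that computation
   (hypothetical 'parents' and 'gdfo' dicts; bzrlib's
   Graph.find_unique_ancestors does a related walk today, without gdfo):

     import heapq

     def unique_to_x(parents, gdfo, x, y):
       # Revisions in ancestry(x) that are not in ancestry(y).  gdfo
       # lets us expand y's ancestry lazily: a rev can only be an
       # ancestor of revisions with a strictly larger gdfo.
       seen_y = set()
       y_frontier = [(-gdfo[y], y)]  # max-heap on gdfo
       def expand_y_down_to(g):
         # Pop until every pending y-side rev has gdfo < g.
         while y_frontier and -y_frontier[0][0] >= g:
           _, rev = heapq.heappop(y_frontier)
           if rev in seen_y:
             continue
           seen_y.add(rev)
           for p in parents.get(rev, []):
             heapq.heappush(y_frontier, (-gdfo[p], p))
       unique = set()
       pending = [x]
       while pending:
         rev = pending.pop()
         if rev in unique:
           continue
         expand_y_down_to(gdfo[rev])
         if rev in seen_y:
           continue  # also in ancestry(y), and so are its ancestors
         unique.add(rev)
         pending.extend(parents.get(rev, []))
       return unique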

   However, we have a 'lot' of leapfrog merges in bzr.dev. You can see
   it by comparing the first segment of the dotted revno at the tip of
   each merge to the mainline revno it was merged into. (For example,
   bzr.dev 5137 merged 5086.6.2, which is 5137 - 5086 = 51 mainline
   revisions.)

   Counting only unique dotted revnos vs. the mainline, we have an
   *average* spread of 81, a median spread of 11, and a max of 4280.
   Filtering out the NULL based ones, we get:
     avg 71, median 11, max 3016
   I'm guessing the max is stuff like bzr.ab, where it is based on revno
   1551 but a child is merged as late as bzr.dev@4096.
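
   A rough sketch of how that spread can be counted, over the
   newest-first merge_sort nodes ('ms' from the earlier sketch, where
   each depth-0 node precedes the revisions it merged):

     spreads = []
     landed_at = None
     seen_branches = set()
     for node in ms:  # newest first
       if node.merge_depth == 0:
         landed_at = node.revno[0]
         continue
       branch = node.revno[:2]  # e.g. (5086, 6) for 5086.6.2
       if branch in seen_branches:
         continue  # count each branch once, at its most recent tip
       seen_branches.add(branch)
       if node.revno[0] == 0:
         continue  # drop the NULL based ones for the filtered variant
       spreads.append(landed_at - node.revno[0])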

5) Some other interesting bzr.dev numbers. If you take the merge_sort
   graph and record each mainline revision's offset into it:

   # ms (from above) is newest-first; walk it oldest->newest so each
   # merged revision is seen after the mainline revision it branched
   # from.
   forward = list(reversed(ms))
   offsets = {}
   for idx, node in enumerate(forward):
     if node.merge_depth == 0:
       offsets[node.revno[0]] = idx

   You can then walk the merge_sort again, and find out how many
   'revisions' sit between the point where this revision was merged and
   the point where it originated:

   anc_spread = []
   last_main = None
   last_merged = None
   for node in forward:
     if node.merge_depth == 0:
       last_main = node.revno[0]
       continue
     if node.revno[0] == last_merged:
       continue  # collapse runs of nodes sharing the same origin revno
     last_merged = node.revno[0]
     if last_merged == 0:
       continue  # NULL-rooted branches have no mainline origin
     anc_spread.append(offsets[last_main] - offsets[last_merged])

   This has:  avg: 543.4  median: 72 max: 22,289

   Which means that sometimes you have to walk *far* more revisions than
   you actually merged, and on average still a fair number. Given that
   we average only ~6 revisions merged per mainline rev (~30k total
   revs, ~5k mainline), you average a distance of 9x.
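
   (For completeness, the avg/median/max summaries above come out of
   something like:)

     anc_spread.sort()
     print 'avg: %.1f  median: %d  max: %d' % (
       float(sum(anc_spread)) / len(anc_spread),
       anc_spread[len(anc_spread) // 2],
       anc_spread[-1])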

6) I would guess some of this is cultural. If we were strong proponents
   of 'rebase', for example, then you would be much more likely to
   rebase your changes rather than merge bzr.dev, and the base revision
   would likely be much newer.

   I find (4) pretty interesting: that median means we land things about
   10-11 revisions from the point they originated. (So by the time I
   land my current work, other people have generally managed another 10
   revisions.)

   There are other factors, I'm sure: integration branches, times when
   we didn't update bzr.dev before starting a new branch, daggy fixes,
   etc. But this is the *median*, so it is probably fairly
   representative. (An average of 80 is pretty huge.)


Anyway, this is already too long, but I hope people found some of these
insights as interesting as I did.

I'm pretty sure it tells me that while we could change how revnos are
created, I don't think it would have a large impact on performance. And
while gdfo is a decent filtering tool, it still costs a lot.

John
=:->


