history-db, caching dotted revnos, incremental merge_sort
John Arbash Meinel
john at arbash-meinel.com
Fri Apr 9 23:02:50 BST 2010
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Some updates about the bzr-history-db plugin I've been working on:
lp:~jameinel/+junk/bzr-history-db
I finally got 'incremental merge sort' working, and performing fairly
well. I'm pretty sure I've gotten it down to a minimal amount of
ancestry being read, in order to appropriately compute merge_sort and
the dotted revnos. At a minimum, I know that it is finally faster to do
an incremental merge sort to expand bzr.dev, rather than computing
KG.merge_sort from scratch each time. (now at 2m16s rather than 3m30s,
but initial 'correct' result was all the way at 11min)
Anyway, this also gave me some interesting insights into merge_sort and
dotted_revno behaviors. Which I'll try to share.
1) For bzr.dev, finding the interesting ancestry seems to be more
expensive than computing dotted_revnos. At least from a "how many
times do I have to walk the mainline + merged info."
The breakdown of times to step the mainline pointer are:
701,174 total times
649,184 times because I couldn't tell if the current search tips
could have been merged earlier
32,509 times to ensure that I have a LH ancestor to number from
7,597 times to get the latest NULL root branch count (0,22,1),
etc.
7,171 times to make sure I have the most recent branch count (to
ensure that I get 1.5.1 rather than 1.2.1 when generating a
new branch.) [see also (3)]
4,713 as the first step because I'm importing 4.7k branches (I
know I'll always need to step at least once, so I start with
that)
So yes, there is some overhead to compute dotted revnos. But 93% of
the time we have to walk the mainline to know that the revisions
haven't been merged.
Some of that is some very long-lived integration branches. 2.0 and
2.1 each span about 200 trunk revisions. Not to mention bzr.ab where
Aaron branched off of bzr.dev at 1551 and the most recent merge was at
bzr.dev at 4096 (yeah, spanning 2.5k revisions).
We may be able to do better here, I'm not sure what we can assume.
The breakdown for MySQL followed a similar pattern:
1,865,862 total
1,835,125 'unknown' if merged
17,095 initial steps for 17k branches
13,591 ensure LH
87 find the latest branch counter
2) The #1 way search tips get filtered is after I've walked the mainline
far enough back such that the gdfo tells me it could not have been
merged.
bzr.dev mysql cause
730,354 492,422 gdfo(search_tip) >= gdfo(mainline)
30,358 71,601 found dotted_revno(search_tip) while walking
mainline
35,720 34,231 all known children are already in the
'interesting' set.
6,783 5,597 found dotted_revno(child) while walking mainline
3) I found a very nice trick for computing a new branch counter.
Specifically, consider (time going down):
A 1
|\ |\
| \ | \
| \ | \
| B | 1.1.1 ------.
| /|\ | | \ \
| C E | | 1.1.2 1.2.1 |
|/ / / | / / /
D F H 2 1.2.2 1.3.1
|/ X | / \ /
G / J 3 .------' 1.2.3
|/ /|\ | / / \
I / K L 4 1.2.4 1.4.1
|/ |/ | | /
N M 5 1.2.5
| /
| /
|/
O
If you start by assuming that N is already imported, you can determine
that revisions K L M are new pretty quickly. (one step along the
mainline, and J is obvious. that is the dotted_revno(search_tip)
case.)
However, when it comes to numbering L, you can see that you need a new
branch number, because L is not the first child of J that was merged
into mainline. However, since you've only loaded J, the 'most recently
merged' descendant of 1 that you know about is just 1.2.X. We need to
include more ancestry to know if it is safe to use 1.3.1 (it isn't).
My original thought, was that you should load until you find 1.2.1.
Because once you found that revision, you could be sure that any
descendant of 1 that was merged after the 1.2.* series, would have
also been loaded.
Using that on bzr, ended up causing me to step the mainline 3.2Million
times (vs the 7,000 times you see above.)
What I realized, was that I don't need 1.2.1, I need 1.X.1. Basically,
once I've found any the first revision of any branch that descended
from 1 and was merged into mainline, I know that I've found the
most-recent branch that was merged. (Assume a hypothetical 1.X+1.1,
because it has a larger branch count than X, it must be merged after
1.X.1. Since we load ancestry 'backwards', we would have encountered
it already.)
As you can see, it dropped the cost from 3.2M down to a mere 7k. This
also holds true for NULL_REVISION based merges. I don't need to walk
the whole ancestry to find all the revisions that start at NULL, I
just have to find the most recent, and start numbering one greater
than there.
4) If the only change we made was to add gdfo information into our
bzrlib cache, that would be a decent addition for the purpose of
computing the 'what was merged into X that wasn't merged into Y'. I
would guess you still load a rather large amount of your revision
graph every time, but it would seem that that much of the graph has
to be loaded. (known_children looks pretty good on paper, it doesn't
actually scale to log of the whole ancestry. It might do better on
'log -r -10..-1' where you are less likely to have
known-but-not-yet-merged children.
However, we have a 'lot' of leapfrog merges in bzr.dev. If you just
compare the dotted revno first rev at the tip of each merge, to the
revno it was merged into. (example, bzr.dev 5137 merged 5086.6.2,
which is 5137 - 5086 = 51 mainline revisions)
Counting only unique dotted revnos vs mainline, we have an *average*
spread of 81, and a median spread of 11, max of 4280. Filtering out
the NULL based ones, we get:
avg 71, median 11, max 3016
I'm guessing the max is stuff like bzr.ab, where it is based on revno
1551, but a child is merged as late as bzr.dev at 4096.
5) Some other interesting bzr.dev numbers. If you take the merge_sort
graph, and then find the offset to the mainline:
offsets = {}
for idx, node in enumerate(ms):
if node.merge_depth == 0:
offsets[node.revno[0]] = idx
You can then walk the merge_sort again, and find out how many
'revisions' are merged between the point where this revision was
merged, and where it originated.
anc_spread = []
for node in ms:
if node.merge_depth == 0:
last_main = node.revno[0]
continue
if node.revno[0] == last_merged:
continue
last_merged = node.revno[0]
if last_merged == 0:
continue
anc_spread.append(offsets[last_main] - offsets[last_merged])
This has: avg: 543.4 median: 72 max: 22,289
Which means that sometimes you have to walk *far* more revisions than
you actually merged. And on average, still a fair number.
Given that we average only ~6 revisions merged per mainline rev (~30k
total revs, ~5k mainline). So you average a distance of 9x.
6) I would guess some of this is cultural. If we were strong proponents
of 'rebase', for example, then you would be much more likely to
rebase your changes rather than merging bzr.dev, and the base
revision would be likely to be much newer.
I find (4) to be pretty interesting. That 'median' we land things
about 10-11 revisions from the point they originated. (So when I land
my current work, other people have generally managed another 10
revisions.)
There are other factors, I'm sure. Like integration branches, times
when we didn't update bzr.dev before starting a new branch. Daggy
fixes, etc. But this is the *median* so it is probably fairly
accurate. (An average of 80 is pretty huge.)
Anyway, this is too long anyway, I hope people found some of the
insights as interesting as I did.
I'm pretty sure it tells me that while we could change how revnos are
created, I don't think it can have a large impact on performance. And
while gdfo is a decent filtering tool, it still costs a lot.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAku/pAoACgkQJdeBCYSNAAObjACeKfUoHya7iLBNkBKvXgm6/wWw
czUAoNkIV3mPIm4dQgMtL5cx8epIJ3zf
=/taS
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list