[MERGE] Improvements to find_unique_ancestors()

Tue May 13 23:32:10 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Attached is a patch which updates the find_unique_ancestors() logic.

It adds a little bit more debugging info when running -Dgraph.

The big changes to the algorithm are:

1) Once the initial "unique" nodes are found, when we find nodes that are common
to all searchers, aggregate the search into a single searcher, rather than many.
(We can stop the common searcher when its search tips are common to all "unique"
searchers.) If you have 10 searchers, when they get to common nodes, this means
you end up with only 1 searcher.

2) Find overlaps between the unique searchers and already searched nodes. Most
of the time, these will be searching at the same tip. However, when you find
commonality, this lets you jump right away. Which helps accelerate getting all
the searchers to the same nodes. (And thus collapsing a bit faster.) This is put
into a local function so it shows up in --lsprof, it isn't called a lot so it
shouldn't impact performance. (In my testing, we can cover almost the whole
bzr.dev tree in about 200+ steps because of all the shortcuts, I guess.)

3) For the all_unique searcher in (1). This represents the very tip of our
search. It tends to quickly expand to the whole tree (often I'm searching 200+
nodes per step.). However, as it is the tip, we don't really need to step it as
often as the rest of the searchers.

Basically, we are waiting for the "common_searcher" to catch up to the
all_unique one. What I was finding is that all_unique quickly went past common,
and grabbed all of the tree before common caught up. (Because it is searching
very "wide" at a time, while common only has a few nodes.)

4) When "unique" searchers have converged on the same search tips, collapse them
into a single searcher with the intersection of their ancestry.

The biggest wins are probably (1) and (4). Basically trying to decrease the
number of unique searchers as quickly as possible.

For a performance comparison...

If I run a graph search on the last 100 merge nodes in bzr.dev, it takes an
average of 320ms to find the unique nodes. With this code, it takes an average
of 115ms. So it is approx 2x as fast.

My next patches will be trying to get this stuff merged into other functions
like "bzr missing", etc. Though I haven't tuned find_difference() nearly as much
as find_unique_ancestors.

John
=:->

PS> I looked at the Git algorithm. Basically, they search the graph by coloring
nodes. They don't stop the searchers when they are seen as UNINTERESTING.
Because "they want the maximal UNINTERESTING nodes". What they *do* is sort the
search by timestamp. So that newer nodes are stepped before older nodes. (I'm
guessing they stop searching when all tips are UNINTERESTING, but I'm not positive.)
So if your timestamps are reasonable, this actually works well. You basically
need a combination of a complicated graph, and timestamps that go in the wrong
direction to confuse their algorithm. Which means that it *can* get the wrong
answer, but I have the feeling that happens infrequently. We *might* want to
take something like this, as it is similar to the "Greatest Distance From
Origin" or Graph Height ideas. At the expense that you might get some slightly
incorrect answers if someones clock is set years into the future *and* you get
lots of "mesh-style" merging.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkgqFukACgkQJdeBCYSNAANVPwCgoCO9UgCzfiWW8LfS99vw7Lgt
crcAoIXyRQo446+82UAgB03bH0QfCCBA
=Mpqj
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: find_unique_revisit.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20080513/65f7ca9d/attachment-0001.diff