find_difference shortcut failure without walking out

John Arbash Meinel john at arbash-meinel.com
Thu Dec 13 01:39:18 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> John Arbash Meinel wrote:
>> So I did some testing of my version of the code on our bzr.dev history. And I
>> found that it still fails under certain conditions. This is a graph which shows it.
> 
>> NULL_REVISION
>>     |
>>     a
>>     |
>>     b
>>     |
>>     c
>>     |
>>     d
>>     |\
>>     e f
>>     | |\
>>     | g h
>>     |/| |
>>     i j |
>>     | | |
>>     | k |
>>     | | |
>>     | l |
>>     |/|/
>>     m n
> 
> 

So my change to walk both sides fixed the above case. The case it didn't fix
was this one:

NULL_REVISION
    |
    a
    |
    b
    |
    c
    |
    d
    |\
    e f
    |\|\
    | g h
    | | |
    i j |
    | | |
    | k |
    | | |
    | l |
    |/|/
    m n

The subtle difference between these two graphs is that g merges from e, not i
merges from g.

The problem is that 'l' is seen as common, and both get to 'd' very quickly.

The current algorithm will give you uncommon ancestors of:
(m, i, e) and (n, h, f)

Which should be (m, i), and (n, h), also I'm attaching another real-world graph
to demonstrate one of the other issues.

So far, all of these cases in bzr.dev can be 'solved' by just continuing to
walk for another few (so far 4 seems to be sufficient) common nodes.

The good of this algorithm is that the returned "uncommon" ancestors are always
a superset of the real uncommon ones. (There may still be common ancestors in
the set, but you know you aren't missing any uncommon ones.)

The problem I'm demonstrating in the attached graph, is that in real-world, you
don't get nice little graphs that converge to a single line. They end up
blossoming outward, pulling in lots more nodes as they go.


The only algorithm I can think of off-hand, is the Greatest Distance From
Origin that Aaron and Robert talked about in London. Basically, every node is:
max(parent_distances) + 1.

Because the distance is always increasing, you know that if 1 node has a lower
GDFO, then it cannot be a child of a node with a higher GDFO. Without ghosts,
GDFO is well defined for every node at commit time, and never changes. Which
makes it something easy to memoize.

It would be fairly easy to fit it into another index-like structure, or with
the existing structure.

And then when doing something like "find_differences" you would just keep
iterating until all common search nodes have a lower GDFO than all uncommon nodes.

Without memoizing the values, though, computing one is an O(history) action.

Which brings up another question. When we want to perform operations, are there
some that we are okay with getting a larger set of "uncommon" than pure truth?

I tried to look at the git code, to see how a different system does it. They
don't really have a "find_differences" code, they seem to have a "merge_bases"
code which keeps a list of interesting nodes, and then searches through the
parents. It does something about inserting nodes based on their date, but that
may just declare which nodes get searched first, not whether on not nodes are
searched at all.

Interestingly, all objects have a single instance in memory (they are added to
a hash table based on their sha1). Which means that when 'merge_bases' mutates
the entries to mark them with PARENT1 or PARENT2 it shares the objects.

Anyway, it seems to just recurse through all parents marking them based on what
node the started from. And when it finds a node with both parents, it stops
searching further. Which is basically what our code does for
'border_ancestors'. However, it isn't trying to find the correct set difference.

So I have the feeling that "git log" with crazy parents as I have pointed out
here, will give you the extra nodes.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHYI1GJdeBCYSNAAMRApwHAJ4qqaXHMir6uoKe84gHKjrd2+6xOQCgwY7l
SIWUDDWBCQhrE4mGLQX0/s4=
=tn4a
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test2.png
Type: image/png
Size: 49975 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071212/12913533/attachment-0001.png 


More information about the bazaar mailing list