[PATCH] more work on merge
John A Meinel
john at arbash-meinel.com
Tue Jul 12 20:59:40 BST 2005
Aaron Bentley wrote:
> John A Meinel wrote:
>
...
>
> Well, let's look at the utility first, then look at the practicality.
> Are there cases where shortest longest-path gives a bad result?
>
I think the idea of SPF with weighted distances is more intriguing, but
both are probably just different ways of having a computer automate
something that is intrinsically abstract. Only a human really assigns a
meaning to a merge, and can handle what a conflict means. I would guess
both algorithms would have their own pathological cases.
Probably the only one without pathological cases is your brute-force
approach: try all reasonable bases, and pick the one that conflicts the
least.
>
>>>See my earlier post where
>>>with a simple 13 revisions, I generated 60 that needed to be searched.
>>>If I find a longer route to a revision, then I need to update all routes
>>>out of that revision, since there is now a longer route as well.
>
>
> Okay, so one approach is to keep a tree in which each node is affixed
> only to its most-distant descendant.
>
> So as you're iterating through history breadth-first, each time you hit
> an already-seen node, you steal it and make it an ancestor of your node
> only. (You know that the current node is farther than or equal to the
> previous one, since you're going breadth-first).
>
> Since you've already seen this node, you don't need to traverse it.
> Once the tree is completely built, you do another traversal to find the
> distance values. This should scale O(n) with the number of nodes in the
> tree.
As discussed on IRC, I think this is a nice way of building up the distances.
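Something like this is how I picture it, as a rough sketch (not tested;
get_parents() here is just a stand-in for however we look up a revision's
parents, not a real bzrlib call):

from collections import deque

def longest_distances(tip_rev_id, get_parents):
    # Tree in which each revision hangs off the most-distant
    # descendant that has reached it so far.
    hangs_off = {tip_rev_id: None}
    children_of = {tip_rev_id: []}
    queue = deque([tip_rev_id])
    while queue:
        rev_id = queue.popleft()
        for par in get_parents(rev_id):
            if par in hangs_off:
                # Already seen: steal it, so it hangs off the
                # current (farther or equally far) node only.
                children_of[hangs_off[par]].remove(par)
            else:
                children_of[par] = []
                queue.append(par)
            hangs_off[par] = rev_id
            children_of[rev_id].append(par)
    # Second traversal over the finished tree assigns the distances.
    distances = {tip_rev_id: 0}
    stack = [tip_rev_id]
    while stack:
        rev_id = stack.pop()
        for child in children_of[rev_id]:
            distances[child] = distances[rev_id] + 1
            stack.append(child)
    return distances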
>
>
>>>The only thing I can think of is that you could use BFS, and then if you
>>>encounter a node a second time, you immediately update any child entries
>>>that exist in the revision=>distance dictionary.
>
>
> I believe the scaling on this is bad, because you have to update all
> children every time you find a longer path. However, the constant is
> probably small.
>
Yes, my idea sucked :)
>
>>>>Oh, but you can generate the inventory from the changeset and its base,
>>>>right? Or just install the changeset's revision first?
>>>
>>>
>>>I can generate the *last* inventory XML in a multi-revision rollup. And
>>>I can generate all intermediate revision XML entries.
>
>
> I am highly suspicious of the notion of storing just the revision XML
> for a revision. I like the idea that either you have all the data for a
> revision, or you don't have a revision entry for it.
>
It just depends on what becomes *important* information. I liked being able
to send a nice human-readable cset which might correspond to many
commits, with a lot of potentially conflicting changes.
And you asked that they be interchangeable with direct merging of
branches. And to date we have only needed the revision XML to determine
ancestry. (Right now in the bzr.dev tree we don't even have that
everywhere; we have a few (3) dangling pointers.)
>
>>>If you wanted to generate intermediate inventory XML, then you need to
>>>include intermediate patches. Which negates the advantage of a roll-up.
>>>If you have to have all ancestry inventory XML, then just send a series
>>>of changesets.
>
>
> More and more, I think that a series of changesets is more likely to be
> useful than a roll-up. Martin and Linus both like a series of small
> patches rather than one big one, and I suspect that is common.
>
But I think both of them would prefer a *clean* small changeset.
I suppose in the worst case you do your dev on a throwaway branch, get it
right, then merge it into a different branch, which is what you send.
>
>>>>Extracting texts from the current storage format doesn't seem very
>>>>expensive to me. We're only talking about modified files, and heck, we
>>>>can optimize by only grabbing one copy of each text.
>>>
>>>
>>>It isn't expensive to get the text, but you need to get the difference
>>>from its previous version, so it means running diff for each modified
>>>file in the history.
>
>
> I don't see this, or I'm reading it wrong. We only need to compare the
> leading candidates-- those that are not ancestors of other candidates.
> In practice, I expect this is two or three, but frequently one. And we
> only need to compare files that differ between THIS and OTHER, which is
> usually a small number.
>
> So total comparisons is
> len((this, other)) * len(leading_candidates) * differing_files(this, other)
>
> This may frequently resolve to 4. (2 leading candidates, 1 file changed)
Except when you go to merge an unmergeable tree, and you walk back all
of history just to find that there is nothing in common.
The question is how you determine *which* revisions the leading candidates
are. I guess with SPF you just start going back from both tips until one of
them reaches a revision already found by the other. I guess if you view it
as growing the tree from both sides, you should get the minimum-cost
distance, something like:
from heapq import heappush, heappop, heapify

def find_best_base(target_rev_id, base_rev_id, revision_source):
    queue = [(0, 'left', target_rev_id), (0, 'right', base_rev_id)]
    heapify(queue)
    revs_left = {target_rev_id: (0, None)}
    revs_right = {base_rev_id: (0, None)}
    best_rev_id = None
    while queue:
        dist, side, rev_id = heappop(queue)
        if ((side == 'left' and rev_id in revs_right)
                or (side == 'right' and rev_id in revs_left)):
            best_rev_id = rev_id
            break
        if side == 'left':
            revs = revs_left
        else:
            revs = revs_right
        rev = revision_source.get(rev_id)
        for par in rev.parents:
            dist_par = dist + compute_distance(rev, par)
            if par not in revs or revs[par][0] > dist_par:
                # A parent can be reached through more than one child,
                # so keep whichever path is shorter.
                revs[par] = (dist_par, rev_id)
                heappush(queue, (dist_par, side, par))
    if best_rev_id is None:
        raise BzrError('No common revisions.')
    # Now we have a best revision from both sides, and the child
    # pointers to get us back to each head (I don't know if that is
    # necessary)
    left_path = []
    cur_rev = best_rev_id
    while cur_rev is not None:
        left_path.append(cur_rev)
        cur_rev = revs_left[cur_rev][1]
        # We could check to make sure distance is
        # monotonically decreasing
    right_path = []
    cur_rev = best_rev_id
    while cur_rev is not None:
        right_path.append(cur_rev)
        cur_rev = revs_right[cur_rev][1]
    return best_rev_id, left_path, right_path
>
> Aaron
I don't know if keeping the paths helps us, but they are easy to hang on
to as you build, and expensive to generate if you don't.
At this point compute_distance() is an arbitrary function. If you make it
always return 1, then you have BFS starting from both tips. You could have
it check the number of changed files, or lines modified, or whatever you
want. It could even do a simple "subtract the text_size" as an
approximation to the volume of change. (Not perfect, but it is directly
available in the inventory XML without touching the text_store). We
still have to diff the inventory XML, and currently just plain reading
the XML is one of the more expensive operations we have. (Though you
could write a custom diff function which could read the XML directly
without parsing a tree, and still have optimized knowledge about how the
tree can change to do a faster diff than 'diff'.)
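Just to illustrate, a text_size-based compute_distance() could look
something like this (only a sketch; get_inventory() and the entry
attributes here are stand-ins, not necessarily the real bzrlib calls, and
it assumes every entry carries a text_size):

def compute_distance(rev, par_rev_id):
    # Approximate the volume of change between a revision and one of
    # its parents by comparing text_size entries from the inventories.
    inv_rev = get_inventory(rev.revision_id)
    inv_par = get_inventory(par_rev_id)
    dist = 0
    for file_id, entry in inv_rev.items():
        par_entry = inv_par.get(file_id)
        if par_entry is None:
            dist += entry.text_size                       # file added
        elif entry.text_sha1 != par_entry.text_sha1:
            # Same file, different text: use the size delta, but count
            # at least 1 so a change is never free.
            dist += abs(entry.text_size - par_entry.text_size) or 1
    for file_id, par_entry in inv_par.items():
        if file_id not in inv_rev:
            dist += par_entry.text_size                   # file removed
    return max(dist, 1)   # every step costs at least something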
Anyway, the above code does reasonably well. Depending on how expensive
compute_distance() is, it can take pretty long if you have a very big
tree and your common ancestry is a while ago.
But it uses SPF to shortcut having to traverse the entire history of
both trees.
And it gives you the shortest path back to the common ancestor.
If you want to change it into shortest longest path, then you could do:
queue = [(0, 'left', target_rev_id), (0, 'right', base_rev_id)]
revs_left = {target_rev_id: (0, None)}
revs_right = {base_rev_id: (0, None)}
best_revs = []
while queue:
    dist, side, rev_id = queue.pop(0)
    if ((side == 'left' and rev_id in revs_right)
            or (side == 'right' and rev_id in revs_left)):
        best_revs.append(rev_id)
        continue
    if side == 'left':
        revs = revs_left
    else:
        revs = revs_right
    rev = revision_source.get(rev_id)
    for par in rev.parents:
        dist_par = dist + 1
        if par not in revs:
            queue.append((dist_par, side, par))
            revs[par] = (dist_par, rev_id)
best_rev_dist = [(revs_left[rev_id][0] + revs_right[rev_id][0], rev_id)
                 for rev_id in best_revs]
best_rev_dist.sort()
...
At this point you have a list of candidate revisions along with their
total distance between the two tips, and 2 dictionaries containing all the
back-pointers that you need.
It may still trace most of the ancestry on one side, since it won't
discover that the other side has those revisions yet. But as soon as you
find that the other side has a node, you don't have to search its
parents anymore.
You could add a small backtracking search that looks at, say, at most 10
children; if any of those have already been found by the other side, then
stop searching the parent. I can't say if this would be a net performance
gain or not, but it would handle the case where you are at the tip of the
ancestry, the other tree has stopped searching, and you just passed them by.
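Very roughly, and only as one way to read that idea, the check could sit
in the 'for par in rev.parents' loop above, walking back up our own
back-pointer chain a few steps to see whether the other side has already
been through this area (revs and other_revs are the two
{rev_id: (dist, child_rev_id)} dictionaries from the loop):

def other_side_nearby(rev_id, revs, other_revs, limit=10):
    # Walk at most 'limit' steps back toward our own tip, following the
    # child back-pointers, and report whether the other side has
    # already found any of those revisions.
    cur = rev_id
    for _ in range(limit):
        if cur is None:
            break
        if cur in other_revs:
            return True
        cur = revs[cur][1]
    return False

and then in the loop, before queueing a parent:

        if par not in revs:
            other_revs = revs_right if side == 'left' else revs_left
            if other_side_nearby(rev_id, revs, other_revs):
                continue    # the other side already covers this region
            queue.append((dist_par, side, par))
            revs[par] = (dist_par, rev_id)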
If we are optimizing for the case where we expect the trees to have
ancestry in common, you could search all the way back to None. But I would
like the performance to be decent in worst-case scenarios where someone
tries to merge 2 trees, each with 1000 revisions in them, and no common
ancestry.
It doesn't have to be super-fast, but a simple "enumerate all revisions"
could tell you that the set intersection is empty pretty fast. And it
would prevent the upset of "I said merge, and it searched for 20 minutes
just to say not possible".
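Something as simple as this would do it (get_ancestry() is just a
stand-in here for whatever enumerates every revision id reachable from a
branch tip):

# Cheap pre-check before doing any graph searching at all.
left_ancestry = set(get_ancestry(target_rev_id))
right_ancestry = set(get_ancestry(base_rev_id))
if not left_ancestry & right_ancestry:
    raise BzrError('No common revisions; nothing to merge against.')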
I think some number < 10 would help the search to stop, without adding
too much overhead.
John
=:->