[MERGE] faster topo_sort
Robert Collins
robertc at robertcollins.net
Fri Feb 24 20:09:40 GMT 2006
On Sat, 2006-02-25 at 03:48 +1100, Martin Pool wrote:
> On 25 Feb 2006, Robert Collins <robertc at robertcollins.net> wrote:
> > On Sat, 2006-02-25 at 00:13 +1100, Robert Collins wrote:
> > > So - really - I'd like to replace the current topo_sort routine with
> > > this one.
> >
> > In further news here is another edition of this patch, with the
> > topological_sort in a function and with the tsort tests running against
> > it... they pass.
>
> Is there any reason not to just actually replace it in tsort.py? The
> current one is as you say a bit simpleminded.
Well, I wanted your OK first: there is a significant difference between
this and tsort.py - this one is not stable.
tsort returns all depth-0 nodes first, in sorted order, then all depth-1
nodes in sorted order, and so on - roughly like the sketch below.
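(A rough sketch of that stable, depth-ordered behaviour - not the actual
tsort.py code, and the node names are just made up for illustration:)

def stable_topo_sort(graph):
    # remaining parents for every node; nodes with none left are 'ready'
    remaining = dict((node, set(parents))
                     for node, parents in graph.items())
    result = []
    while remaining:
        # emit each depth layer in sorted order, so output is stable
        ready = sorted(node for node, parents in remaining.items()
                       if not parents)
        if not ready:
            raise ValueError('graph contains a cycle')
        result.extend(ready)
        for node in ready:
            del remaining[node]
        for parents in remaining.values():
            parents.difference_update(ready)
    return result

print stable_topo_sort({'A': [], 'B': ['A'], 'C': ['B'], 'D': ['B']})
# always ['A', 'B', 'C', 'D'], whatever the dict layout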
This one returns all nodes before the nodes that depend on them, but in
no particular order beyond that: different versions of Python or
different platforms will create different hash layouts, which affects
the order in which we pick seed nodes to search the graph from. We start
with a cloud of nodes, pick an arbitrary one, and emit everything it
references followed by the node itself.
This is an important difference: the output from this sort when you have
one extra node that depends on part of the graph is not predictable.
Imagine the graph B:A, C:B - this will always emit as [A, B, C]. If we
add a node D:B, it may emit as [A, B, D, C] or [A, B, C, D] on two
different machines, or with two different Python binaries. [Actually,
right now I'm not aware of specific cases where this will happen, but
the ordering of iterating over/popping items in a dict is not part of
the contract, so we have to assume it will happen.]
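(Again a rough sketch rather than the actual patch, just to illustrate
the D:B example: each node is emitted only after everything it
references, but the seed choice comes from popitem() and so depends on
dict layout:)

def rough_topo_sort(graph):
    graph = dict(graph)   # consume a copy; each node leaves it exactly once
    result = []
    while graph:
        # seed choice depends on dict layout, hence the instability
        node, parents = graph.popitem()
        pending = [(node, parents)]
        while pending:
            node, parents = pending[-1]
            unemitted = [p for p in parents if p in graph]
            if unemitted:
                parent = unemitted[0]
                pending.append((parent, graph.pop(parent)))
            else:
                # everything this node references is already out; emit it
                pending.pop()
                result.append(node)
    return result

print rough_topo_sort({'A': [], 'B': ['A'], 'C': ['B'], 'D': ['B']})
# always A before B, and B before both C and D, but C/D order can vary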
While both pass the topo_sort tests, I don't know whether you consider
this behaviour suitable for the general reweave application that
topo_sort is used for. FWIW it looks like it will be fine to me.
> > +"""Plugin to fix some potential data errors in a branch.
>
> Not a plugin anymore.
Doh
> I wonder if we should save a nice word like "reconcile" for a more
> common or more user-oriented operation, and call this "fix-graph" or
> something?
This will be part of one multi-faceted repair and reconciliation
operation: root-id analysis between branches, possibly an inventory
cross-check between branches for problems like the ones your branch and
Aaron's encountered, and this current glitch. But I don't really care
about the name ;).
> > +from bzrlib.weavefile import write_weave_v5 as w5
>
> Really a good idea? It doesn't seem to be used anyhow.
Legacy import; the code I adapted this from in bzrtools used the
primitives rather than the higher-level APIs we have today.
> > +class TopoSorter(object):
> > +
> > + def sort(self, graph):
>
> If this is intended to be used only internally by topological_sort(),
> then it should have a private name. Otherwise (or perhaps anyhow) it
> needs docstrings explaining what is supposed to be passed in by graph
> and how the result is returned.
Yup, will do. It's meant to be public.
> Maybe it would be better expressed as nested functions within
> topological_sort acting on local variables? (Maybe this is just
> personal taste but this thing does not seem to feel like a proper
> object -- e.g. in essentially returning a value by setting private
> fields.) I think in Python if something takes some inputs and returns a
> value it is clearer to just make it a function, all other things being
> equal.
I'm not stuck on it being a class. I factored it out from the larger
Reconciler class, where it started life as methods on that object. I
think nested functions should be reserved for currying and the like -
objects are easier to factor and debug.
> Is this code actually specific to operating on revisions or not? If not
> it's fine to still use revision terms to make the explanation concrete
> but it should be stated.
I will do an overhaul with the above in mind and make it clearer.
> > + while self._rev_graph:
> > + rev_id, parents = self._rev_graph.iteritems().next()
>
> This seems to construct an iterator, pull one value, and then discard
> it, which seems a bit ugly/wasteful. Would it work to just iterate
> through it in place of the while loop?
No - we don't want to process entries we have already processed, or we
can get this sort of thing:
>>> d = {1:2, 2:3}
>>> for p in d.items():
... print p
... d.pop(2)
... d.pop(1)
...
(1, 2)
3
2
(2, 3)
Traceback (most recent call last):
File "<stdin>", line 3, in ?
KeyError: 'pop(): dictionary is empty'
Ah, found what I need: there is a popitem(). Dictionaries have to be
special, apparently.
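(So the loop becomes something like this - a sketch only, with made-up
rev ids, not the code from the patch:)

rev_graph = {'rev-1': [], 'rev-2': ['rev-1'], 'rev-3': ['rev-2']}
while rev_graph:
    # popitem() picks an entry and removes it in one step, so nothing is
    # processed twice and we never mutate a dict we are iterating over
    rev_id, parents = rev_graph.popitem()
    print rev_id, parents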
> > === added file 'bzrlib/tests/blackbox/test_reconcile.py'
> > --- /dev/null
> > +++ bzrlib/tests/blackbox/test_reconcile.py
> > @@ -0,0 +1,51 @@
> > +# Copyright (C) 2006 by Canonical Ltd
> > +# Authors: Robert Collins <robert.collins at canonical.com>
>
> We don't normally put these in but rather rely on bzr annotations. Do
> you strongly want it?
Hmm, I've been putting it in on all files, and the test sampler we've
been telling people to use demonstrates it too. A separate thread has
been spawned for this; it has come up in various places before.
I'm doing an updated patch now with your and John's comments
incorporated.
Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.