Improve performance of tsort.topo_sort

Wed Jul 29 07:57:48 BST 2009

2009/7/29 Maarten Bosmans <mkbosmans at gmail.com>:
> Hi there,
>
> I've been looking into the code in tsort.py. With the modified
> algorithm show below the time to topologically sort the graph for the
> bzr.dev branch (~26000 revisions) went down from 680 to 260 ms.
>
> Note that I didn't modify the TopoSorter class. It's of course
> possible to alter the code there instead of in the topo_sort function,
> but by not using the self._var variables the topo_sort function is a
> bit faster.
>
> With this change there is one testcase (test_tsort_partial) failing.
> But if I understand the purpose of topo_sort correctly, this should
> not be a fail, because the exact ordering of the result list is not
> specified, so both the old and the new behaviour is correct.
>
> Maarten Bosmans

Hi, thanks for looking into this.

Can I suggest that in general you proposed changes by pushing a bzr
branch to Launchpad and then using the 'propose merge' link, because
that way your change is tracked through to completion and not lost -
like filing a bug vs just mentioning it in mail.

The test claims that part of the contract of tsort is that when there
is a tie it returns the items in lexicographical order.  (I guess
actually it means Python object comparison order.)  It is nice to have
the sorts be deterministic from their input rather than depending on
unpredictable behaviour of Python dicts or sets, which would tend to
cause varying behaviour and therefore make any bugs hard to reproduce.
 If giving this up made the code much faster it may be worthwhile, but
it should be a conscious choice.  It may not be too slow to sort
nochildnodes in your code?  (Or eventually we could use a heap or
something.)

The algorithm looks pretty much correct to me.

It is a bit unfortunate to have two implementations of tsort, but
perhaps it's best if the all-at-once once can be much faster than the
incremental implementation.

If being an object causes a significant slowdown perhaps the
iter_topo_sorted generator could be changed into a plain generator,
rather than a generator on an object.

>
>
> === modified file 'bzrlib/tsort.py'
> --- bzrlib/tsort.py     2009-07-27 15:20:32 +0000
> +++ bzrlib/tsort.py     2009-07-29 00:57:50 +0000
> @@ -35,7 +35,39 @@
>
>     node identifiers can be any hashable object, and are typically strings.
>     """
> -    return TopoSorter(graph).sorted()
> +    graph = dict(graph)
> +    node_name_stack = []
> +
> +    # count the number of children for every parent in the graph
> +    nchildren = dict.fromkeys(graph.iterkeys(), 0)
> +    for parents in graph.itervalues():
> +        for parent in parents:
> +            if parent in nchildren:
> +                nchildren[parent] += 1
> +    # keep track of nodes without children
> +    nochildnodes = [node for (node, n) in nchildren.iteritems() if n == 0]
> +
> +    while nochildnodes:
> +        # pick a node without a child and add it to the stack.
> +        node_name = nochildnodes.pop()
> +        node_name_stack.append(node_name)
> +
> +        # the parents of the node loose it as a child; if it was the last
> +        # child, add the parent to the list of childless nodes.

s/loose/lose/

> +        parents = graph.pop(node_name)
> +        for parent in parents:
> +            if parent in nchildren:
> +                nchildren[parent] -= 1
> +                if nchildren[parent] == 0:
> +                    nochildnodes.append(parent)
> +
> +    # if there are still nodes left in the graph,
> +    # that means that there is a cycle
> +    if graph:
> +        raise errors.GraphCycleError(graph)
> +
> +    node_name_stack.reverse()
> +    return node_name_stack
>
>
>  class TopoSorter(object):
>
>

-- 
Martin <http://launchpad.net/~mbp/>