RFC get_revision_graph

John Arbash Meinel john at arbash-meinel.com
Wed Feb 20 22:38:11 GMT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've mentioned in the past that get_revision_graph is one of the major
performance problems. Partly this is because it is a whole-ancestry function.
But there is also another piece. It seems that get_revision_graph is defined to
not include ghosts.

The way it is implemented in packs (ATM) is that for each node, you query the
index for the parents of the node. Then you query the index again for the
parents to see if they are present (so you can remove ghosts). Then you step and
do the same thing again.

Which means that you query the index at least 2x for every node (even more for
nodes that have a lot of children since the parent nodes get queried each time).
Also, the index is written to not cache the parent information, so it requires
bisecting into the in-memory nodes again. We have a CachingParentProvider, but
that works at a higher level.

I thought there might be 2 possible resolutions.

1) Redefine (or create a new api) that gives the revision graph, only it
includes ghosts. That way you don't have to check if parents exist.

2) Push the revision_graph down a level, or push the ghost filtering up a level.
So that when you ask a node for its parents, you don't have to have it also
query its parents. You can do that when you step to the parents and find out
whether they are present or not. It seems like you would need a reverse map, so
you can figure out which nodes you need to go back and prune a parent from.
However, you only need 1 level of that, so that map doesn't ever grow very big.


Just to give some numbers...

Right now "bzr gannonate bzrlib/repository.py" spends 5s in annotating all of
the history of repository.py. And 25s in get_revision_graph(). That makes it 5x
more expensive to get a 15k revision graph versus 2x revisions of a file and
processing all of those to compute annotations. (If I sneak in a buffer_all
request it drops to 3.5s.)

I also was wondering what happened with our desire to change the index
operations from a sorted bisect into a hash map. That should help reduce at
least some of this overhead (plus it allows us to sort the index nodes by
topology, which means a single read to the index is likely to give us a lot of
the nodes we will be requesting in the next search).

Thoughts?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHvKvTJdeBCYSNAAMRAj11AJ4nTTJ17l29udCjWIY3wfS8H+kCfwCglsnJ
Mf5SUApHgQkcPg1duHry/aU=
=at+R
-----END PGP SIGNATURE-----



More information about the bazaar mailing list