What loggerhead needs from Bazaar
Michael Hudson
michael.hudson at canonical.com
Thu May 29 05:01:25 BST 2008
So, I'm starting to spend some time working on loggerhead again, and
have been talking a lot with Martin Albisetti about things we'd like to do.
An obvious place to start is that loggerhead uses a couple of APIs that
were deprecated in 1.4 and are in fact already gone in bzr.dev. One of
these is VersionedFile.annotate_iter, which is simple enough but the
other is get_revision_graph, which leads us on to a more interesting
line of thought...
Currently, loggerhead caches a bunch of information about a branch the
first time it looks at it -- roughly speaking, the merged sorted
revision graph. This involves calling get_revision_graph,
unsurprisingly. Now that get_revision_graph is gone, probably the
pragmatic thing to do is fake it -- copying the _old_get_graph from
bzr.dev into loggerhead.
Ideally, we would like to not have loggerhead do whole-history
operations at all, not even at start up. The main problem with this is
revision numbers -- both mapping revids to revnos, and also the other
way around (in fact, I think that with a little work loggerhead could get
away with get_revision_id_to_revno_map as its only whole-history op, and
as this function has to merge sort the revision graph we may as well
store the merge sorted graph as it makes some other stuff easier). Most
of the other stuff can be done pretty efficiently with graph operations
now, or at any rate soon, given John's rate of progress :)
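To see why revnos are the sticking point, here is a minimal sketch in plain Python (toy data and hypothetical names, not the bzrlib API -- the real entry point would be something like get_revision_id_to_revno_map): assigning even simple mainline revnos means walking the left-hand parent chain all the way back to the origin, which is by nature a whole-history operation.

```python
# Toy revision graph: revid -> list of parent revids (first parent is the
# left-hand parent). All revids here are hypothetical.
graph = {
    "rev-1": [],
    "rev-2": ["rev-1"],
    "rev-2.1.1": ["rev-1"],           # branched from rev-1
    "rev-3": ["rev-2", "rev-2.1.1"],  # merge back into the mainline
}

def mainline_revnos(tip, graph):
    """Assign integer revnos by walking left-hand parents back from the tip.

    This visits every mainline revision, so it touches the whole history.
    """
    line = []
    rev = tip
    while rev is not None:
        line.append(rev)
        parents = graph[rev]
        rev = parents[0] if parents else None
    # The oldest revision gets revno 1.
    return {rev: n for n, rev in enumerate(reversed(line), 1)}

print(mainline_revnos("rev-3", graph))
# {'rev-1': 1, 'rev-2': 2, 'rev-3': 3}
```

Numbering merged revisions (the dotted revnos) needs the merge-sorted graph on top of this, which is why storing that result once makes sense.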
The other bottleneck is answering the question "what files changed in
this revision compared to its left hand parent?" This is approximately
what loggerhead currently does:
    rev_tree1 = b.repository.revision_tree(last)
    rev_tree2 = b.repository.revision_tree(last_but_one)
    delta = rev_tree2.changes_from(rev_tree1)
and it takes 0.5-1 seconds for the launchpad tree (about 5000 files).
As the changelog view has twenty such lists of files on it, this is just
too slow and so this information is cached in a grossly-abused sqlite
database.
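The cache is essentially a memo table keyed by revision id. A stripped-down sketch of the idea (the schema and helper names here are hypothetical, not loggerhead's actual code) shows why it works despite being ugly -- the slow tree-delta computation runs once per revision:

```python
import json
import sqlite3

# Hypothetical per-revision file-list cache, not loggerhead's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filechanges (revid TEXT PRIMARY KEY, files TEXT)")

def cached_files_changed(revid, compute):
    """Return the file list for revid, computing and storing it on a miss."""
    row = conn.execute(
        "SELECT files FROM filechanges WHERE revid = ?", (revid,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    files = compute(revid)  # the slow changes_from() path
    conn.execute("INSERT INTO filechanges VALUES (?, ?)",
                 (revid, json.dumps(files)))
    return files

# The expensive computation runs only once per revision:
calls = []
def slow_compute(revid):
    calls.append(revid)
    return ["README", "setup.py"]

cached_files_changed("rev-3", slow_compute)
cached_files_changed("rev-3", slow_compute)
print(len(calls))  # 1
```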
John pointed out an alternative approach:
    introduced_by_last = g.find_unique_ancestors(
        last, [last_but_one])
    b.repository.fileids_altered_by_revision_ids(introduced_by_last)
But unfortunately this (a) isn't quite right (just because a file
changed between two revisions doesn't mean it's actually different at
the two endpoints) and (b) isn't noticeably faster on Launchpad.
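Point (a) is easy to see with a toy model (plain Python, hypothetical data): if a file is modified on a side branch and then reverted before the merge, it was "altered by" a revision in the unique-ancestor set even though the two endpoint trees agree on its contents.

```python
# Snapshots of each revision's tree: revid -> {path: content}. Toy data.
trees = {
    "last_but_one": {"a.txt": "one", "b.txt": "x"},
    "side-1":       {"a.txt": "one", "b.txt": "y"},  # changes b.txt
    "side-2":       {"a.txt": "one", "b.txt": "x"},  # reverts b.txt
    "last":         {"a.txt": "two", "b.txt": "x"},  # merge; changes a.txt
}

def altered_in(revid, parent):
    """Paths whose content differs between a revision and its parent."""
    return {p for p in trees[revid] if trees[revid][p] != trees[parent][p]}

# The fileids_altered_by_revision_ids-style answer, unioned over the
# revisions unique to "last", over-reports b.txt ...
altered = (altered_in("side-1", "last_but_one")
           | altered_in("side-2", "side-1")
           | altered_in("last", "side-2"))

# ... while the endpoint-to-endpoint delta does not include it.
delta = altered_in("last", "last_but_one")

print(sorted(altered))  # ['a.txt', 'b.txt']
print(sorted(delta))    # ['a.txt']
```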
The big cost here is inventory extraction. Make that 10x faster and the
concerns here will evaporate.
It's noticeable that both the problems above are suffered by log too.
Indeed, if log were really fast, you could implement loggerhead's
changelog view as a log formatter (I don't think this can work today as
log.py recomputes lots of stuff that loggerhead has already computed).
In happier news, it seems that loggerhead's other gross sqlite cache,
the revision cache, is just unnecessary with modern bzr, at least on
pack repositories. So I'll be ripping that out soon.
I guess I'm not really expecting much to change as a result of this
mail, as I think most of the above problems are already known. But it
may be worth pointing out that while we shouldn't be doing whole-history
operations, there are situations where you can't really avoid them.
Cheers,
mwh
More information about the bazaar mailing list