What loggerhead needs from Bazaar
Michael Hudson
michael.hudson at canonical.com
Thu May 29 05:01:25 BST 2008
So, I'm starting to spend some time working on loggerhead again, and
have been talking a lot with Martin Albisetti about things we'd like to do.
An obvious place to start is that loggerhead uses a couple of APIs that
were deprecated in 1.4 and are in fact already gone in bzr.dev. One of
these is VersionedFile.annotate_iter, which is simple enough but the
other is get_revision_graph, which leads us on to a more interesting
line of thought...
Currently, loggerhead caches a bunch of information about a branch the
first time it looks at it -- roughly speaking, the merged sorted
revision graph. This involves calling get_revision_graph,
unsurprisingly. Now that get_revision_graph is gone, probably the
pragmatic thing to do is fake it -- copying the _old_get_graph from
bzr.dev into loggerhead.
Ideally, we would like to not have loggerhead do whole-history
operations at all, not even at start up. The main problem with this is
revision numbers -- both mapping revids to revnos, and also the other
way around (in fact, I think that with a little work loggerhead could get
away with get_revision_id_to_revno_map as its only whole-history op, and
as this function has to merge sort the revision graph we may as well
store the merge sorted graph as it makes some other stuff easier). Most
of the other stuff can be done pretty efficiently with graph operations
now, or at any rate soon, given John's rate of progress :)
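To see why revnos are the sticking point, here is a minimal sketch in plain Python (toy data and hypothetical names, not the bzrlib API -- the real entry point would be something like get_revision_id_to_revno_map): assigning even simple mainline revnos means walking the left-hand parent chain all the way back to the origin, which is by nature a whole-history operation.

```python
# Toy revision graph: revid -> list of parent revids (first parent is the
# left-hand parent). All revids here are hypothetical.
graph = {
    "rev-1": [],
    "rev-2": ["rev-1"],
    "rev-2.1.1": ["rev-1"],           # branched from rev-1
    "rev-3": ["rev-2", "rev-2.1.1"],  # merge back into the mainline
}

def mainline_revnos(tip, graph):
    """Assign integer revnos by walking left-hand parents back from the tip.

    This visits every mainline revision, so it touches the whole history.
    """
    line = []
    rev = tip
    while rev is not None:
        line.append(rev)
        parents = graph[rev]
        rev = parents[0] if parents else None
    # The oldest revision gets revno 1.
    return {rev: n for n, rev in enumerate(reversed(line), 1)}

print(mainline_revnos("rev-3", graph))
# {'rev-1': 1, 'rev-2': 2, 'rev-3': 3}
```

Numbering merged revisions (the dotted revnos) needs the merge-sorted graph on top of this, which is why storing that result once makes sense.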
The other bottleneck is answering the question "what files changed in
this revision compared to its left hand parent?" This is approximately
what loggerhead currently does:
    rev_tree1 = b.repository.revision_tree(last)
    rev_tree2 = b.repository.revision_tree(last_but_one)
    delta = rev_tree2.changes_from(rev_tree1)
and it takes 0.5-1 seconds for the launchpad tree (about 5000 files).
As the changelog view has twenty such lists of files on it, this is just
too slow and so this information is cached in a grossly-abused sqlite
database.
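The cache is essentially a memo table keyed by revision id. A stripped-down sketch of the idea (the schema and helper names here are hypothetical, not loggerhead's actual code) shows why it works despite being ugly -- the slow tree-delta computation runs once per revision:

```python
import json
import sqlite3

# Hypothetical per-revision file-list cache, not loggerhead's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filechanges (revid TEXT PRIMARY KEY, files TEXT)")

def cached_files_changed(revid, compute):
    """Return the file list for revid, computing and storing it on a miss."""
    row = conn.execute(
        "SELECT files FROM filechanges WHERE revid = ?", (revid,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    files = compute(revid)  # the slow changes_from() path
    conn.execute("INSERT INTO filechanges VALUES (?, ?)",
                 (revid, json.dumps(files)))
    return files

# The expensive computation runs only once per revision:
calls = []
def slow_compute(revid):
    calls.append(revid)
    return ["README", "setup.py"]

cached_files_changed("rev-3", slow_compute)
cached_files_changed("rev-3", slow_compute)
print(len(calls))  # 1
```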
John pointed out an alternative approach:
    introduced_by_last = g.find_unique_ancestors(
        last, [last_but_one])
    b.repository.fileids_altered_by_revision_ids(introduced_by_last)
But unfortunately this (a) isn't quite right (just because a file
changed between two revisions doesn't mean it's actually different at
the two endpoints) and (b) isn't noticeably faster on Launchpad.
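Point (a) is easy to see with a toy model (plain Python, hypothetical data): if a file is modified on a side branch and then reverted before the merge, it was "altered by" a revision in the unique-ancestor set even though the two endpoint trees agree on its contents.

```python
# Snapshots of each revision's tree: revid -> {path: content}. Toy data.
trees = {
    "last_but_one": {"a.txt": "one", "b.txt": "x"},
    "side-1":       {"a.txt": "one", "b.txt": "y"},  # changes b.txt
    "side-2":       {"a.txt": "one", "b.txt": "x"},  # reverts b.txt
    "last":         {"a.txt": "two", "b.txt": "x"},  # merge; changes a.txt
}

def altered_in(revid, parent):
    """Paths whose content differs between a revision and its parent."""
    return {p for p in trees[revid] if trees[revid][p] != trees[parent][p]}

# The fileids_altered_by_revision_ids-style answer, unioned over the
# revisions unique to "last", over-reports b.txt ...
altered = (altered_in("side-1", "last_but_one")
           | altered_in("side-2", "side-1")
           | altered_in("last", "side-2"))

# ... while the endpoint-to-endpoint delta does not include it.
delta = altered_in("last", "last_but_one")

print(sorted(altered))  # ['a.txt', 'b.txt']
print(sorted(delta))    # ['a.txt']
```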
The big cost here is inventory extraction. Make that 10x faster and the
concerns here will evaporate.
It's noticeable that both the problems above are suffered by log too.
Indeed, if log were really fast, you could implement loggerhead's
changelog view as a log formatter (I don't think this can work today as
log.py recomputes lots of stuff that loggerhead has already computed).
In happier news, it seems that loggerhead's other gross sqlite cache,
the revision cache, is just unnecessary with modern bzr, at least on
pack repositories. So I'll be ripping that out soon.
I guess I'm not really expecting much to change as a result of this
mail, as I think most of the above problems are already known. But it
may be worth pointing out that while we shouldn't be doing whole-history
operations, there are situations where you can't really avoid them.
Cheers,
mwh
More information about the bazaar mailing list