Loggerhead directions

Thu Apr 15 08:46:44 BST 2010

On 15/04/10 11:54, John Arbash Meinel wrote:

>> I'm curious though about how stacked branches
>> play in with this? Can you chain the caches so that the cache for a
>> stacked branch gets deleted if and when a stacked branch is deleted?
>>
>
> So I just have 1 cache. Shared by all branches you want to put in it.
> Without chaining. I haven't really worked out how to clean data out of
> the cache. Certainly it is possible, but I don't see it being a big deal.
>
> Right now, the cache is a local-only thing, based purely on the revision
> graph. The location of the cache is a branch config, but I would expect
> people to either set it up as a global, or possibly per repo, etc.

Cool. I can imagine having one cache per project in LP and one cache per 
repository locally.

> The main problem is that we instantiate a new Branch instance *per
> request*. Which means that any caching I assume to happen on or under
> Branch won't persist between HTTP requests.
>
> So far, I've just gone via the bzrlib apis you added recently
> (dotted_revno_to_revision_id, iter_merge_sorted_revisions, etc.) I
> haven't quite worked out if they are enough. But the *big* issue is that
> you have 0 caching between requests.

In the historycache plugin, my solution was to serialise the cache to 
disk. The next bzr command that required it would load it - it didn't 
need to merge-sort the graph and assign revnos again.
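
To make the idea concrete, here is a minimal sketch of that kind of
on-disk cache. It is not the historycache plugin's actual code or file
format: the function names, the JSON format, and the
{revision-id: dotted-revno} layout are all assumptions made for
illustration.

```python
import json
import os
import tempfile


def save_revno_cache(path, revno_map):
    """Write a {revision_id: dotted_revno} map to disk.

    Hypothetical helper: writes to a temp file in the same directory,
    then renames it over the target so readers never see a partial file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(revno_map, f)
    os.replace(tmp_path, path)


def load_revno_cache(path):
    """Return the cached map, or None if no cache has been written yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

A caller would try load_revno_cache() first and only merge-sort the
graph (then save the result) when it returns None.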

> loggerhead is in a bit of a pickle, trying to stay stateless and yet
> handle cache state... I don't have a great answer here.

It sounds like the sort of thing memcached was designed to solve. Is it 
worth considering?

>> Why do I say that? I suspect 90% of projects on any hosting site are
>> small to medium in size. They *may* benefit from history-db but
>> Loggerhead ought to perform fine for those branches without it. By
>> sticking with bzrlib APIs, we can selectively enable history-db only on
>> large projects, at least until a descendant of it makes it into the core.
>
> What are you defining as 'medium'? Bzr itself is now 30k revisions and
> 1k files.

I'd call bzr medium. Any project with 5k items in the tree or 50k
revisions in history is large IMO. Having said that, I suspect large
histories will be commonplace in a few years, given the rate of
adoption of DVCS technology. Perhaps bzr will still feel medium when
it reaches 50k revisions?

> Also note that in loggerhead trunk, viewing the 'trunk' branch of emacs
> (when cached) takes say 700ms, but consumes 120MB of RAM. My history-db
> branch can do the work in say 450ms, and consume only 30MB of RAM.
> (Note, viewing 2 emacs branches only goes up to ~140MB, as a lot of the
> StaticTuples get to be shared between them [I think].)

Thanks for measuring this. It certainly reinforces in my mind the 
benefits of only loading the data we need when we need it, vs the full 
graph every time. I wonder what the memory usage would be for OOo or the 
kernel?

One nice thing about the revision-id-to-revno map is that the data 
*ought* to be highly stable. It should only change when new revisions 
are added and, provided the mainline is only appended to, the old data 
should remain a correct subset.
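
A rough sketch of how that stability could be exploited, for mainline
revnos only (dotted revnos are left out). The function name and the
data shapes are hypothetical, and it assumes a revision-id pins down
its whole left-hand ancestry, so checking the cached tip is enough to
know the older entries are still valid.

```python
def update_mainline_revnos(cached, mainline):
    """Extend a cached {revision_id: revno} map for a mainline.

    cached:   dict of mainline revision-id -> integer revno (may be empty)
    mainline: list of mainline revision-ids, oldest first

    Returns (revnos, reused). reused is True when the old entries were
    kept and only new revisions were numbered.
    """
    n = len(cached)
    # Append-only check: the old tip must still sit at position n.
    append_only = n == 0 or (
        n <= len(mainline) and cached.get(mainline[n - 1]) == n
    )
    if not append_only:
        # Mainline was rewritten (e.g. uncommit or pull --overwrite):
        # throw the old data away and renumber from scratch.
        return {revid: i for i, revid in enumerate(mainline, start=1)}, False
    updated = dict(cached)
    for i in range(n, len(mainline)):
        updated[mainline[i]] = i + 1
    return updated, True
```

In the common case only the revisions added since the last update get
numbered; everything else is reused as-is.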

I wonder if we can use those facts to our advantage when setting up 
codebrowse caching for Launchpad? For example, I suspect 90% of feature 
branches are simply a few revisions over and above a revision of the 
mainline of a series branch, i.e. there are no additional dotted-revnos 
for those branches after their creation point. And if there were, users 
would rarely visit pages displaying them? And if they did, we could 
still calculate those each time, rather than cache data that may never 
be needed?

In other words, I'm wondering whether the right caching deployment might 
be something like:

1. If a branch has fewer than X (1k?) revisions, don't bother with a cache.
2. If the branch is the only one for a project or the branch is
    assigned a series (like trunk or 2.1), then cache revnos for
    revision-ids in that branch.
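
As a sketch, the two rules above might look something like this. The
BranchInfo type and its fields are made up for illustration (they are
not Launchpad or loggerhead APIs), and the 1000-revision threshold is
just the "X" guess from rule 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BranchInfo:
    """Hypothetical summary of what the policy needs to know."""
    revision_count: int
    series: Optional[str] = None          # e.g. "trunk" or "2.1"
    only_branch_in_project: bool = False


def should_cache(branch, min_revisions=1000):
    """Decide whether a revno cache is worth building for a branch."""
    # Rule 1: small branches are cheap to compute on the fly.
    if branch.revision_count < min_revisions:
        return False
    # Rule 2: only sole branches and series branches get a cache.
    return branch.only_branch_in_project or branch.series is not None
```

Large feature branches fall through both rules and get no cache of
their own.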

Not sure. In any case, I think we should:

1. Assume that caches for all branches will be overkill.

2. Look at ways of using the cache of a parent branch for
    stacked branches.
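
For point 2, one possible shape, sketched under some strong
assumptions: the stacked branch is just a few commits on top of a
revision of the parent's mainline, it has not merged from the parent
since, and only mainline revnos are wanted. All names here are
hypothetical.

```python
def stacked_revno(revid, own_revids, parent_mainline_cache, branch_point_revno):
    """Look up a mainline revno for a stacked branch via its parent's cache.

    own_revids:            revision-ids committed on the stacked branch,
                           oldest first
    parent_mainline_cache: {revision_id: revno} for the parent's mainline
    branch_point_revno:    parent revno the stacked branch was created from

    Returns the revno, or None if the caller must compute it the slow way.
    """
    if revid in own_revids:
        # The branch's own commits continue numbering from the branch point.
        return branch_point_revno + own_revids.index(revid) + 1
    revno = parent_mainline_cache.get(revid)
    # Parent revisions after the branch point are not in this branch.
    if revno is not None and revno <= branch_point_revno:
        return revno
    return None
```

Anything that returns None (dotted revnos, post-merge history) would
still be calculated on demand rather than cached.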

Ian C.