sketch of a new interface between loggerhead and bzrlib

Fri Jun 22 16:30:11 BST 2007

Thank a lot for this reply!

Aaron Bentley wrote:
> Michael Hudson wrote:
>> This thread is aiming to be a bit more technical than the last one I
>> started :)
> 
>> One of the reliability issues we have with loggerhead/codebrowse comes
>> from its use of a bdb database to cache information about revisions.
> 
> Surprising.  I would have expected that a database cache would
> complement Bazaar well.

Oh, these are eminently practical concerns, not theoretical ones.  I
don't know where in the stack the problem is, but the changes caches
routinely get corrupted and have to be thrown away (one or two a day on
average on codebrowse.launchpad.net).

Also the process by which the caches get filled sucks for large trees
(it passes list of 100 revids at a time to get_revision_trees()).

> After all, Bazaar revisions never change, so
> you don't have to even worry about cache expiry if you don't want to.
> But a database should give you even better speed than Bazaar itself
> provides.

Yup.

> A database cache should provide a stopgap for the fact that Bazaar
> doesn't mark a directory modified when its children (or grandchildren)
> are modified.  You can calculate this once, which is somewhat slow, and
> then never need to calculate that again.

Indeed.  But I don't think we need to depend on the fact that there's a
db behind things.

> Plus you can run queries like "What are the last 10 commits made across
> all branches?" and "Who has made the most commits".

The problem here is that I don't know to what extent we can assume the
cache is full.  If someone merges from head on their launchpad branch
and pushes that, there could easily be a few hundred new revisions which
will take a few minutes to analyze.  I guess you could just display a
warning when this is happening:

   "Warning: this search does not include 67 unprocessed new revisions"

>> Do the wonderful and wise in here think this interface is reasonable?
>> Implementable?  I'm reasonably sure it's featureful enough to provide
>> all the functionality of today's loggerhead, but maybe I'm being blind
>> here, so I'd appreciate comments on that too.
> 
> Well, it reminds me a little too much of my work on trac+bzr, where I
> had to implement shadow "Branches", "Trees" and "Repositories" that
> behaved more like Trac expected them to.  Since you have the freedom to
> adapt Loggerhead to work smoothly with Bazaar's model, I'd think heavy
> adaptation should not be necessary.

Indeed.

> FileChangeRef: seems either too short or to long.  Why only path, fileid
> and oldpath?
> 
> If you look at the output of Tree._iter_changes, it tells you
> 1. file-id
> 2. path
> 3. old path
> 4. past and current name (indicates whether a file was renamed)
> 5. past and current parent directory (indicates whether the file was
> moved into another directory)
> 6. past and present versioning
> 7. past and present file type
> 8. whether the contents were changed
> 9. whether the execute bit was changed
> 
> Pretty much everything here seems like something you might want to
> provide.  It does not provide it in a convenient form, but we have been
> planning to change that.  Note that comparing the old path to the new
> path will not tell you whether a file has been renamed-- its parent
> directory may have been renamed instead.

That's a good point.

> I explicitly do not like organizing things by change type.  I like to
> see them organized by file.

Well, the reason I put it like that is that is how it's likely to be
displayed:

    added: ...
  removed: ...
  renamed: ...
 modified: ...

But it's no big deal really.

> So instead of your FileChanges object, I'd rather see a list of
> FileDelta objects.  FileDelta objects would have members file_id,
> old_path, new_path, old_name, new_name, old_parent, new_parent,
> old_type, new_type, content_change, execute_change.  Actually, we should
> add something like that to bzrlib.

I like this FileDelta object though.

> Also, are you sure it's a good idea to restrict yourself to the lefthand
> parent?  I'd want it to be *possible* to compare with the righthand
> parent at least.

I'm not sure.  Given that there will be a way of comparing an arbitrary
pair of revisions, this will certainly be possible, but that's different
from it being easy, or cached.  I guess a list of FileDeltas per parent
wouldn't be so hard.

> I don't think it's a good idea to make a representation of tree changes
> be part of the RevisionInfo object because that data is relatively
> expensive to generate, and not always needed.  Perhaps a getter instead?

It could be a @cachedproperty or similar, but yes, I think making the
separation explicit is probably a good thing.

> Branch nick is just one possible revision-property.  There are others,
> notably the bugfix property.  I'd suggest exposing all
> revision-properties.  If you want more convenient access to the branch
> nick, you can just make a Python property to retrieve it.

Makes sense.

> BranchRevisionInfo looks interesting, but I would look at putting
> where_merged into the RevisionInfo object instead.  (And this is an
> example of where a database cache wins)

But where_merged isn't stable, is it?  It's also pretty badly named as
defined in the api now, and possibly not what I wanted anyway.

> History.getRevisionInfo: Much of this info is not "cheap".  Especially
> dotted-revno and tree comparison data requires a lot of computation.

But dotted-revno _has_ to be cheap, as it's going to be the most
frequently requested thing.  I was thinking of storing
branch.get_revision_id_to_revno_map() on the History object somewhere --
even for launchpad (my archetypal large repository, if you hadn't
noticed already) it only takes 700ms, which seems a reasonably cost to
pay once per 'bzr push'.

> In contexts where you want only a single revision, a getRevisionInfo call
> makes sense.  For the log view, I would strongly recommend providing an
> iterator or list.

Why, particularly?

> History.getBundle: Please don't generate bundles against the lefthand
> parent.  Bundles are for merging, so they should be generated relative
> to a target branch-- the branch they are being merged into.  In order to
> be merged successfully, branches must contain all the revisions which
> are not present in the target branch.  A bundle that does not contain
> all the necessary revisions is no more useful than a patch.

This is just copying something that loggerhead does already, and I don't
really know why it has an option to generate a bundle.  So maybe I'll
just drop it...

> normalizeFileArguments: I think this would be clearer as two separate
> functions.

Perhaps.

>> There a couple of open issues:
> 
>> 1) This interface is designed to hide all of bzrlib, whereas it could
>>    just provide getReivisionInfo.  I think it's just about worth the
>>    extra effort adding things like getFileText to have the entire
>>    interface in one file.
> 
> Well, to me, bzrlib is quite nice and I don't want to hide it.  But
> BzrInspect isn't Loggerhead and doesn't compete with it.

I don't know what BzrInspect is :)

>> 2) There are some things you'd probably do differently if you were
>>    expecting this data to be cached in a relational DB.  I think it's
>>    probably sensible design the most convenient interface for
>>    loggerhead, and worry about storage after that.
> 
> I think you should push things in both directions-- don't require
> expensive operations to be performed when they are unnecessary.

Sure, but Python is quite good at being lazy.  Just because you return
something with a .changes attribute doesn't mean you have to compute it :)

> But do require that everything you want is conveniently available.

Right, that was the goal I was aiming for.

Cheers,
mwh