sketch of a new interface between loggerhead and bzrlib
Michael Hudson
michael.hudson at canonical.com
Fri Jun 22 16:30:11 BST 2007
Thank a lot for this reply!
Aaron Bentley wrote:
> Michael Hudson wrote:
>> This thread is aiming to be a bit more technical than the last one I
>> started :)
>
>> One of the reliability issues we have with loggerhead/codebrowse comes
>> from its use of a bdb database to cache information about revisions.
>
> Surprising. I would have expected that a database cache would
> complement Bazaar well.
Oh, these are eminently practical concerns, not theoretical ones. I
don't know where in the stack the problem is, but the changes caches
routinely get corrupted and have to be thrown away (one or two a day on
average on codebrowse.launchpad.net).
Also the process by which the caches get filled sucks for large trees
(it passes list of 100 revids at a time to get_revision_trees()).
> After all, Bazaar revisions never change, so
> you don't have to even worry about cache expiry if you don't want to.
> But a database should give you even better speed than Bazaar itself
> provides.
Yup.
> A database cache should provide a stopgap for the fact that Bazaar
> doesn't mark a directory modified when its children (or grandchildren)
> are modified. You can calculate this once, which is somewhat slow, and
> then never need to calculate that again.
Indeed. But I don't think we need to depend on the fact that there's a
db behind things.
> Plus you can run queries like "What are the last 10 commits made across
> all branches?" and "Who has made the most commits".
The problem here is that I don't know to what extent we can assume the
cache is full. If someone merges from head on their launchpad branch
and pushes that, there could easily be a few hundred new revisions which
will take a few minutes to analyze. I guess you could just display a
warning when this is happening:
"Warning: this search does not include 67 unprocessed new revisions"
>> Do the wonderful and wise in here think this interface is reasonable?
>> Implementable? I'm reasonably sure it's featureful enough to provide
>> all the functionality of today's loggerhead, but maybe I'm being blind
>> here, so I'd appreciate comments on that too.
>
> Well, it reminds me a little too much of my work on trac+bzr, where I
> had to implement shadow "Branches", "Trees" and "Repositories" that
> behaved more like Trac expected them to. Since you have the freedom to
> adapt Loggerhead to work smoothly with Bazaar's model, I'd think heavy
> adaptation should not be necessary.
Indeed.
> FileChangeRef: seems either too short or to long. Why only path, fileid
> and oldpath?
>
> If you look at the output of Tree._iter_changes, it tells you
> 1. file-id
> 2. path
> 3. old path
> 4. past and current name (indicates whether a file was renamed)
> 5. past and current parent directory (indicates whether the file was
> moved into another directory)
> 6. past and present versioning
> 7. past and present file type
> 8. whether the contents were changed
> 9. whether the execute bit was changed
>
> Pretty much everything here seems like something you might want to
> provide. It does not provide it in a convenient form, but we have been
> planning to change that. Note that comparing the old path to the new
> path will not tell you whether a file has been renamed-- its parent
> directory may have been renamed instead.
That's a good point.
> I explicitly do not like organizing things by change type. I like to
> see them organized by file.
Well, the reason I put it like that is that is how it's likely to be
displayed:
added: ...
removed: ...
renamed: ...
modified: ...
But it's no big deal really.
> So instead of your FileChanges object, I'd rather see a list of
> FileDelta objects. FileDelta objects would have members file_id,
> old_path, new_path, old_name, new_name, old_parent, new_parent,
> old_type, new_type, content_change, execute_change. Actually, we should
> add something like that to bzrlib.
I like this FileDelta object though.
> Also, are you sure it's a good idea to restrict yourself to the lefthand
> parent? I'd want it to be *possible* to compare with the righthand
> parent at least.
I'm not sure. Given that there will be a way of comparing an arbitrary
pair of revisions, this will certainly be possible, but that's different
from it being easy, or cached. I guess a list of FileDeltas per parent
wouldn't be so hard.
> I don't think it's a good idea to make a representation of tree changes
> be part of the RevisionInfo object because that data is relatively
> expensive to generate, and not always needed. Perhaps a getter instead?
It could be a @cachedproperty or similar, but yes, I think making the
separation explicit is probably a good thing.
> Branch nick is just one possible revision-property. There are others,
> notably the bugfix property. I'd suggest exposing all
> revision-properties. If you want more convenient access to the branch
> nick, you can just make a Python property to retrieve it.
Makes sense.
> BranchRevisionInfo looks interesting, but I would look at putting
> where_merged into the RevisionInfo object instead. (And this is an
> example of where a database cache wins)
But where_merged isn't stable, is it? It's also pretty badly named as
defined in the api now, and possibly not what I wanted anyway.
> History.getRevisionInfo: Much of this info is not "cheap". Especially
> dotted-revno and tree comparison data requires a lot of computation.
But dotted-revno _has_ to be cheap, as it's going to be the most
frequently requested thing. I was thinking of storing
branch.get_revision_id_to_revno_map() on the History object somewhere --
even for launchpad (my archetypal large repository, if you hadn't
noticed already) it only takes 700ms, which seems a reasonably cost to
pay once per 'bzr push'.
> In contexts where you want only a single revision, a getRevisionInfo call
> makes sense. For the log view, I would strongly recommend providing an
> iterator or list.
Why, particularly?
> History.getBundle: Please don't generate bundles against the lefthand
> parent. Bundles are for merging, so they should be generated relative
> to a target branch-- the branch they are being merged into. In order to
> be merged successfully, branches must contain all the revisions which
> are not present in the target branch. A bundle that does not contain
> all the necessary revisions is no more useful than a patch.
This is just copying something that loggerhead does already, and I don't
really know why it has an option to generate a bundle. So maybe I'll
just drop it...
> normalizeFileArguments: I think this would be clearer as two separate
> functions.
Perhaps.
>> There a couple of open issues:
>
>> 1) This interface is designed to hide all of bzrlib, whereas it could
>> just provide getReivisionInfo. I think it's just about worth the
>> extra effort adding things like getFileText to have the entire
>> interface in one file.
>
> Well, to me, bzrlib is quite nice and I don't want to hide it. But
> BzrInspect isn't Loggerhead and doesn't compete with it.
I don't know what BzrInspect is :)
>> 2) There are some things you'd probably do differently if you were
>> expecting this data to be cached in a relational DB. I think it's
>> probably sensible design the most convenient interface for
>> loggerhead, and worry about storage after that.
>
> I think you should push things in both directions-- don't require
> expensive operations to be performed when they are unnecessary.
Sure, but Python is quite good at being lazy. Just because you return
something with a .changes attribute doesn't mean you have to compute it :)
> But do require that everything you want is conveniently available.
Right, that was the goal I was aiming for.
Cheers,
mwh
More information about the bazaar
mailing list