[MERGE] Add simple Storage object

Fri Feb 8 04:00:59 GMT 2008

On Wed, 2008-01-30 at 00:22 -0500, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 

> > Well, using NotImplementedError & our repository interface tests, I have
> > to disagree with you about the difference in clarity here.
> 
> I still feel that a class with 17 methods, 15 of them requiring
> implementation is clearer than a class with 88 methods, 15 of them
> requiring implementation
> 
> (Repos have 81 public methods, and Storage objects have 7 methods that
> aren't present in repos.)

I agree there is a difference in clarity here. But - pack and knits
don't have the same more for storage any more than weaves and knits;
certainly things like svn and hg don't have much in common at all. While
I'm happy for us to work towards a Storage interface, I'd really like us
to sort out the current mess first: it will make the size that Storage
can be clearer, and get rid of accumulated dross. template methods are a
very good pattern for allowing subclasses to change behaviour, and its
easy to generate a list of all the template methods if we have the root
Repository have them consistently named (e.g. like John's do_* pattern).
I'm not saying Storage has no value, but I am saying much of the
problems with Repository are /not/ the lack of Storage, but just
Repository not getting enough clean-me-up-scotty love.

> > There is overlap because one method is old, with warnings about
> > scalability, and one is newer, and we're still evolving to get the right
> > balance (networks are hard, hmmm).
> 
> I like to think that we're getting ever closer to optimal APIs.  I can't
> imagine a more efficient API for build_tree than iter_files_bytes.

Ack, we're getting nice apis together.

> >> 3. The implementation API is not a very friendly one, so it does not
> >> make sense to expose it on Repository.
> > 
> > The implementation API could be a set of _ methods.
> 
> Yes, it's certainly doable.  It does seem to blur the notions of public
> and private APIs a bit.  Repository implementors (e.g. bzr-svn) would be
> defining a bunch of methods that they shouldn't call.

We had a long thread way back about public(external) public(internal)
and private(no peeking) - Martin and others felt that the complexity of
representing this was too high in python, we now only have:
public(to all)
private(and anything inside bzrlib, and in subclasses from elsewhere can
choose to ignore this).

We have also noted in some private methods that its
private-but-for-overriding.

> >> 5. It creates a boundary between core and non-core functionality.  This
> >> should encourage people to add methods to Repository and not Storage.
> >> This will reduce the growth of the Storage interface.
> > 
> > This is an interesting point; I wonder about performance implications.
> 
> This is very much an issue of judgment.  You talked about evolving
> understanding-- well, I hope that our understanding of what we need from
> repos is quite refined at this point.
> 
> I *think* that this API can provide efficient performance on most
> repository types.  The exception I can think of is weaves, because they
> don't have a delta format we can transmit over iter_raw_items.  For
> formats like that, I expect we'd implement iter_byte_streams on top of
> weaves.  (The really hacky way to implement iter_raw_items on weaves
> would be to provide each "raw item" as a fulltext).
> 
> I would really appreciate critical analysis in this area.  If there's
> anything about this interface that's too limiting, it would be really
> helpful to know in advance.

Latency - the smart server is our biggest brightest hope for really
extreme performance over the network; and primitive methods which result
in many api calls locally become dramatically more expensive over the
smart server (even when they may not be so expensive over sftp because
of object based caching).

> >> 7. Smaller interfaces are easier to test, and therefore easier to
> test
> >> thoroughly.
> > 
> > This point I completely agree with.
> > 
> > I have a serious concern about adding Storage in the way you are
> doing
> > so, which is about getting the benefits of your point 7. My concern
> is
> > that having
> > Repository
> >  |
> > StorageRepository->Storage
> > 
> > we will still be allowing people to vary Repository in anyway they
> want
> > - its the current public subclassable interface, and the factory
> logic
> > for disk formats creates RepositoryFormat and Repository instances.
> 
> There is some tension there, yes.  I'm not proposing that we eliminate
> subclassing entirely, but I hope that the amount of variation needed
> would be small, even for RemoteRepository.

The vast bulk of the code in bzrlib is dealing with a basically coherent model.

Until we have one of:
 svn
 hg
 git
 path-tokens
 journalled-inventories

in the core library, I don't think we'll really have a handle on the
amount of variation required to support with high performance
significant variation in the core; and I think we can and should do
that. I found while tuning commit that we still have a lot of stuff on
Repository that is basically noise for modern repositories.

> > Secondly we have many different implementations of much of Repository,
> > *because* different storage idioms required different tradeoffs, and we
> > should consolidate that before we replace the varied implementations
> > with a common one (because the consolidation will give us the interface
> > we need, if not yet on a separate class).
> 
> So you're suggesting a plan where we
> 1. Implement all Storage methods as methods on Repository
> 2. Split those methods out onto a new Storage object?
> 
> This seems at odds with the implementation plans you lay out below.

I'm saying that I'm comfortable and happy to review patches that clean
up Repository to reduce duplication and improve performance and clarity.
Thats (1). And that if we do that well I'd be immensely happier about
Storage, because it will be able to be applied consistently.

> > But if Storage is meant to represent the core disk code, then really, we
> > should be having the factory Logic generate Storage and StorageFormat
> > instances, with Repository becoming a static class in bzrlib which is
> > parameterised by the Storage instance.
> 
> I think that approach is probably too rigid.  There needs to be room for
> *some* variation across types.

Conceptually we have:
disk format -> in memory object structure.

We need to test all of these mapped structures for some set of
interfaces. I don't think reducing coverage here is a good idea.

Saying that 'Both Pack1 and Knit1 are implemented by subclasses of
Storage with a StorageRepository' does allow us to move some of the
permitations to Storage* tests; but it doesn't actually make it easier
for other repository implementors.

> > And crucially, to move to this design safely I have two things I
> think
> > are very important:
> >  - we should not be trying to build out a separate backend without
> > pivoting the disk format factory logic in a single (small) step.
> (this
> > prevents a bunch of potential skew happening)
> 
...

> My current idea is to do this for knit and pack formats only, not weave
> formats or older.  And even formats that do use Storage could still have
> optimizations if that made sense.

I'd like to see the functionality extracted into helper methods on
Repository, and done across all formats so that the top level Repository
public methods start to be solid and sane - they are currently very
weave, or even just a bunch of files centric.

> > But I'm willing to give it the benefit of the doubt. To migrate safely
> > in light of the two specific things I consider really important, I
> > suggest a series of patches:
> > An initial one that either:
> > (
> >  - renames Repository to Storage (in code and tests)
> >  - renames RepositoryFormat to StorageFormat (in code and tests)
> >  - creates a new Repository user class which decorates Storage, and
> > passes all the existing Repository methods through. (but this should not
> > be called Repository, because plugins subclass Repository, and this is
> > an API break)
> > )
> > or
> > (
> >  - creates a new user visible RepositoryHandle or similar class which
> > delegates everything to a Repository instance
> >  - renames Repository to _Repository (and likewise _RepositoryFormat) in
> > code and tests
> >  - creates a Repository subclass of Repository which has a deprecation
> > warning on its constructor (this provides api compatibility)
> > )
> 
> I don't think these plans are really in sync with my idea that Storage
> be a minimal interface, with optimization on the Repository still possible.

I don't think that particular combination can be done in the immediate
term, but getting *a* minimal interface (of methods-to-override) on
Repository is doable.

> >>> But adding another abstraction to Repository, which is already having
> >>> problems - previous cleanups not completed - seems to me to be making
> >>> the api issues there worse, not better.
> >> I am not sure what you're referring to.  Could you expand on that?
> > 
> > Each time we've added repository disk types, we've added new partial
> > abstractions to the family of types in Repository, but not removed the
> > old ones. RevisionStore for instance makes /no/ sense for Pack
> > repositories, but the users interface to Repository essentially exposes
> > RevisionStore sufficiently often that the cleanest way to implement the
> > pack repo was through an adapter to RevisionStore rather than just a
> > couple of methods.
> 
> I've never seen these as cleanups not completed.
> 
> Ideally, they would have never gone on the API in the first place.  But
> now that we're here, I don't see a way of removing them without
> significant API breakage.

deprecation + change the code that uses them within the library. I
consider a deprecated method as 'gone' for cleanliness purposes.

> Tree.inventory is a similar problem, but at least we have talked about
> how to get rid of it.  I don't remember such conversations about Stores.
> 
> I think they're quite entrenched, and it's probably better to work on
> removing Tree.inventory than Repository.revision_store at this point.
> And that Tree.inventory is an area where we're making progress.
> 
> The main problem with stores is that they are public.  I certainly want
> to avoid the mistake of making Storage a public Repository API.

Ack.

> > We've basically got a number of 'Storage-like' objects already; and they
> > are insufficient to handle all the variation between Repository disk
> > format capabilities. This is why I'm sceptical about adding another
> > Storage interface being particularly better.
> 
> In our MPRepo discussion, you didn't think it was tasteful to renew
> Stores and give them the kind of responsibilities I'm now suggesting for
> Storage.  That's why I'm proposing Storage rather than trying to
> rehabilitate Stores.

For the same basic reason I think. I think Repository needs love, rather
than going sideways to avoid the problem (because I don't think sideways
avoids the problem).

> >> Even if there are some partially-completed cleanups, I still think that
> >> it is worth it to try and clean up the API.  You know, if at first you
> >> don't suceed...
> > 
> > Agreed; but perhaps we should try something different? I'll sleep on
> > this and see if any good ideas about different ways to look at this area
> > of bzrlib crop up.
> 
> Cool.  I'll be interested to hear what you come up with.

No breakthroughs as yet, but something is itching at my head, I just
can't articulate it yet.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080208/95daa73e/attachment.pgp