[MERGE] Add simple Storage object

Wed Jan 30 05:22:04 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2008-01-28 at 23:37 -0500, Aaron Bentley wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Robert Collins wrote:
>>> I don't see a functional difference between just making everything in
>>> the base Repository be defined in the primitives you mention, and what
>>> you present.
>> I agree there's no functional difference between those two options.  I
>> want to structure it this way because I think it has benefits for people
>> developing the code
>>
>> 1. It is very clear what a Storage object must implement.  It would not
>> be nearly so clear what a Repository derivative must implement.
> 
> Well, using NotImplementedError & our repository interface tests, I have
> to disagree with you about the difference in clarity here.

I still feel that a class with 17 methods, 15 of them requiring
implementation is clearer than a class with 88 methods, 15 of them
requiring implementation

(Repos have 81 public methods, and Storage objects have 7 methods that
aren't present in repos.)

> There is overlap because one method is old, with warnings about
> scalability, and one is newer, and we're still evolving to get the right
> balance (networks are hard, hmmm).

I like to think that we're getting ever closer to optimal APIs.  I can't
imagine a more efficient API for build_tree than iter_files_bytes.

>> 3. The implementation API is not a very friendly one, so it does not
>> make sense to expose it on Repository.
> 
> The implementation API could be a set of _ methods.

Yes, it's certainly doable.  It does seem to blur the notions of public
and private APIs a bit.  Repository implementors (e.g. bzr-svn) would be
defining a bunch of methods that they shouldn't call.

>> 5. It creates a boundary between core and non-core functionality.  This
>> should encourage people to add methods to Repository and not Storage.
>> This will reduce the growth of the Storage interface.
> 
> This is an interesting point; I wonder about performance implications.

This is very much an issue of judgment.  You talked about evolving
understanding-- well, I hope that our understanding of what we need from
repos is quite refined at this point.

I *think* that this API can provide efficient performance on most
repository types.  The exception I can think of is weaves, because they
don't have a delta format we can transmit over iter_raw_items.  For
formats like that, I expect we'd implement iter_byte_streams on top of
weaves.  (The really hacky way to implement iter_raw_items on weaves
would be to provide each "raw item" as a fulltext).

I would really appreciate critical analysis in this area.  If there's
anything about this interface that's too limiting, it would be really
helpful to know in advance.

>> 6. Pack repositories have 81 public methods by my count (see below).
>> This is large and intimidating, and it does not include
>> Repository.weave_store or VersionedFile, both of which are arguably part
>> of the Repository interface.  Adding even more methods would not be nice.
> 
> KnitPackRepository implements 22 methods however; which is pretty small
> and minimal.

That's encouraging, because it suggests that our minimal subset of
functionality is already becoming clear.  That doesn't really
addresses the point I was making, which was that adding more public
methods to repositories would make them more intimidating, but your
response to 3 does.

>> 7. Smaller interfaces are easier to test, and therefore easier to test
>> thoroughly.
> 
> This point I completely agree with.
> 
> I have a serious concern about adding Storage in the way you are doing
> so, which is about getting the benefits of your point 7. My concern is
> that having
> Repository
>  |
> StorageRepository->Storage
> 
> we will still be allowing people to vary Repository in anyway they want
> - its the current public subclassable interface, and the factory logic
> for disk formats creates RepositoryFormat and Repository instances.

There is some tension there, yes.  I'm not proposing that we eliminate
subclassing entirely, but I hope that the amount of variation needed
would be small, even for RemoteRepository.

> Secondly we have many different implementations of much of Repository,
> *because* different storage idioms required different tradeoffs, and we
> should consolidate that before we replace the varied implementations
> with a common one (because the consolidation will give us the interface
> we need, if not yet on a separate class).

So you're suggesting a plan where we
1. Implement all Storage methods as methods on Repository
2. Split those methods out onto a new Storage object?

This seems at odds with the implementation plans you lay out below.

> But if Storage is meant to represent the core disk code, then really, we
> should be having the factory Logic generate Storage and StorageFormat
> instances, with Repository becoming a static class in bzrlib which is
> parameterised by the Storage instance.

I think that approach is probably too rigid.  There needs to be room for
*some* variation across types.

> And crucially, to move to this design safely I have two things I think
> are very important:
>  - we should not be trying to build out a separate backend without
> pivoting the disk format factory logic in a single (small) step. (this
> prevents a bunch of potential skew happening)

My idea was to extract functionality from KnitRepository and
PackRepository into Storage subclasses.  When all of this was done,
there should be little difference between them (the differences all
being in StoreStorage and PackStorage

>  - we should not replace any currently overridable method with a non
> overridable one until...

I'm not planning to do that at all.

My current idea is to do this for knit and pack formats only, not weave
formats or older.  And even formats that do use Storage could still have
optimizations if that made sense.

> I don't know that you are right about having a very small storage API.
> My strong suspicion is that for optimisation we will continue to want a
> rich interface that people override behaviours on specific to their disk
> layout; and if they are doing that then Storage is dead weight rather
> than a benefit.

I think that optimization will happen, but I think we can be tasteful.
I can imagine that RemoteRepository would continue to implement
get_data_stream_for_search specially, because we want the maximum
possible performance for fetch.  But for the inverse fetch, I don't
think we can do much better than insert_raw_items.  I think
get_revisions (log) and iter_files_bytes (checkout) would both be quite
fast on top of Storage.iter_byte_streams.

So I can well imagine that RemoteRepo's optimizations would be mainly
get_data_stream_for_search and some graph optimizations.

> (Consider RemoteRepository here - unless Storage
> is /really clever/, or asynchronous, RemoteRepository will still have to
> be the interface, and we'll need all the same interface tests we have
> today - a failure on your point 7 IMO).

Hmm.  It seems a shame to test the whole RemoteRepository interface if
only get_data_stream_for_search were changed.  But I concede-- this
probably won't reduce the amount of testing directly.  If we manage to
consolidate KnitRepo and PackRepo, we just reduce the number of
combinations.

> But I'm willing to give it the benefit of the doubt. To migrate safely
> in light of the two specific things I consider really important, I
> suggest a series of patches:
> An initial one that either:
> (
>  - renames Repository to Storage (in code and tests)
>  - renames RepositoryFormat to StorageFormat (in code and tests)
>  - creates a new Repository user class which decorates Storage, and
> passes all the existing Repository methods through. (but this should not
> be called Repository, because plugins subclass Repository, and this is
> an API break)
> )
> or
> (
>  - creates a new user visible RepositoryHandle or similar class which
> delegates everything to a Repository instance
>  - renames Repository to _Repository (and likewise _RepositoryFormat) in
> code and tests
>  - creates a Repository subclass of Repository which has a deprecation
> warning on its constructor (this provides api compatibility)
> )

I don't think these plans are really in sync with my idea that Storage
be a minimal interface, with optimization on the Repository still possible.

>>> But adding another abstraction to Repository, which is already having
>>> problems - previous cleanups not completed - seems to me to be making
>>> the api issues there worse, not better.
>> I am not sure what you're referring to.  Could you expand on that?
> 
> Each time we've added repository disk types, we've added new partial
> abstractions to the family of types in Repository, but not removed the
> old ones. RevisionStore for instance makes /no/ sense for Pack
> repositories, but the users interface to Repository essentially exposes
> RevisionStore sufficiently often that the cleanest way to implement the
> pack repo was through an adapter to RevisionStore rather than just a
> couple of methods.

I've never seen these as cleanups not completed.

Ideally, they would have never gone on the API in the first place.  But
now that we're here, I don't see a way of removing them without
significant API breakage.

Tree.inventory is a similar problem, but at least we have talked about
how to get rid of it.  I don't remember such conversations about Stores.

I think they're quite entrenched, and it's probably better to work on
removing Tree.inventory than Repository.revision_store at this point.
And that Tree.inventory is an area where we're making progress.

The main problem with stores is that they are public.  I certainly want
to avoid the mistake of making Storage a public Repository API.

> We've basically got a number of 'Storage-like' objects already; and they
> are insufficient to handle all the variation between Repository disk
> format capabilities. This is why I'm sceptical about adding another
> Storage interface being particularly better.

In our MPRepo discussion, you didn't think it was tasteful to renew
Stores and give them the kind of responsibilities I'm now suggesting for
Storage.  That's why I'm proposing Storage rather than trying to
rehabilitate Stores.

>> Even if there are some partially-completed cleanups, I still think that
>> it is worth it to try and clean up the API.  You know, if at first you
>> don't suceed...
> 
> Agreed; but perhaps we should try something different? I'll sleep on
> this and see if any good ideas about different ways to look at this area
> of bzrlib crop up.

Cool.  I'll be interested to hear what you come up with.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHoAl80F+nu1YWqI0RAqfrAJ9xfH1579wiHCw/5XMwMNOnPccJBgCfaSd/
KhYvjFavkCYuVPpwcdCvdco=
=3APl
-----END PGP SIGNATURE-----