[MERGE] Add simple Storage object

Robert Collins robertc at robertcollins.net
Mon Feb 11 06:42:00 GMT 2008


On Mon, 2008-02-11 at 11:19 +1100, Robert Collins wrote:

I think this might be something worth adding to doc/developer/

> Going back to basics we have:
> N disk format probes
> Each probe can own a control directory, and once a matching one is
> found
> it becomes a factory object to give us a repository.
> This is Format->Repository
... 
through to
...
> We have a lot of work TODO though on inventory serialisation, revision
> serialisation, annotation cache storage.

Aaron and I have talked on IRC and this simple API is a bit too simple.
While there are things we're not in agreement on, we need reasonable
confidence that the following facilities will fit well:

last-changed inventory hack, journalled inventories, stream_fetch,
get_matching_blocks, annotations

Case by case, the inventory hack needs the set of lines introduced by a
set of texts; we can add:
iter_lines_present_or_added_by(keys)
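To make that concrete, here is a hedged sketch of what such a method could look like. The store class, its backing dict, and the use of difflib to find introduced lines are all illustrative assumptions; only the method name and its purpose come from the proposal above.

```python
import difflib

class DemoStore:
    """Toy byte store: key -> (compression_parent_or_None, lines).
    Illustrative only; not bzrlib's actual store."""

    def __init__(self):
        self._texts = {}

    def add_lines(self, key, parent, lines):
        self._texts[key] = (parent, lines)

    def iter_lines_present_or_added_by(self, keys):
        # Yield (line, key) for each line first introduced by one of
        # keys: lines that do not appear in the key's parent text.
        for key in keys:
            parent, lines = self._texts[key]
            parent_lines = self._texts[parent][1] if parent else []
            matcher = difflib.SequenceMatcher(None, parent_lines, lines)
            for tag, _, _, j1, j2 in matcher.get_opcodes():
                if tag in ('insert', 'replace'):
                    for line in lines[j1:j2]:
                        yield line, key

store = DemoStore()
store.add_lines(('text', 'f1', 'r1'), None, ['a\n', 'b\n'])
store.add_lines(('text', 'f1', 'r2'), ('text', 'f1', 'r1'), ['a\n', 'c\n'])
added = list(store.iter_lines_present_or_added_by([('text', 'f1', 'r2')]))
```

Here only the line 'c\n' is reported, because 'a\n' was already present in the parent text.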

journalled inventories are layered upon byte-sequence storage. They
will, like current text reconstruction, want a compression-tree to be
scanned to plan a bulk readv to obtain all the byte-sequences to parse.
We can conceptually do this with two indices: A compression-tree index,
and a location-of-bytes index. The reason we have a single index in pack
repositories for each type of thing is to reduce IO - the data we're
storing against each key is smaller than the key size, so two separate
indices - twice the index IO. My prototype implementation actually just
drops the compression flag from the inventory knit and uses the parent
field itself. So journalled inventories really don't need more than
adding bytes with an attached compression parent field that is cached
for fast access.
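A minimal sketch of that shape, assuming a dict-backed store: each key carries an optional compression parent cached in the index, and reading walks the parent chain to plan one bulk readv. The class and method names here are illustrative, not bzrlib's actual API.

```python
class ByteStoreWithCompressionParents:
    """Sketch: byte sequences keyed by tuples, each with an optional
    compression parent cached in the index for fast access."""

    def __init__(self):
        self._bytes = {}    # key -> stored bytes (fulltext or delta)
        self._parents = {}  # key -> compression parent (cached index field)

    def add_bytes(self, key, value, compression_parent=None):
        self._bytes[key] = value
        self._parents[key] = compression_parent

    def compression_chain(self, key):
        # Walk parents back to the fulltext, oldest first, so a caller
        # can plan a single bulk readv over the whole chain.
        chain = []
        while key is not None:
            chain.append(key)
            key = self._parents[key]
        return list(reversed(chain))

store = ByteStoreWithCompressionParents()
store.add_bytes(('inventory', 'r1'), b'full text', compression_parent=None)
store.add_bytes(('inventory', 'r2'), b'delta vs r1',
                compression_parent=('inventory', 'r1'))
chain = store.compression_chain(('inventory', 'r2'))
```

The chain comes back fulltext-first, which is the order a reconstructing reader wants.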

stream_fetch - 'get_data_stream' and 'insert_data_stream' should work
fine in the model I proposed - getting a data stream returns a stream
in the native format for a byte store, for a selected set of keys (the
repository will be responsible for selecting the keys). Inserting a
data stream likewise inserts native format streams, but we probably
want stream adapters to convert between formats where that is possible.
So I think we need to add:
get_stored_streams(keys)
insert_stored_streams(stream)
and be ready to write adapters which do transcoding.
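A sketch of how the two methods and a transcoding adapter could fit together. The method names follow the proposal; the (key, format, bytes) record shape and the adapter registry are assumptions for illustration only.

```python
class StreamingStore:
    """Sketch of native-format data streams with transcoding adapters."""

    def __init__(self, native_format):
        self.native_format = native_format
        self._records = {}
        self._adapters = {}  # (from_format, to_format) -> callable

    def register_adapter(self, from_format, to_format, fn):
        self._adapters[(from_format, to_format)] = fn

    def get_stored_streams(self, keys):
        # Streams come out in this store's native format.
        for key in keys:
            yield key, self.native_format, self._records[key]

    def insert_stored_streams(self, stream):
        # Transcode any record not already in our native format.
        for key, fmt, data in stream:
            if fmt != self.native_format:
                data = self._adapters[(fmt, self.native_format)](data)
            self._records[key] = data

source = StreamingStore('knit')
target = StreamingStore('pack')
# Hypothetical adapter standing in for real knit-to-pack transcoding.
target.register_adapter('knit', 'pack', lambda data: data + b'-repacked')
source.insert_stored_streams([(('text', 'f', 'r1'), 'knit', b'payload')])
target.insert_stored_streams(source.get_stored_streams([('text', 'f', 'r1')]))
```

The point of the adapter registry is that same-format fetches stay a straight byte copy, while cross-format fetches pay the transcoding cost only when needed.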

get_matching_blocks is part of the core logic of our diff routines; we
apparently use this to make bzr diff, and annotation without an
annotation cache, fast. I think it's reasonable, given that a byte store
_might_ have a matching representation, to have a method (and some
variations for different scenarios) for this, which like the other
methods will want to work on arbitrary sets of keys.
get_matching_blocks((from_key, to_key)+)
get_matching_blocks_and_texts((from_key, to_key)+)
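As a sketch of the first signature: a bulk interface taking (from_key, to_key) pairs and yielding matching blocks for each. The difflib backend and the key -> lines mapping standing in for the store are assumptions; a real byte store might compute the blocks from its delta representation instead.

```python
import difflib

def get_matching_blocks(texts, key_pairs):
    """For each (from_key, to_key) pair, yield the pair and the
    matching blocks between the two texts."""
    for from_key, to_key in key_pairs:
        matcher = difflib.SequenceMatcher(
            None, texts[from_key], texts[to_key])
        yield (from_key, to_key), matcher.get_matching_blocks()

texts = {
    ('text', 'f', 'r1'): ['a\n', 'b\n', 'c\n'],
    ('text', 'f', 'r2'): ['a\n', 'x\n', 'c\n'],
}
result = dict(get_matching_blocks(
    texts, [(('text', 'f', 'r1'), ('text', 'f', 'r2'))]))
```

Each result entry is the standard SequenceMatcher block list: (a, b, size) triples with a trailing zero-length sentinel.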

Now annotations are an interesting case, because annotations are
inherently dependent on the file graph; if you and I have different file
graphs (e.g. due to ghosts) then we want different annotations for a
file.

We have many styles of annotation:
 - intrinsic - e.g. weaves
 - stored - e.g. knits, where the annotation is part of the data
 - cached - what we'd like to do for packs
 - derived - where we look at the plain text data and the matching
blocks and generate annotation on the fly.

I can think of two reasonable ways to carve this up. One is to say
"we'll ignore all annotations other than derived". The second is to say
"annotations will come from a separate interface", and supply an
implementation of that interface that is coupled to each concrete byte
store. Current packs will have an implementation that uses the
get_matching_blocks interface to derive annotations, and future pack
repositories will have a cache in place there (which could potentially
use the byte store for storing the annotation data :)).

So here is an updated sketch:
UnifiedByteStore:
 This is responsible for:
storing, indexing and retrieving byte sequences with names that are a
key tuple like ('text', fileid, revisionid), or ('revision', revisionid)
or ('signature', revisionid) or ('inventory', revisionid).
This /is/ RepositoryPackCollection on the Pack repository format, with
perhaps a couple of tweaks.

The public interface will be some variation on:
---
add_bytes(key, value) -> hash
get_bytes(keys) -> iterator of byte_sequences
add_stream(key, object_with_read_close_methods)
get_streams(keys) -> iterator of objects_with_read_close_methods
---
in particular I expect the versionedfile.add_lines keyword parameters
will be desirable for performance, but there is a good chance we can
avoid pushing them down this far. Time will tell.
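A minimal in-memory sketch of that interface, assuming a dict backing and a sha1 return value for add_bytes (the sketch above doesn't say which hash); everything beyond the four signatures is an illustrative assumption.

```python
import hashlib
import io

class UnifiedByteStore:
    """In-memory sketch of the proposed four-method interface."""

    def __init__(self):
        self._store = {}

    def add_bytes(self, key, value):
        self._store[key] = value
        return hashlib.sha1(value).hexdigest()

    def get_bytes(self, keys):
        for key in keys:
            yield self._store[key]

    def add_stream(self, key, fileobj):
        # Per the sketch, the object need only have read and close.
        try:
            self._store[key] = fileobj.read()
        finally:
            fileobj.close()

    def get_streams(self, keys):
        for key in keys:
            yield io.BytesIO(self._store[key])

store = UnifiedByteStore()
digest = store.add_bytes(('revision', 'r1'), b'revision data')
store.add_stream(('text', 'f', 'r1'), io.BytesIO(b'file text'))
texts = [s.read() for s in store.get_streams([('text', 'f', 'r1')])]
```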

Packs need write-group management to generate good packs, knits need
file level (or repository wide) locking for data integrity, weaves need
repository wide locking. I think it's reasonable to want locking in the
API but it will for most existing implementations mean holding a
reference to a Repository object, rather than being able to be layered
in a single direction.

I also don't think we should require all implementations of these stores
to have locks or write groups - the only bits we need to be able to
substitute arbitrarily are in code that can be given /prepared/ byte
stores for use. That is, most APIs on Repository, or on the
revision_store etc, have the opportunity to call private methods on the
byte store they are working with. This makes sense if other things
change from Repository to Repository - and they do. 
We may alternatively choose to have some subclassed and refined
interfaces that add some or all of:
lock_read
lock_write
is_write_locked
is_read_locked
unlock
start_write_group
abort_write_group
commit_write_group
is_in_write_group
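A sketch of such a refined interface. The method names are the ones listed above; the counter-based lock state and the staging dict for pending writes are illustrative assumptions about one way the semantics could hang together.

```python
class WriteGroupByteStore:
    """Sketch: a byte store refined with locking and write groups.
    Writes are staged in a group and only land on commit."""

    def __init__(self):
        self._store = {}
        self._pending = None
        self._write_locks = 0

    def lock_write(self):
        self._write_locks += 1

    def unlock(self):
        self._write_locks -= 1

    def is_write_locked(self):
        return self._write_locks > 0

    def start_write_group(self):
        assert self.is_write_locked(), 'write group needs a write lock'
        self._pending = {}

    def is_in_write_group(self):
        return self._pending is not None

    def add_bytes(self, key, value):
        self._pending[key] = value

    def commit_write_group(self):
        self._store.update(self._pending)
        self._pending = None

    def abort_write_group(self):
        # Staged writes are simply discarded.
        self._pending = None

store = WriteGroupByteStore()
store.lock_write()
store.start_write_group()
store.add_bytes(('text', 'f', 'r1'), b'data')
store.commit_write_group()
store.unlock()
```

The staging dict is what lets a pack format accumulate a whole write group before deciding how to lay out the resulting pack.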

Using this we then split out the current stores into two interfaces -
one that takes objects to and from bytes, and one that purely stores
bytes.

The current bzrlib.store code should end with just byte stores.

Graph queries for Pack repositories can be done via private methods on
the unified store, likewise for Knits etc. I suggest this because graph
relationships between keys are not appropriate for a byte store, but
some stores will have to have indices to perform their core function -
so the disk structure we have today is good - it's a matter of how to
expose it.

We'll add a RepositoryAnnotation class with an instance sitting on
repository, say at Repository._annotation_provider. This will provide
annotations, which are currently conflated with texts on our versioned
files. For weave and knit repository formats this will basically just
call into the old VersionedFile interface to get an annotation. For
packs this can layer on top of the byte store's public methods to get
the matching blocks, and a private method or so to get the right graph
to annotate.
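A sketch of the derived case: an annotation provider that works from plain texts and matching blocks only, walking the file graph. The texts/parents mappings stand in for the byte store's public and private methods; class name aside, the internals are illustrative assumptions.

```python
import difflib

class RepositoryAnnotation:
    """Sketch of a derived-annotation provider: no stored or cached
    annotations, just texts plus a (simplified, single-parent)
    file graph."""

    def __init__(self, texts, parents):
        self._texts = texts      # key -> list of lines
        self._parents = parents  # key -> parent key or None

    def annotate(self, key):
        parent = self._parents[key]
        lines = self._texts[key]
        if parent is None:
            return [(key, line) for line in lines]
        parent_annotated = self.annotate(parent)
        parent_lines = [line for _, line in parent_annotated]
        result = []
        matcher = difflib.SequenceMatcher(None, parent_lines, lines)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                # Unchanged lines keep the parent's attribution.
                result.extend(parent_annotated[i1:i2])
            else:
                # New or changed lines are attributed to this key.
                result.extend((key, line) for line in lines[j1:j2])
        return result

texts = {('f', 'r1'): ['a\n', 'b\n'], ('f', 'r2'): ['a\n', 'c\n']}
parents = {('f', 'r1'): None, ('f', 'r2'): ('f', 'r1')}
annotations = RepositoryAnnotation(texts, parents).annotate(('f', 'r2'))
```

Because the provider only needs texts and matching blocks, a future pack format could swap in a cache behind the same annotate() call without callers noticing.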

-Rob


-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.