[MERGE] mp-pack repo format

Fri Dec 28 14:32:11 GMT 2007

Robert Collins wrote:
> I think having an 'access texts' API that takes key tuples may be a good
> thing to do. I'd like to note that pack repositories can already do what
> you talk about internally.

I could be wrong, because there are layers upon layers of code, but it
seemed like the knit implementation was constructing the texts, using
packs only for data access.

> I think that the storage layer for byte sequences should be unified some
> more (though it is pretty close to unified already - just needs some
> more mapping work)... I wouldn't call the top level API _iter_texts
> though, as many things will not be 'texts' in the sense we use it
> elsewhere.

We have described the serialized forms of revisions and inventories as
"texts" elsewhere.  That is the sense in which I meant it.  I'd be glad
of a better name.

> And I definitely do not think that having a single index is a
> good idea, nor requiring two indices for object access.

That's fine, I'm not proposing that.  The "prefix" names the index to
use for the "text".

>> To achieve this, _iter_texts uses the prefixes 'file', 'revision',
>> 'inventory', and 'signature' as namespaces to describe the kind of text
>> requested.  This should also extend nicely to new types, such as
>> 'annotation'.
> 
> I take it these are tuples - ('file', FILE_ID, REVISION) ?

Right.  It's just a prefix on the existing index names.
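
A minimal sketch of the key scheme, in modern Python.  The class name
PrefixedTexts, its dict-backed indices, and add_text are hypothetical
illustrations, not bzrlib code; the only thing taken from the discussion
is that the first element of each key tuple names the index to consult.

```python
class PrefixedTexts:
    """Hypothetical stand-in: one in-memory index per kind of text."""

    PREFIXES = ('file', 'revision', 'inventory', 'signature')

    def __init__(self):
        # Each prefix gets its own index; there is no single shared index.
        self._indices = {prefix: {} for prefix in self.PREFIXES}

    def add_text(self, key, data):
        # key looks like ('file', FILE_ID, REVISION) or ('revision', REVISION)
        prefix, rest = key[0], key[1:]
        self._indices[prefix][rest] = data

    def iter_texts(self, keys):
        """Yield (key, bytes) pairs in the order the keys were requested."""
        for key in keys:
            prefix, rest = key[0], key[1:]
            yield key, self._indices[prefix][rest]


texts = PrefixedTexts()
texts.add_text(('file', 'file-id-1', 'rev-1'), b'hello\n')
texts.add_text(('revision', 'rev-1'), b'<revision/>')
found = dict(texts.iter_texts([('revision', 'rev-1'),
                               ('file', 'file-id-1', 'rev-1')]))
```

Supporting a new kind such as 'annotation' would, under this assumption,
only mean adding another prefix and its index.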

>> The format is still experimental, but I am posting it now to make sure
>> that there's general agreement that this is the right way to proceed.
> 
> I would rather see a series of small patches that do cleanups in the API
> of our current repository,

I really can't find my way around the implementation of our current
repository, and I don't want to rewrite the knit delta application code
again.  See below for why I don't want to invest any effort in knit deltas.

But at the moment, I don't want *any* kind of deltas.  I want to use
fulltexts.

> and no new repository type in the short term.
> I've found in the past that conflating API changes with disk changes
> leads to big unwieldy patches that make it hard to reason about the
> design.

I'm trying to optimize text retrieval.  I want to get
Repository.iter_files_bytes screamingly fast before we even look at
delta application.

I want to cut delta application out of the picture, so I'm using
fulltexts.  I want to optimize the process of building trees so that
with a fulltext pack, it's just as fast as building a tree from disk.

Then, when we add delta compression, we can make sure that the overhead
it adds is acceptable.

> I plan to add an experimental type for us all to hammer on

I don't understand.  You don't want a new repository type in the short
term, but you're planning to add one?

> I'm
> really quite unsure that mpdiffs are the right way to go for the bottom
> storage layer.

Bearing in mind that delta compression is really not my interest right
now...

What I know is that
- we have mpdiffs implemented already
- unlike knits, mpdiffs don't need extra metadata to indicate whether a
  file contains a final EOL
- it is easy to extract the number of lines from an mpdiff
- it is easy to extract annotation information from single-parent
  mpdiffs
- it is easy to extract SequenceMatcher.get_matching_blocks output from
  single-parent mpdiffs
- our mpdiff text builder does not require the construction of
  intermediate texts (but can use them)
- our mpdiff text builder could easily be adapted to use packs
- we have efficient APIs for converting knit deltas into mpdiffs
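
Several of the points above can be illustrated with a simplified
single-parent diff, sketched in modern Python.  The hunk types and
helper functions here are hypothetical stand-ins written for this
example, not the bzrlib implementation; the assumption is only that a
diff is a sequence of hunks that either carry new lines or refer to a
range of lines in the parent.

```python
from collections import namedtuple

# Hypothetical simplified hunks (assumed shapes, for illustration only).
NewText = namedtuple('NewText', 'lines')
ParentText = namedtuple('ParentText', 'parent_pos child_pos num_lines')


def num_lines(hunks):
    """Count the lines of the child text without building it."""
    total = 0
    for hunk in hunks:
        if isinstance(hunk, NewText):
            total += len(hunk.lines)
        else:
            total += hunk.num_lines
    return total


def matching_blocks(hunks, parent_len):
    """Return (parent_pos, child_pos, length) triples in the style of
    SequenceMatcher.get_matching_blocks, including the final sentinel."""
    blocks = [(h.parent_pos, h.child_pos, h.num_lines)
              for h in hunks if isinstance(h, ParentText)]
    blocks.append((parent_len, num_lines(hunks), 0))
    return blocks


def to_lines(hunks, parent_lines):
    """Rebuild the child text from the parent's lines and the hunks."""
    lines = []
    for hunk in hunks:
        if isinstance(hunk, NewText):
            lines.extend(hunk.lines)
        else:
            end = hunk.parent_pos + hunk.num_lines
            lines.extend(parent_lines[hunk.parent_pos:end])
    return lines


parent = ['a\n', 'b\n', 'c\n']
hunks = [ParentText(0, 0, 2), NewText(['x\n']), ParentText(2, 3, 1)]
child = to_lines(hunks, parent)
```

In this sketch the lines carried by NewText hunks are exactly the ones
introduced by the child, which is what makes annotation information
cheap to read off a single-parent diff.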

None of this says we can't use something else.  But right now, we don't
*have* anything else, and I have no doubts that mpdiffs are an
improvement over knit deltas.

>> One option that seems interesting to me is to push these operations,
>> including the graph operations, onto Store rather than Repository,
>> because the original design of stores had them providing this data to
>> Repositories.
> 
> I don't like this idea, the store concept has splits by type which
> doesn't actually map to the physical constraints of repositories.

I think stores just didn't evolve properly.  I had the notion of using
('revision', revision_id) tuples as keys into stores quite early on.  I
think we can still make stores the low-level byte-sequence access for
all types if that's what we want.  But if repositories are supposed to
be the high-level API and stores aren't suitable as a low-level API, we
should consider eliminating them.

I think that having a lower-level object providing graph and byte
sequence services would simplify repository implementations, whether
it's called a Store or not.  But that can certainly wait on refactoring.
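
As a rough sketch of what such a lower-level object might look like
(modern Python; the class KeyedByteStore and all of its methods are
hypothetical names invented for this example, not an existing bzrlib
API):

```python
class KeyedByteStore:
    """Hypothetical low-level object: a parents graph plus raw bytes,
    both addressed by key tuples such as ('revision', revision_id)."""

    def __init__(self):
        self._parents = {}
        self._bytes = {}

    def add(self, key, parent_keys, data):
        self._parents[key] = tuple(parent_keys)
        self._bytes[key] = data

    def get_bytes(self, key):
        """Byte-sequence service: return the stored bytes for key."""
        return self._bytes[key]

    def get_parent_map(self, keys):
        """Graph service: map each known key to its parent keys."""
        return {k: self._parents[k] for k in keys if k in self._parents}

    def ancestry(self, key):
        """Return the set containing key and all of its ancestors."""
        seen = set()
        pending = [key]
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            pending.extend(self._parents.get(current, ()))
        return seen


store = KeyedByteStore()
store.add(('revision', 'r1'), [], b'first')
store.add(('revision', 'r2'), [('revision', 'r1')], b'second')
store.add(('revision', 'r3'), [('revision', 'r2'), ('revision', 'r1')],
          b'merge')
ancestors = store.ancestry(('revision', 'r3'))
```

A repository built on something like this would only need to supply
the higher-level operations, leaving key lookup and graph walking to
the lower layer.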

Aaron