[MERGE] mp-pack repo format
Aaron Bentley
abentley at panoramicfeedback.com
Fri Dec 28 14:32:11 GMT 2007
Robert Collins wrote:
> I think having an 'access texts' API that takes key tuples may be a good
> thing to do. I'd like to note that pack repositories can already do what
> you talk about internally.
I could be wrong, because there are layers upon layers of code, but it
seemed like the knit implementation was constructing the texts, using
packs only for data access.
> I think that the storage layer for byte sequences should be unified some
> more (though it is pretty close to unified already - just needs some
> more mapping work)... I wouldn't call the top level API _iter_texts
> though, as many things will not be 'texts' in the sense we use it
> elsewhere.
We have described the serialized forms of revisions and inventories as
"texts" elsewhere. That is the sense in which I meant it. I'd be glad
of a better name.
> And I definitely do not think that having a single index is a
> good idea, nor requiring two indices for object access.
That's fine, I'm not proposing that. The "prefix" names the index to
use for the "text".
>> To achieve this, _iter_texts uses the prefixes 'file', 'revision',
>> 'inventory', and 'signature' as namespaces to describe the kind of text
>> requested. This should also extend nicely to new types, such as
>> 'annotation'.
>
> I take it these are tuples - ('file', FILE_ID, REVISION) ?
Right. It's just a prefix on the existing index names.
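
To illustrate the prefix idea, here is a minimal sketch. It is not bzrlib
code: the class name PrefixedTexts, its methods, and the in-memory dicts
standing in for indices are all assumptions made for the example. It only
shows how the first element of a key tuple could select which index to
consult.

```python
# Hypothetical sketch, not bzrlib code: plain dicts stand in for real indices.
class PrefixedTexts:
    """Route key tuples to a per-kind index chosen by the key's first element."""

    def __init__(self):
        # One index per kind of text; each maps the rest of the key to bytes.
        self._indices = {
            'file': {},
            'revision': {},
            'inventory': {},
            'signature': {},
        }

    def add(self, key, data):
        prefix, rest = key[0], key[1:]
        self._indices[prefix][rest] = data

    def iter_texts(self, keys):
        """Yield (key, bytes) for each requested key tuple, in request order."""
        for key in keys:
            prefix, rest = key[0], key[1:]
            yield key, self._indices[prefix][rest]


texts = PrefixedTexts()
texts.add(('file', 'file-id-1', 'rev-1'), b'hello\n')
texts.add(('revision', 'rev-1'), b'<revision/>')
result = dict(texts.iter_texts([
    ('revision', 'rev-1'),
    ('file', 'file-id-1', 'rev-1'),
]))
```

Under this assumption, supporting a new kind such as 'annotation' would
only mean registering one more index under that prefix.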
>> The format is still experimental, but I am posting it now to make sure
>> that there's general agreement that this is the right way to proceed.
>
> I would rather see a series of small patches that do cleanups in the API
> of our current repository,
I really can't find my way around the implementation of our current
repository, and I don't want to rewrite the knit delta application code
again. See below for why I don't want to invest any effort in knit deltas.
But at the moment, I don't want *any* kind of deltas. I want to use
fulltexts.
> and no new repository type in the short term.
> I've found in the past that conflating API changes with disk changes
> leads to big unwieldy patches that make it hard to reason about the
> design.
I'm trying to optimize text retrieval. I want to get
Repository.iter_files_bytes screamingly fast before we even look at
delta application.
I want to cut delta application out of the picture, so I'm using
fulltexts. I want to optimize the process of building trees so that
with a fulltext pack, it's just as fast as building a tree from disk.
Then, when we add delta compression, we can make sure that the overhead
it adds is acceptable.
> I plan to add an experimental type for us all to hammer on
I don't understand. You don't want a new repository type in the short
term, but you're planning to add one?
> I'm
> really quite unsure that mpdiffs are the right way to go for the bottom
> storage layer.
Bearing in mind that delta compression is really not my interest right
now...
What I know is that:
- we have mpdiffs implemented already
- unlike knits, mpdiffs don't need extra metadata to indicate whether a
  file contains a final EOL
- it is easy to extract the number of lines from an mpdiff
- it is easy to extract annotation information from single-parent
  mpdiffs
- it is easy to extract SequenceMatcher.get_matching_blocks output from
  single-parent mpdiffs
- our mpdiff text builder does not require the construction of
  intermediate texts (but can use them)
- our mpdiff text builder could easily be adapted to use packs
- we have efficient APIs for converting knit deltas into mpdiffs
None of this says we can't use something else. But right now, we don't
*have* anything else, and I have no doubts that mpdiffs are an
improvement over knit deltas.
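
Some of the "easy to extract" points above can be sketched with a
simplified model of a diff as a list of hunks. The hunk types and
function names below are hypothetical and are not the bzrlib.multiparent
classes. A hunk is assumed to be either new lines or a run of lines
copied from a parent. The matching blocks use the (a, b, size) shape of
SequenceMatcher.get_matching_blocks, without its terminating sentinel.

```python
from collections import namedtuple

# Simplified, hypothetical hunk types; not the bzrlib.multiparent classes.
NewText = namedtuple('NewText', 'lines')
ParentText = namedtuple('ParentText', 'parent parent_pos num_lines')


def num_lines(hunks):
    """Count lines in the child text without building the text itself."""
    return sum(len(h.lines) if isinstance(h, NewText) else h.num_lines
               for h in hunks)


def annotate_single_parent(hunks, parent_annotations, revision):
    """Return one revision label per child line for a single-parent diff."""
    out = []
    for h in hunks:
        if isinstance(h, NewText):
            # Lines introduced by this diff belong to the new revision.
            out.extend([revision] * len(h.lines))
        else:
            # Copied lines keep whatever label they had in the parent.
            end = h.parent_pos + h.num_lines
            out.extend(parent_annotations[h.parent_pos:end])
    return out


def matching_blocks(hunks):
    """Return (parent_pos, child_pos, size) triples for the copied runs."""
    blocks = []
    child_pos = 0
    for h in hunks:
        if isinstance(h, NewText):
            child_pos += len(h.lines)
        else:
            blocks.append((h.parent_pos, child_pos, h.num_lines))
            child_pos += h.num_lines
    return blocks


# Parent: ['a\n', 'b\n', 'c\n']; child: ['a\n', 'x\n', 'c\n', 'd'].
# The last child line has no trailing newline; it is just stored as a line.
hunks = [
    ParentText(0, 0, 1),
    NewText(['x\n']),
    ParentText(0, 2, 1),
    NewText(['d']),
]
line_count = num_lines(hunks)
annotations = annotate_single_parent(hunks, ['r1', 'r1', 'r1'], 'r2')
blocks = matching_blocks(hunks)
```

Because lines are stored with their own line endings in this model, a
missing final EOL needs no separate flag.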
>> One option that seems interesting to me is to push these operations,
>> including the graph operations, onto Store rather than Repository,
>> because the original design of stores had them providing this data to
>> Repositories.
>
> I don't like this idea, the store concept has splits by type which
> doesn't actually map to the physical constraints of repositories.
I think stores just didn't evolve properly. I had the notion of using
('revision', revision_id) tuples as keys into stores quite early on. I
think we can still make stores the low-level byte-sequence access for
all types if that's what we want. But if repositories are supposed to
be the high-level API and stores aren't suitable as a low-level API, we
should consider eliminating them.
I think that having a lower-level object providing graph and byte
sequence services would simplify repository implementations, whether
it's called a Store or not. But that can certainly wait on refactoring.
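
As a rough sketch of what such a lower-level object might look like (the
class name ByteStore and its methods are assumptions for illustration,
not an existing bzrlib API), a single in-memory mapping keyed by tuples
can answer both byte-sequence and graph queries:

```python
# Hypothetical sketch: one object serving bytes and parent-graph queries.
class ByteStore:
    def __init__(self):
        # key tuple -> (tuple of parent keys, bytes)
        self._data = {}

    def add(self, key, parents, data):
        self._data[key] = (tuple(parents), data)

    def get_bytes(self, key):
        return self._data[key][1]

    def get_parents(self, key):
        return self._data[key][0]

    def ancestry(self, key):
        """Return the set containing key and all of its ancestors."""
        seen = set()
        pending = [key]
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            pending.extend(self.get_parents(current))
        return seen


store = ByteStore()
store.add(('revision', 'rev-1'), [], b'first')
store.add(('revision', 'rev-2'), [('revision', 'rev-1')], b'second')
store.add(('revision', 'rev-3'), [('revision', 'rev-2')], b'third')
ancestors = store.ancestry(('revision', 'rev-3'))
```

A repository built on something like this would only need to map its
high-level operations onto key tuples.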
Aaron