Reviewing repository/inventory requirements was Re: [MERGE][RFC] further add performance improvements

Fri May 26 04:13:38 BST 2006

Robert Collins wrote:
> On Tue, 2006-05-23 at 23:16 -0500, John A Meinel wrote:

...

Thanks for mentioning that we take a quick step back and review. I think
it is good to do.

> 
> There are a couple of things here. We have three quite different uses
> for serialised inventories:
> 
>  * The working tree 'manifest' (to use the term that
> git/hg/a-couple-others-use). This needs to be:
>    - Fast to load
>    - Fast to modify

Agreed.

> 
>  * RevisionTree inventories. This needs to be:
>    - Fast to load
>    - Readonly

I would also mention 'Archival'. So being compact is helpful.

> 
>  * 'affected fileid search queries' This is the engine for determining
> what fileid:last-modified-revision pairs were modified by a group of
> revisions when doing 'pull' and 'push'. Currently we satisfy this by
> using the line orientated deltas from knits/weaves to get the entry
> lines in the xml for the range of revisions, and parse just the one line
> giving us the last-modified revision id and the file id. What we need
> here is:
>    - Fast!! mapping of revisionid(s) to 
>      fileids-assigned-new-last-modifieds.
>  
>  * There may be more for example:
>    - What revisions reparented/renamed/altered fileid X.
> 
> I think when we started it was a good simplifying assumption that one
> format worked for all these use cases. Now I suspect we should revisit
> that. I.e. an on disk binary format for the working tree inventory might
> be a good idea: it has no long term compatability issues (the tree is
> not placed in the archive). For instance, cPickle might be all we need
> to be fast here.

Ack.

> 
> For the repository, I dont know if you've seen my talk notes I did about
> storage, but they are exploring what characteristics we want from the
> repository at a gross level. 

I downloaded them on my Mac, but didn't have OOo setup there. (the 2.x
NeoOffice was supposed to be released for public beta today).

> The inventory storage within the repository
> needs a similar analysis performed : we have explored the space enough
> to do a good job at this point, documenting the use cases we want to
> make fast, and the data sizes we encounter. For example, Martin has been
> thinking about a one-blob-per-revision archive format. I've been
> thinking about a one-blob-per-transaction archive format (the difference
> being I want to be able to transfer 50, or 100 or more revisions as a
> single blob, to allow consolidation of old history into a less-inode
> intensive form.) 

I think this is potentially very nice. I'm curious how we would set up
the index on the outside, so they could be found. But creating big
bundles usually has nice packing effects.
We could also do something like 7zip archives, where they pack hunks
together (like tar), but only to a specific extent, so you can still
seek to something in the middle (like zip).

It generally means that to get to entry 15 you have to jump to 10 and
decompress until you get to 15. But if you make some reasonable
statements, you can make it reasonable to get to your data, and still
have very good compression.

> If we have a single-blob per revision concept, having
> xml inventories that are delta compressed into these blobs is 'ok', but
> perhaps we can do better. One possibility: Push the inventory data for
> containers into the 'content' of each inventory directory. So when I
> read revision id X, if I had done a commit to file id F then the
> revision blob for X has an index saying 'F:X'. If I added a file F2 to
> directory D, then then index has 'D:X', and the versioned data of D
> contains some serialised form of 'F2:X, file'. (Note that this strawman
> does not propogate changes up the tree, and is not tuned for all cases
> yet)
> 
> My intent is not to push a specific replacement, but to prompt us to
> think bigger : RIO will at most give us incremental benefits. It should
> be a component of our next format transition, but lets take the
> breathing space knits have given us to do some -real- heavy lifting on
> the core format to meet all the challenges described above. The hg folk
> have some experience on performance in this arena too - Brian was saying
> just the other day that in some places disk seek overhead is enough to
> be a performance bottleneck - and this is another thing to consider in
> evaluating the performance of a rethink of this part of the design.
> 

If we are interested, I did implement a C++ version of RIO (using
Boost.Python). It was about 2x faster than the python version, but we
still had problems that knits hate lots of lines.

> 
> The second thing here, is that we should be clear when we are
> experimenting to learn the space, and when we are testing for code that
> we want included in the next format bump - the sorts of things I'd look
> for as a reviewer are quite different between these intents :)
> 
> 
> Rob

Sure. You just mentioned cElementTree vs Rio, and I figured I would
explore the space a bit. And I found that you could indeed get to 37%
the write time (3x faster), but I couldn't get the read speed very fast.
And since our primary use case is that we need read to be fast...

And with C++ I could make the read speed pretty fast, but at that point
we were running into Knit and # of lines performance issues.

I'm also curious about writing a manifest format which is fast at what
python is good at. (x.split('\n') is pretty fast, looping over lines
with if/else is not).

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060525/add424ee/attachment.pgp