Reviewing repository/inventory requirements (was Re: [MERGE][RFC] further add performance improvements)

Robert Collins robertc at robertcollins.net
Thu Jun 1 06:15:58 BST 2006


On Thu, 2006-05-25 at 22:13 -0500, John Arbash Meinel wrote:
...

building a current summary:

Agreed upon:  (John, Robert)

 * The working tree 'manifest' (to use the term that git/hg and a
 couple of others use) needs to be:
    - Fast to load
    - Fast to modify

 * 'Affected fileid search queries'. This is the engine for determining
 what fileid:last-modified-revision pairs were modified by a group of
 revisions when doing 'pull' and 'push'. Currently we satisfy this by
 using the line-orientated deltas from knits/weaves to get the entry
 lines in the xml for the range of revisions, and parsing just the one
 line that gives us the last-modified revision id and the file id. What
 we need here (there's a sketch just after this list) is:
    - Fast!! mapping of revisionid(s) to
      the fileids assigned new last-modifieds in those revision ids.
  
 * There may be more, for example:
    - What revisions reparented/renamed/altered fileid X.
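
To pin down the shape of that query, here's a minimal sketch - all
names are hypothetical, not bzrlib API - of the index we'd want to be
able to maintain and hit without parsing whole inventories:

    # Hypothetical sketch, not bzrlib API: an index maintained at
    # commit time that answers 'which fileid:last-modified pairs did
    # this group of revisions introduce?' without touching inventory
    # xml.
    class AffectedFileIdIndex(object):

        def __init__(self):
            # revision_id -> set of (file_id, last_modified_revision_id)
            self._by_revision = {}

        def record_commit(self, revision_id, changed_file_ids):
            # commit already knows what changed, so maintaining this
            # costs nothing extra
            self._by_revision[revision_id] = set(
                (file_id, revision_id) for file_id in changed_file_ids)

        def affected(self, revision_ids):
            # the pull/push query: union over the group of revisions
            result = set()
            for revision_id in revision_ids:
                result.update(self._by_revision.get(revision_id, ()))
            return result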
 

In discussion:
> >  * RevisionTree inventories. This needs to be:
> >    - Fast to load
> >    - Readonly
> 
> I would also mention 'Archival'. So being compact is helpful.

Possibly a better criterion is 'incrementally small'. It doesn't much
matter if the first inventory is 5 times larger than git/hg's if the
incremental cost is 1/5th - granted that this is a ridiculous scaling
proposition. The point is that 'archival' means 'scales well when
storing many of them', so individual compactness is less important
than aggregate compactness.
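
A quick back-of-envelope (numbers invented purely for illustration): if
a full snapshot costs S and a git/hg-style delta costs d per revision,
while our format costs 5*S up front but only d/5 per revision, then
after N revisions the totals are S + (N-1)*d versus 5*S + (N-1)*d/5.
They cross at N-1 = 5*S/d, so with d = S/10 we're ahead after 50
revisions, and the gap grows linearly from there.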


> > The inventory storage within the repository
> > needs a similar analysis performed : we have explored the space enough
> > to do a good job at this point, documenting the use cases we want to
> > make fast, and the data sizes we encounter. For example, Martin has been
> > thinking about a one-blob-per-revision archive format. I've been
> > thinking about a one-blob-per-transaction archive format (the difference
> > being I want to be able to transfer 50, or 100 or more revisions as a
> > single blob, to allow consolidation of old history into a less
> > inode-intensive form.) 
> 
> I think this is potentially very nice. I'm curious how we would set up
> the index on the outside, so the contents could be found. But creating
> big bundles usually has nice packing effects.
> We could also do something like 7zip archives, where they pack hunks
> together (like tar), but only to a specific extent, so you can still
> seek to something in the middle (like zip).
> 
> It generally means that to get to entry 15 you have to jump to 10 and
> decompress until you get to 15. But if you make some reasonable
> choices, you can keep it cheap to get to your data and still
> have very good compression.

Sure. So the broad need here is to establish for 'inventories' the same
requirements we can now state for 'storage'. That should let us
fine-tune the needs and let us compare what we come up with against
what we need.
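
As a sketch of the hunk-packing you describe - everything here is
hypothetical, not bzr code - compress entries in fixed-size groups and
index the group offsets, so reading one entry means one seek plus one
group's decompression:

    # Hypothetical sketch of group-compressed storage with random
    # access: groups are compressed independently, trading some
    # compression ratio for the ability to seek.
    import zlib

    GROUP_SIZE = 10

    def pack(entries):
        # entries are assumed to be NUL-free byte strings
        index, chunks, offset = [], [], 0
        for start in range(0, len(entries), GROUP_SIZE):
            compressed = zlib.compress(
                b'\0'.join(entries[start:start + GROUP_SIZE]))
            index.append(offset)
            chunks.append(compressed)
            offset += len(compressed)
        return index, b''.join(chunks)

    def read_entry(index, blob, n):
        # decompress only the group containing entry n
        group_no = n // GROUP_SIZE
        start = index[group_no]
        end = (index[group_no + 1] if group_no + 1 < len(index)
               else len(blob))
        return zlib.decompress(blob[start:end]).split(b'\0')[n % GROUP_SIZE]

With GROUP_SIZE 10, getting entry 15 means jumping to the second
group's offset and decompressing ten entries - your "jump to 10 and
decompress until you get to 15".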

> > If we have a single-blob per revision concept, having
> > xml inventories that are delta compressed into these blobs is 'ok', but
> > perhaps we can do better. One possibility: Push the inventory data for
> > containers into the 'content' of each inventory directory. So when I
> > read revision id X, if I had done a commit to file id F then the
> > revision blob for X has an index saying 'F:X'. If I added a file F2 to
> > directory D, then the index has 'D:X', and the versioned data of D
> > contains some serialised form of 'F2:X, file'. (Note that this strawman
> > does not propagate changes up the tree, and is not tuned for all cases
> > yet)
> > 
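To make that strawman concrete, here's an invented serialisation -
purely illustrative, not a proposed format:

    revision blob for X:
        index:    F:X
                  D:X
        content of D at X:    F2:X, file

The blob only carries entries for fileids changed in X; unchanged
directories keep their earlier content (and, as noted, changes are not
propagated up the tree).
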
> > My intent is not to push a specific replacement, but to prompt us to
> > think bigger : RIO will at most give us incremental benefits. It should
> > be a component of our next format transition, but let's take the
> > breathing space knits have given us to do some -real- heavy lifting on
> > the core format to meet all the challenges described above. The hg folk
> > have some experience on performance in this arena too - Brian was saying
> > just the other day that in some places disk seek overhead is enough to
> > be a performance bottleneck - and this is another thing to consider in
> > evaluating the performance of a rethink of this part of the design.
> > 
> 
> If we are interested, I did implement a C++ version of RIO (using
> Boost.Python). It was about 2x faster than the python version, but we
> still had the problem that knits hate lots of lines.

I'm interested, but not hugely: my sense is that we've done an
interesting experiment ('how well does snapshot vs delta storage of
inventories perform?') deeply enough on the snapshot side to draw some
conclusions. (The fact that we store naive deltas between the xml
inventories doesn't stop it being snapshot storage: the system as a
whole talks about inventories as snapshot objects, not as deltas.)
There are a number of interesting possibilities with using actual
deltas as the native storage for inventories IMO:
    - greater commit efficiency (we know what has changed a priori, so
      there is no need to do text 'diff's, which are quadratic);
    - more compact storage (we can diff with greater granularity than
      line by line);
    - safe incremental parallel parsing - by which I mean extracting
      multiple inventories in one pass from a delta stream; this can
      be done with xml but not with existing parsers. (There's a
      sketch of this below.)
Andrew Suffield said way back that using xml as the native format was
silly: that we should use the most efficient format for our needs and
export $format on demand. I've come to agree with him now, *because* I
think snapshot-based storage for inventories is not a good model.
(Trees as snapshots *is* a good model IMO, but not for the inventory
component.)
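
To show what extracting multiple inventories in one pass from a delta
stream could look like, here's a minimal sketch (the delta
representation is invented for illustration, nothing like a real
format):

    # Minimal sketch: one pass over a delta stream yields a snapshot
    # view per revision. Each delta maps file_id -> last-modified
    # revision id, with None meaning the file was removed.
    def extract_inventories(base, deltas):
        inventory = dict(base)
        for delta in deltas:
            for file_id, last_modified in delta.items():
                if last_modified is None:
                    del inventory[file_id]
                else:
                    inventory[file_id] = last_modified
            yield dict(inventory)

    # extract_inventories({'f': 'rev-1'}, [{'g': 'rev-2'}, {'f': 'rev-3'}])
    # -> {'f': 'rev-1', 'g': 'rev-2'}, then {'f': 'rev-3', 'g': 'rev-2'}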

Cheers, Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.