[MERGE][RFC] further add performance improvements

Thu May 25 00:38:28 BST 2006

On Tue, 2006-05-23 at 23:16 -0500, John A Meinel wrote:
> John A Meinel wrote:
> ...This change actually dropped the time by 5 minutes. We lose a little bit
> of compression, but not much. And our read time is within a factor of 3
> of a C XML implementation. (Our write time is about the same).
> 
> The 2-line inventory is very similar to rio, only using '\t' instead of
> '\n' as the delimiter. (And thus not allowing \n or \t in the value).

There are a couple of things here. We have three quite different uses
for serialised inventories:

 * The working tree 'manifest' (to use the term that git, hg, and a
couple of others use). This needs to be:
   - Fast to load
   - Fast to modify

 * RevisionTree inventories. This needs to be:
   - Fast to load
   - Readonly

 * 'affected fileid search queries'. This is the engine for determining
what fileid:last-modified-revision pairs were modified by a group of
revisions when doing 'pull' and 'push'. Currently we satisfy this by
using the line orientated deltas from knits/weaves to get the entry
lines in the xml for the range of revisions, and parse just the one line
giving us the last-modified revision id and the file id. What we need
here is:
   - Fast!! mapping of revisionid(s) to 
     fileids-assigned-new-last-modifieds.

 * There may be more, for example:
   - What revisions reparented/renamed/altered fileid X.
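To make the third use case concrete, here is a minimal sketch of the kind
of lookup it implies: an index keyed by revision id that answers "which
fileid:last-modified-revision pairs did these revisions touch?" without
parsing inventory text. The class name ModifiedIndex and its methods are
hypothetical, invented for illustration; this is not how knits or weaves
store anything today.

```python
from collections import defaultdict


class ModifiedIndex:
    """Hypothetical index: revision id -> {(file_id, last_modified_revision)}."""

    def __init__(self):
        self._by_revision = defaultdict(set)

    def record(self, revision_id, file_id, last_modified):
        # Note that committing revision_id gave file_id a new last-modified.
        self._by_revision[revision_id].add((file_id, last_modified))

    def affected(self, revision_ids):
        # Union of the pairs for every revision asked about; revisions the
        # index has never seen contribute nothing.
        result = set()
        for revision_id in revision_ids:
            result |= self._by_revision.get(revision_id, set())
        return result


index = ModifiedIndex()
index.record("rev-1", "file-a", "rev-1")
index.record("rev-2", "file-a", "rev-2")
index.record("rev-2", "file-b", "rev-2")
pairs = index.affected(["rev-1", "rev-2"])
```

The cost of a query here depends on the number of revisions asked about
and the entries they changed, not on the size of the whole inventory.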

I think when we started it was a good simplifying assumption that one
format worked for all these use cases. Now I suspect we should revisit
that. I.e. an on disk binary format for the working tree inventory might
be a good idea: it has no long term compatibility issues (the tree is
not placed in the archive). For instance, cPickle might be all we need
to be fast here.
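A rough sketch of what the pickle idea could look like, assuming the
working tree inventory can be represented as plain Python objects (here a
dict of file id to a tuple; that shape is made up for the example). The
function names are hypothetical. In current Python the C implementation
that used to be cPickle is used automatically by the pickle module.

```python
import os
import pickle
import tempfile


def save_inventory(path, inventory):
    # Binary protocol: compact and fast, but only meant to be read back by
    # the same tool, which is acceptable for a working tree.
    with open(path, "wb") as f:
        pickle.dump(inventory, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_inventory(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Illustrative data only: file id -> (kind, path, last-modified revision).
inventory = {
    "file-a": ("file", "README", "rev-1"),
    "dir-d": ("directory", "doc", "rev-2"),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inventory.pickle")
    save_inventory(path, inventory)
    restored = load_inventory(path)
```

Since a pickle should only be loaded from a trusted source, this fits the
local working tree but not data fetched from a remote archive.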

For the repository, I don't know if you've seen the talk notes I wrote
about storage, but they explore what characteristics we want from the
repository at a gross level. The inventory storage within the repository
needs a similar analysis performed: we have explored the space enough
to do a good job at this point, documenting the use cases we want to
make fast, and the data sizes we encounter. For example, Martin has been
thinking about a one-blob-per-revision archive format. I've been
thinking about a one-blob-per-transaction archive format (the difference
being I want to be able to transfer 50, or 100 or more revisions as a
single blob, to allow consolidation of old history into a less-inode
intensive form.) If we have a single-blob per revision concept, having
xml inventories that are delta compressed into these blobs is 'ok', but
perhaps we can do better. One possibility: Push the inventory data for
containers into the 'content' of each inventory directory. So when I
read revision id X, if I had done a commit to file id F then the
revision blob for X has an index saying 'F:X'. If I added a file F2 to
directory D, then the index has 'D:X', and the versioned data of D
contains some serialised form of 'F2:X, file'. (Note that this strawman
does not propagate changes up the tree, and is not tuned for all cases
yet.)
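The strawman can be sketched roughly as follows. Everything here is an
assumption made for illustration: the function name commit_blob, the dict
layout of the blob, and the exact string forms are invented, and only the
'F:X', 'D:X' and 'F2:X, file' shapes come from the description above. As
in the strawman, nothing is propagated up the tree.

```python
def commit_blob(revision_id, modified_files=(), added=()):
    """Build a hypothetical per-revision blob.

    modified_files: file ids whose content changed in this revision.
    added: (directory_id, file_id, kind) tuples for newly added entries.
    """
    index = []
    directories = {}
    for file_id in modified_files:
        # A plain content change is recorded against the file itself.
        index.append(f"{file_id}:{revision_id}")
    for directory_id, file_id, kind in added:
        # An addition changes the containing directory, so the directory
        # gets the index entry and its content describes the new child.
        entry = f"{directory_id}:{revision_id}"
        if entry not in index:
            index.append(entry)
        directories.setdefault(directory_id, []).append(
            f"{file_id}:{revision_id}, {kind}"
        )
    return {"revision": revision_id, "index": index, "directories": directories}


blob = commit_blob("X", modified_files=["F"], added=[("D", "F2", "file")])
```

Reading revision X then only needs the blob's index to see which ids got a
new last-modified, and the content of D to learn about F2.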

My intent is not to push a specific replacement, but to prompt us to
think bigger: RIO will at most give us incremental benefits. It should
be a component of our next format transition, but let's take the
breathing space knits have given us to do some -real- heavy lifting on
the core format to meet all the challenges described above. The hg folk
have some experience on performance in this arena too - Brian was saying
just the other day that in some places disk seek overhead is enough to
be a performance bottleneck - and this is another thing to consider in
evaluating the performance of a rethink of this part of the design.

The second thing here is that we should be clear when we are
experimenting to learn the space, and when we are testing for code that
we want included in the next format bump - the sorts of things I'd look
for as a reviewer are quite different between these intents :)

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.