RFC: versionedfile overhaul

Mon Mar 17 01:19:46 GMT 2008

Aaron and I got time at the sprint to talk about knits, and I thought
I'd summarise that conversation, and some additional things we didn't
discuss, as a proposal for overhauling 'versionedfile'.

Background
----------

VersionedFile is the abstract interface bzrlib uses to work with
historical data. It was a generalisation of the 'weave' interface, which
we transitioned to from a pure 'stores have texts' interface way back.

VersionedFile talks about a single file at a time, and has two
implementations:
 * weaves
 * knits

We use versionedfiles to store file texts, serialised inventories and
revisions, and digital signatures. The VersionedFile interface is
public, because VersionedFiles are returned from various public
repository methods, and all Repository implementations need to have them
to work with e.g. bzrk.

Knits
-----

Knits started out as an append-only weave, but have been tweaked and now
provide quite abstract delta composition with arbitrary disk and index
backends. Further changes are anticipated over the next few months to
deliver the planned refinements to delta storage.

Current issues
--------------

There are a number of issues with VersionedFile today, and knits in
particular. Aaron has been working on tuning 'build tree', which spends
an inordinate amount of time in a combination of IO and Python logic;
the current knit interface is driving a large part of this.

Annotation/non-annotation is not all that cleanly separated. In
particular there is a mix of abstract methods and if-else blocks in use,
and we have a recurring theme of discovering performance bugs related to
annotation.

Knit delta composition creates many small objects; during pack
optimisation I made huge wins by reducing object thrashing (basic types
are up to 30 times faster than objects in Python), and the essentially
static disk data is a candidate for further performance wins here.

Deltas and IO are constrained to single logical file operations. This
causes latency multiplication - on a lightweight checkout for example,
build-tree will do one set of readvs per file being created on local
disk, rather than one per pack or so. And files with the same content
but different ids cannot be used for deltas (whether this is desirable
should be policy, not technology).
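
As a rough illustration of the latency point, here is a minimal sketch of
grouping read requests by pack so that each pack is read once rather than
once per file. The function name, the pack names and the
(pack_name, offset, length) request shape are all assumptions made for the
example, not bzrlib APIs.

```python
from collections import defaultdict


def group_reads_by_pack(requests):
    """Group (pack_name, offset, length) requests by pack.

    Hypothetical helper: one entry per pack means one readv-style round
    trip per pack instead of one per file being built.
    """
    by_pack = defaultdict(list)
    for pack_name, offset, length in requests:
        by_pack[pack_name].append((offset, length))
    # Sort the ranges within each pack so a reader can walk forward.
    return {pack: sorted(ranges) for pack, ranges in by_pack.items()}


# Three file texts that happen to live in two packs (made-up data).
requests = [
    ("pack-a", 400, 50),
    ("pack-b", 0, 120),
    ("pack-a", 0, 100),
]
batched = group_reads_by_pack(requests)
```

With per-file operations the three texts above cost three round trips; once
grouped, the same data needs two, one per pack.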

Things to do
------------

Aaron and I agreed on a cautious first step, which is to change the keys
used in the VersionedFile interface from strings to tuples of strings;
this will be used to create a single 'Knit' for the file texts in a
repository, rather than one knit per file-id (with a key for a single
text such as (fileid, versionid)). Aaron would like to move to a single
key-space such as ('text', fileid, versionid); however I think this is
significantly harder to do due to the current index layer (and I'm not
convinced a single keyspace is really a good idea, but that's a different
discussion - we are both agreed that a single keyspace for all file
texts _is_ good).

Stacked branches require this single keyspace change to perform at all
reasonably on smart servers. So I'm very happy to do the work right now.

I have some further thoughts that I've been mulling over. Here's a
proposal...

Overall overhaul shape
----------------------

 * Audit and shrink the versionedfile interface, deprecating for removal
in 1.5 things we don't like.
 * Change the keyspace from strings to N-tuples of strings, where a
given instance has its own N. (1 for revisions, signatures, inventories,
2 for file texts).
 * Provide a thunk layer for compatibility with weaves and during
transition of knit code.
 * Remove non-ghost-aware methods where possible, pushing special casing
of ghosts up to the calling layers.
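
To make the keyspace and ghost points concrete, here is a minimal sketch
of a tuple-keyed store where each instance fixes its own N, and where
absent keys (ghosts) are simply left out of the result so the caller
decides what to do about them. ToyVersionedFiles, key_length and the
method names are hypothetical and only stand in for whatever the real
interface ends up looking like.

```python
class ToyVersionedFiles:
    """Hypothetical tuple-keyed store; not the bzrlib API."""

    def __init__(self, key_length):
        # N for this instance: e.g. 1 for revisions, 2 for file texts.
        self.key_length = key_length
        self._parents = {}  # key tuple -> tuple of parent key tuples

    def _check_key(self, key):
        if not isinstance(key, tuple) or len(key) != self.key_length:
            raise ValueError(
                "expected a %d-tuple key, got %r" % (self.key_length, key))

    def add(self, key, parent_keys):
        self._check_key(key)
        for parent in parent_keys:
            self._check_key(parent)
        self._parents[key] = tuple(parent_keys)

    def get_parent_map(self, keys):
        """Return parents for present keys only; ghosts are just absent."""
        return {key: self._parents[key]
                for key in keys if key in self._parents}


texts = ToyVersionedFiles(key_length=2)
# rev-1 is referenced as a parent but never added, so it is a ghost.
texts.add(("file-id", "rev-2"), [("file-id", "rev-1")])

wanted = [("file-id", "rev-2"), ("file-id", "rev-1")]
found = texts.get_parent_map(wanted)
# The calling layer, not the store, works out which keys were ghosts.
ghosts = [key for key in wanted if key not in found]
```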

Private Changes
---------------

These are changes which do not alter the public API.

Fewer objects
^^^^^^^^^^^^^

Inside knits, I'd like to get rid of the high object churn rate. I
propose to make the 'Content' objects be statically allocated at the
loading of bzr, and instead call into them via a tuple based api. For
instance::

     def copy(self):
        return PlainKnitContent(self._lines[:], self._version_id)

would become::

     @staticmethod
     def copy(content):
        return (content[0], content[1], content[2][:])

for a plain text content object.

This is a lowish priority thing to do. I think it's well worth testing
the performance impact though: we saved nearly 10% on commit when we did
the same basic optimisation at the KnitVersionedFile level for pack
based commits. This is only worth doing and testing once other issues
are out of the way.
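
A self-contained sketch of the tuple-based shape, for experimenting with
before any real conversion. The layout (kind, version_id, lines) is an
assumption: the snippet above only implies that the lines sit at index 2.
The function names are likewise made up for the example.

```python
# Assumed layout: (kind, version_id, lines). Hypothetical, see lead-in.
def make_plain_content(version_id, lines):
    """Build a plain content tuple instead of a Content object."""
    return ("plain", version_id, list(lines))


def copy_content(content):
    """Copy a content tuple; the lines list is duplicated, not shared."""
    return (content[0], content[1], content[2][:])


original = make_plain_content("rev-1", ["a\n", "b\n"])
duplicate = copy_content(original)
# Mutating the copy's lines must not affect the original.
duplicate[2].append("c\n")
```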

Thunk layer
^^^^^^^^^^^

The entire current VersionedFile api will become a thunk layer, because
the change from string -> tuple of strings is an incompatible change if
done on the same function names, *and* it's impossible for an old object
to access the entire new keyspace in some cases. So the transition will
need an implementation of the old string based keyspace which converts
to the new N-tuple based keys and calls into the new code.
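
One possible shape for such a thunk, as a minimal sketch: an adapter that
presents string keys for a single file and prepends a fixed prefix before
calling into a tuple-keyed store. StringKeyThunk, its method names and the
use of a plain dict as the "new" store are all hypothetical and chosen
only to keep the example runnable.

```python
class StringKeyThunk:
    """Hypothetical adapter: old string keys over a tuple-keyed store."""

    def __init__(self, store, prefix):
        self._store = store            # any mapping keyed by tuples
        self._prefix = tuple(prefix)   # e.g. ("file-a",) for one file

    def _to_key(self, version_id):
        # Convert the old string key to the new N-tuple key.
        return self._prefix + (version_id,)

    def store_lines(self, version_id, lines):
        self._store[self._to_key(version_id)] = list(lines)

    def fetch_lines(self, version_id):
        return self._store[self._to_key(version_id)]

    def versions(self):
        # Only keys under this prefix are visible through the old view.
        n = len(self._prefix)
        return sorted(key[-1] for key in self._store
                      if len(key) == n + 1 and key[:n] == self._prefix)


store = {}  # stands in for the new tuple-keyed implementation
file_a = StringKeyThunk(store, ("file-a",))
file_b = StringKeyThunk(store, ("file-b",))
file_a.store_lines("rev-1", ["hello\n"])
file_b.store_lines("rev-1", ["other\n"])
```

Note that file_a can only reach keys under its own prefix, which is the
sense in which an old-style object cannot see the whole new keyspace.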

Public changes
--------------

There is really only one public change: VersionedFile goes away and is
replaced by VersionedFiles, the new N-tuple keyspace based interface.
VersionedFile is, in its entirety, deprecated. All bzrlib code is
migrated in situ.

I envisage 4 branches:
 * new VersionedFiles, KnitVersionedFiles and thunk layer.
 * expose those from repository
 * convert all callers to use the new apis
 * deprecate old api

Thoughts?
-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.