[RFC] killing versioned file.join

Wed Apr 9 03:44:33 BST 2008

So this is a little rambly... I've chatted with poolie and spiv about
this, and this is my attempt to bring all the threads together.

I want to remove VersionedFile.join. It was useful when it was
introduced because of the way fetch was structured, but we really want a
streaming format, which we now have via Repository.get_data_stream.

So, to remove VersionedFile.join, the fetch module needs to start using
Repository.get_data_stream and insert_data_stream.

This is complicated by the need to upcast data when converting between
repository formats, and by ordering requirements in some cases.

On a local fetch operation we can examine both sides and run special
code, but when fetching from a smart server it's more complex, because
with either a newer or older client, we cannot guarantee that any
description of a repository will be understandable.

In what ways do repositories vary today:
 * serialisation of metadata (xml4/5/6/7/8/journalled-inv/...)
 * model of metadata (plain, rich-root, subtrees)
 * atomic insertion, or requiring texts, inventories, signatures, revisions.

And in future:
 * delta logic

To eliminate join locally I need to handle only what we do today; but we
should have something relatively compatible with the future planned
changes.

Now, we can ignore differing serialisations for now - they fall back to
the full tree api.

Model changes go through the plain fetch code path, and so does
knit->pack fetching, as well as knit->knit.

For best performance and memory use, for same-model fetches we want
the following:
knit->knit:
 * all text knit hunks, per versionedfile, topological order
 * ditto inventory
 * ditto signatures
 * ditto revisions
knit->pack:
 * as per knit->knit works fine
pack->knit:
 * as per knit->knit
pack->pack (occurs when fetching from a RemoteRepository only):
 * hunks from each pack, in forward-read IO order
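
To illustrate what "topological order" asks of the stream (a sketch
only, not bzrlib's actual code; topo_order and the sample graph in the
usage note are made up for the example): a hunk is emitted only after
every hunk it builds on, so a target that cannot insert atomically can
add each one as it arrives.

```python
from collections import deque


def topo_order(parents):
    """Return revision ids so that every id comes after all of its parents.

    `parents` maps revision id -> list of parent ids. Parents that are not
    keys of the map (ghosts, or data the target already has) are ignored.
    """
    # Count the in-graph parents each revision is still waiting for.
    pending = {rev: sum(1 for p in ps if p in parents)
               for rev, ps in parents.items()}
    # Reverse index: parent id -> revisions that list it as a parent.
    children = {rev: [] for rev in parents}
    for rev, ps in parents.items():
        for p in ps:
            if p in parents:
                children[p].append(rev)
    # Start with revisions that have no in-graph parents.
    ready = deque(sorted(rev for rev, n in pending.items() if n == 0))
    order = []
    while ready:
        rev = ready.popleft()
        order.append(rev)
        for child in children[rev]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    if len(order) != len(parents):
        raise ValueError("cycle in parent graph")
    return order
```

For a graph such as {"rev-1": [], "rev-2": ["rev-1"], "rev-3":
["rev-1"], "rev-4": ["rev-2", "rev-3"]}, rev-1 comes out first and
rev-4 last.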

For different model fetches, we want the following:
knit->knit:
 * all text knit hunks, per versionedfile, topological order
 * the inventories needed to iterate the revision trees of 
   the revisions being fetched: this means we need the basis
   inventory text, and then the knit hunks.
 * the signature knit hunks in topological order
 * the revision knit hunks in topological order
knit->pack:
 * same as knit->knit
pack->knit:
 * same as knit->knit
pack->pack:
 * same inventory details as knit->knit
 * all text hunks, in optimal IO order
 * signature and revision hunks

So it seems to me that today, we simply need to control two things in
getting a data stream to eliminate join() from fetch:
 - whether we supply data in read-optimal order or non-atomic-insert
   order
 - whether we supply enough data to reconstruct all inventories, or
   not

So -
    repository.get_data_stream_for_search(search, data_order,
                                          complete_inventory)
    data_order in ("read-optimal", "nonatomic-insert")
    complete_inventory in (False, True)

is my proposed replacement API.
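
To make the two knobs concrete, here is a rough sketch of how a caller
in fetch might pick them, following the lists above. Everything in it
other than the two parameter names and their values is hypothetical:
choose_stream_options, the "knit"/"pack" labels and the same_model flag
are made up for illustration. Reading the lists as "pack->pack gets
read-optimal, everything else gets nonatomic-insert" is one
interpretation, not settled design.

```python
DATA_ORDERS = ("read-optimal", "nonatomic-insert")


def choose_stream_options(source_kind, target_kind, same_model):
    """Return (data_order, complete_inventory) for one fetch.

    source_kind and target_kind are "knit" or "pack" (hypothetical labels);
    same_model is True when both repositories use the same metadata model.
    """
    for kind in (source_kind, target_kind):
        if kind not in ("knit", "pack"):
            raise ValueError("unknown repository kind: %r" % (kind,))
    # Only pack->pack takes hunks in the source's forward-read IO order;
    # any knit involved gets per-versionedfile topological order.
    if source_kind == "pack" and target_kind == "pack":
        data_order = "read-optimal"
    else:
        data_order = "nonatomic-insert"
    assert data_order in DATA_ORDERS
    # A model change needs enough inventory data to iterate revision trees.
    complete_inventory = not same_model
    return data_order, complete_inventory
```

The returned pair would then be passed straight through as the
data_order and complete_inventory arguments of the proposed call.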

Thoughts? I plan to hack on this now, so an uncrafted reply now is
better than a crafted one tomorrow.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.