RFC: get_record_stream and groupcompress

Robert Collins robertc at robertcollins.net
Wed Aug 13 03:20:19 BST 2008


So I've a thorny little tradeoff to encapsulate in code.

The best approach I've thought of to date is a parameter to
VF.get_record_stream called 'sort_is_hint'. If you think that sounds
fine and dandy, say so now and skip the detail below.
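As a concrete sketch of the idea (the signature and behaviour here are
illustrative of the proposal, not the real bzrlib API), the parameter
would work something like:

```python
# Illustrative sketch of the proposed 'sort_is_hint' parameter -- this
# is not the real VersionedFiles.get_record_stream signature.
def get_record_stream(keys, sort_order, sort_is_hint=False):
    """Return (key, text) records for keys.

    When sort_is_hint is True, sort_order is a preference only: an
    implementation may return records in whatever order is cheapest
    to read, rather than paying extra IO to honour the requested sort.
    """
    store = {'a': 'text-a', 'b': 'text-b', 'c': 'text-c'}
    if sort_order == 'reverse-topological' and not sort_is_hint:
        # Required order: pay whatever IO is needed to produce it.
        # (Reverse-lexicographic stands in for a real topological sort.)
        keys = sorted(keys, reverse=True)
    return [(key, store[key]) for key in keys]
```

With sort_is_hint=True a high-latency implementation is free to stream
in storage order; with the default False it must deliver the requested
order.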

The get_record_stream sort_order parameter is used by different
repositories differently.

Weave and Knit repositories use it to get topologically sorted data
which they *cannot operate correctly without*.

Pack[Knit] repositories use it to get data in arbitrary order, as they
do not have insertion-ordering constraints (though they do have a mild
preference for topological order, which yields increasing-offset readvs).

GroupCompress suffers badly when given texts that are not
reverse-topologically sorted. It suffers because it is more complex to
express three texts [e.g. ABC, AB and A] in increasing-content order,
and reverse-topological order is a good approximation for putting a
long run of content up-front for later texts to refer to.
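A toy way to see the effect (this is crude substring matching, not the
actual groupcompress delta algorithm): count the literal bytes each
insertion order forces you to store, treating a text whose content is
already present as a pure copy reference.

```python
# Toy model, not groupcompress itself: a text whose bytes already appear
# in previously stored content costs nothing (a copy reference);
# anything else is stored as new literal bytes.
def new_literal_bytes(texts):
    stored = ''
    total = 0
    for text in texts:
        if text not in stored:
            total += len(text)
        stored += text
    return total

# Reverse-topological order: ABC goes first, then AB and A are copies.
reverse_topo = new_literal_bytes(['ABC', 'AB', 'A'])   # 3 literal bytes
# Increasing-content order: every step adds bytes not yet in the store.
forward = new_literal_bytes(['A', 'AB', 'ABC'])        # 6 literal bytes
```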

GroupCompress works best, then, when given reverse-topologically sorted
texts - but getting that sort order from a repository involves
approximately one IO per pack file per file id, rather than one IO per
pack file - so it's actually worse than knits were at latency.
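Back-of-envelope arithmetic for that claim (the pack and file-id counts
here are made up purely for illustration):

```python
# Illustrative round-trip counts only; real numbers depend entirely on
# the repository's shape.
packs = 20
file_ids = 1000

# Producing reverse-topological order costs roughly one IO per pack
# file per file id; arbitrary order costs roughly one IO per pack file.
sorted_ios = packs * file_ids    # 20000 round trips
unsorted_ios = packs             # 20 round trips
```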

We have some options.

We could insert in any order when fetching from remote repositories.
This would tend to convert from packs poorly, but a 'pack' operation
would fix things up. Fetching from a gc repository over SFTP or the
smart server would tend to behave well, because it would preserve as
much ordering as the base repository has.

We could buffer to a local store - for instance we could, like git, use
some form of file-per-object store [but still using atomic insertion to
make data visible to other readers] - and then compress that store.

We could let the stream generator know that for groupcompress the
ordering is not essential (the result will be correct regardless). This
would allow a VF implementation that is accessing data over
high-latency transports (http, sftp etc) to choose to optimise for
latency, and a VF implementation that is accessing over low latency
(file:// etc) to optimise for ordering, giving the best insertion [and
potentially network transfer] size. The smart server would then be
answering requests from its low-latency VF store.
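A minimal sketch of that split (class names and behaviour are
hypothetical; real bzrlib VF implementations are far more involved):

```python
# Hypothetical latency-aware handling of a hinted sort order; nothing
# here is real bzrlib API.
class LocalVF:
    """file:// backed store: seeks are cheap, so honour the preference."""

    def get_record_stream(self, keys, sort_order, sort_is_hint=False):
        if sort_order == 'reverse-topological':
            # Reverse-lexicographic stands in for a real topological sort.
            return sorted(keys, reverse=True)
        return list(keys)


class RemoteVF:
    """http/sftp backed store: round trips dominate, so when the order
    is only a hint, stream in on-disk storage order instead."""

    def __init__(self, storage_order):
        self.storage_order = storage_order

    def get_record_stream(self, keys, sort_order, sort_is_hint=False):
        if sort_is_hint:
            # One pass in storage order beats per-key round trips.
            wanted = set(keys)
            return [k for k in self.storage_order if k in wanted]
        # Order is required: fall back to the expensive sorted read.
        return sorted(keys, reverse=True)
```

The smart server would sit in front of the LocalVF side, so clients
still receive well-ordered streams without paying remote latency.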

It's this last case that has me attracted to adding a new parameter,
but perhaps it's better to take the insert-in-any-order route for now,
and just make sure 'pack' re-orders, so we can pull-and-pack to test
size?

-Rob


-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.