RFC: get_record_stream and groupcompress

John Arbash Meinel john at arbash-meinel.com
Thu Aug 14 21:34:06 BST 2008

Robert Collins wrote:
> So I've a thorny little tradeoff to encapsulate in code.
> The best approach I've thought of to date is a parameter to
> VF.get_record_stream called 'sort_is_hint'. If you think that that
> sounds fine and dandy, say so now and skip the detail below.

It is okay, but I'm not sure it is the best way. Why can't gc and packs
just pass 'unordered' if the sort is only a hint anyway?

> The get_record_stream sort_order parameter is used by different
> repositories differently.
> Weave and Knit repositories use it to get topologically sorted data
> which they *cannot operate correctly without*.
> Pack[Knit] repositories use it to get data in arbitrary order, as they
> do not have insertion ordering constraints. {though they do have a mild
> preference for topologically ordered to get increasing-offset readvs}.
> GroupCompress suffers badly when given texts that are not
> reverse-topologically sorted. It suffers because it is more complex to
> express three texts [e.g. ABC, AB and A] in increasing-content order,
> and reverse-topological is a good approximation for getting a long run
> of content to refer to up-front. 
> GroupCompress works best, then, when given reverse-topologically sorted
> texts, but getting that sort order from a repository involves
> approximately one IO per pack file per file id, rather than one IO per
> pack file - so it's actually worse than knits were at latency.
> We have some options.
> We could insert in any order from remote repositories; this would tend
> to convert from packs poorly, but a 'pack' operation would fix things
> up. Fetching from a gc repository over SFTP or the smart server would
> tend to behave well because they would preserve as much ordering as the
> base repository has.

That is the route I would take.

> We could buffer to a local store - for instance we could, as git does,
> use some form of file-per-object store [but still using atomic
> insertion to make data visible to other readers] and then compress that
> store.

Sounds like a lot of work to avoid doing "bzr pack".

> We could let the stream generator know that for groupcompress the
> ordering is not essential (it will be correct regardless) - this would
> allow a VF implementation that is accessing data over high latency
> transports (http, sftp etc) choose to optimise for latency, and a VF
> implementation that is accessing over low latency (file:// etc) to
> optimise for ordering, giving best insertion [and potentially network
> transfer] size. The smart server would then be answering requests from
> its low-latency VF store.
> It's this last case that has me attracted to adding a new parameter, but
> perhaps it's better to take the insert-in-any-order route for now, and
> just make sure pack re-orders, so we can pull-and-pack to test size?
> -Rob

My feeling is to cross the bridge when we get there. Let's get basic
groupcompress working. Get fetch working such that it doesn't transmit
all full-texts and require a full recompression of everything.
And then get a "bzr pack" that can figure out how to put everything

I realized the other day that "topo_sort" *doesn't* guarantee grouping
across the whole file_id.
Specifically, it does ancestral grouping for the given key. So if you
had A => B => C => D, and it happened to start with D, you'd get full
grouping, but if it started with B, you would get A & B, then sometime
later you would get C and D.
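
To make the grouping problem concrete, here is a minimal sketch. It is
not the actual "topo_sort" code; ancestral_order, parent_map and the
second file X => Y are hypothetical names made up for illustration. It
assumes the sort visits keys in whatever order it is handed and emits
each key's missing ancestors just before the key itself.

```python
def ancestral_order(parent_map, start_keys):
    """Return keys so that every key follows its parents.

    Keys are visited in the order given by start_keys; each visit emits the
    not-yet-emitted ancestors of that key, then the key itself.
    Assumes the graph is acyclic, as revision history is.
    """
    emitted = []
    seen = set()
    for start in start_keys:
        stack = [start]
        while stack:
            node = stack[-1]
            if node in seen:
                stack.pop()
                continue
            pending = [p for p in parent_map[node] if p not in seen]
            if pending:
                # Parents must be emitted before this node.
                stack.extend(pending)
            else:
                seen.add(node)
                emitted.append(node)
                stack.pop()
    return emitted


# One file with history A => B => C => D, plus an unrelated file X => Y.
parent_map = {
    "A": (), "B": ("A",), "C": ("B",), "D": ("C",),
    "X": (), "Y": ("X",),
}

full_grouping = ancestral_order(parent_map, ["D", "Y"])
split_grouping = ancestral_order(parent_map, ["B", "Y", "D"])
print(full_grouping)   # ['A', 'B', 'C', 'D', 'X', 'Y']
print(split_grouping)  # ['A', 'B', 'X', 'Y', 'C', 'D']
```

Both results are valid topological orders, but only the first keeps the
four texts of the first file next to each other.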

It would also be really neat if we could find a way to do appropriate
cross-file grouping. I can't really think of much from just a file-id
standpoint, though. Given file size, we might try to insert large ones
first, though I would be careful not to insert all the large files into
one chunk and leave only tiny ones for the next.
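
One possible reading of that idea, as a rough sketch only: sort by size,
largest first, then deal the files round-robin into chunks so that each
chunk leads with a large text and no chunk is made up only of tiny ones.
The name spread_by_size, the sample sizes and the round-robin policy are
all assumptions for illustration, not anything groupcompress does.

```python
def spread_by_size(sizes, group_count):
    """Split file names into group_count chunks, largest files first.

    sizes maps a (hypothetical) file name to its size in bytes. Files are
    dealt out round-robin in descending size order, so the large files are
    spread across the chunks instead of filling the first one.
    """
    ordered = sorted(sizes, key=lambda name: (-sizes[name], name))
    groups = [[] for _ in range(group_count)]
    for index, name in enumerate(ordered):
        groups[index % group_count].append(name)
    return groups


sizes = {"big1": 900, "big2": 800, "mid": 300, "small1": 20, "small2": 10}
groups = spread_by_size(sizes, 2)
print(groups)  # [['big1', 'mid', 'small2'], ['big2', 'small1']]
```

Whether this helps compression at all would need measuring. It only
shows the "large first, but spread out" ordering.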

So I think the hint is a nice thing, but I feel it is a bit premature.
