RFC: get_record_stream and groupcompress
John Arbash Meinel
john at arbash-meinel.com
Thu Aug 14 21:34:06 BST 2008
Robert Collins wrote:
> So I've a thorny little tradeoff to encapsulate in code.
>
> The best approach I've thought of to date is a parameter to
> VF.get_record_stream called 'sort_is_hint'. If you think that that
> sounds fine and dandy, say so now and skip the detail below.
>
It is okay, but I'm not sure it is the best way. Why can't gc and packs
just pass 'unordered' if the sort is a hint anyway?
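To sketch what that would mean in practice: a store asked for 'unordered' is free to return records in its cheapest (e.g. on-disk) order, while 'topological' remains a hard ordering guarantee. The class below is an invented toy, not bzrlib code; only the method name get_record_stream and the ordering values 'unordered'/'topological' are taken from the real VersionedFiles API.

```python
# Toy store (hypothetical, for illustration only): 'unordered'
# requests are served in cheap on-disk order; 'topological'
# requests guarantee parents strictly before children.
class ToyStore:
    def __init__(self, parent_map, disk_order):
        self.parent_map = parent_map  # key -> tuple of parent keys
        self.disk_order = disk_order  # order records sit in the pack

    def get_record_stream(self, keys, ordering):
        keys = set(keys)
        if ordering == "unordered":
            # One cheap pass over the pack file: minimal latency.
            return [k for k in self.disk_order if k in keys]
        # 'topological': depth-first, emitting parents first.
        result, seen = [], set()
        def visit(key):
            if key in seen or key not in keys:
                return
            seen.add(key)
            for parent in self.parent_map.get(key, ()):
                visit(parent)
            result.append(key)
        for key in self.disk_order:
            visit(key)
        return result

store = ToyStore({"B": ("A",), "A": ()}, disk_order=["B", "A"])
store.get_record_stream(["A", "B"], "unordered")    # -> ['B', 'A']
store.get_record_stream(["A", "B"], "topological")  # -> ['A', 'B']
```

A gc or pack fetch that can insert in any order would simply ask for 'unordered' and take whatever order is cheapest for the source.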
> The get_record_stream sort_order parameter is used by different
> repositories differently.
>
> Weave and Knit repositories use it to get topologically sorted data
> which they *cannot operate correctly without*.
>
> Pack[Knit] repositories use it to get data in arbitrary order, as they
> do not have insertion ordering constraints. {though they do have a mild
> preference for topologically ordered to get increasing-offset readvs}.
>
> GroupCompress suffers badly when given texts that are not
> reverse-topologically sorted. It suffers because it is more complex to
> express three texts [e.g. ABC, AB and A] in increasing-content order,
> and reverse-topological is a good approximation for getting a long run
> of content to refer to up-front.
>
> GroupCompress works best then, when given reverse-topologically sorted
> texts, but getting that sort order from a repository involves
> approximately one IO per pack file per file id, rather than one IO per
> pack file - so it's actually worse than knits were at latency.
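The ABC/AB/A point can be made concrete with a toy copy/insert delta. This is a sketch under simplifying assumptions (single copy of the longest shared prefix plus a literal insert), not the real groupcompress encoder: with the newest, largest text emitted first, each older text reduces to a single copy instruction with no literal bytes, while increasing-content order must insert each text's novel suffix literally.

```python
# Toy copy/insert delta (illustrative only, not groupcompress):
# express `new` against previously emitted content as one copy of
# the longest shared prefix plus a literal insert of the remainder.
def delta_instructions(emitted, new):
    common = 0
    for a, b in zip(emitted, new):
        if a != b:
            break
        common += 1
    ops = []
    if common:
        ops.append(("copy", 0, common))
    if common < len(new):
        ops.append(("insert", len(new) - common))
    return ops

A = "a" * 100
AB = A + "b" * 100
ABC = AB + "c" * 100

# Reverse-topological (ABC first): the shorter texts are pure copies.
delta_instructions(ABC, AB)  # [('copy', 0, 200)]
delta_instructions(ABC, A)   # [('copy', 0, 100)]

# Increasing-content order (A first): each later text must insert
# 100 literal bytes of novel content.
delta_instructions(A, AB)    # [('copy', 0, 100), ('insert', 100)]
```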
>
> We have some options.
>
> We could insert in any order from remote repositories, this would tend
> to convert from packs poorly, but a 'pack' operation would fix things
> up. Fetching from a gc repository over SFTP or the smart server would
> tend to behave well because they would preserve as much ordering as the
> base repository has.
That is the route I would take.
>
> We could buffer to a local store - for instance we could do like git
> does some form of file-per-object store [but still using atomic
> insertion to make data visible to other readers] and then compress that
> store.
Sounds like a lot of work to avoid doing "bzr pack".
>
> We could let the stream generator know that for groupcompress the
> ordering is not essential (it will be correct regardless) - this would
> allow a VF implementation that is accessing data over high latency
> transports (http, sftp etc) choose to optimise for latency, and a VF
> implementation that is accessing over low latency (file:// etc) to
> optimise for ordering, giving best insertion [and potentially network
> transfer] size. The smart server would then be answering requests from
> its low-latency VF store.
>
> It's this last case that has me attracted to adding a new parameter, but
> perhaps it's better to take the insert-in-any-order route for now, and
> just make sure pack re-orders, so we can pull-and-pack to test size?
>
> -Rob
My feeling is to cross that bridge when we get there. Let's get basic
groupcompress working. Get fetch working such that it doesn't transmit
all full-texts and require a full recompression of everything.
And then get a "bzr pack" that can figure out how to put everything
"optimally".
I realized the other day that "topo_sort" *doesn't* guarantee grouping
across the whole file_id.
Specifically, it does ancestral grouping for the given key. So if you
had A => B => C => D, and it happened to start with D, you'd get full
grouping, but if it started with B, you would get A & B, then sometime
later you would get C and D.
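A minimal DFS topological sort shows the effect (a sketch, not bzrlib's topo_sort; the keys A-D and the unrelated key X are invented for illustration). Where the iteration happens to start determines whether a file's ancestry comes out in one contiguous run or gets split up with other keys in between:

```python
# Minimal topological sort (parents before children), emitting
# nodes as a DFS from whatever key the input dict iterates first.
def topo_sort(parent_map):
    result, seen = [], set()
    def visit(key):
        if key in seen:
            return
        seen.add(key)
        for parent in parent_map.get(key, ()):
            visit(parent)
        result.append(key)
    for key in parent_map:  # iteration order decides the start point
        visit(key)
    return result

# A => B => C => D, plus an unrelated key X encountered in between.
started_at_b = {"B": ("A",), "A": (), "X": (), "C": ("B",), "D": ("C",)}
topo_sort(started_at_b)  # ['A', 'B', 'X', 'C', 'D'] - not contiguous

started_at_d = {"D": ("C",), "C": ("B",), "B": ("A",), "A": (), "X": ()}
topo_sort(started_at_d)  # ['A', 'B', 'C', 'D', 'X'] - fully grouped
```

Both results are valid topological orders; only the second keeps the whole chain together, which is what matters for compression grouping.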
It would also be really neat if we could find a way to do appropriate
cross-file grouping. I can't really think of much from just a file-id
standpoint, though. Given file size, we might try to insert large ones
first, though I would be careful not to insert all the large files into
one chunk and have only tiny ones for the next.
So I think the hint is a nice thing, but I feel it is a bit premature.
John
=:->