Better compression
Robert Collins
robertc at robertcollins.net
Sat Jul 26 05:50:33 BST 2008
On Fri, 2008-07-25 at 20:54 -0500, John Arbash Meinel wrote:
>
> I also have some thoughts on some problems with GroupCompress, which
> I'd like your feedback on.
>
> 1) get_record_stream() only works in full-texts. This is mostly
> because groupcompress doesn't have a delta that makes sense out of
> context, and the context is all texts inserted so far. You could
> cheat a little bit, but you at least need all the lines that the
> delta referenced.
>
> This doesn't matter a lot for local or dumb transport access, but it
> doesn't give us something good to put over the wire for bzr+ssh
> access. I don't think we want to stream 400MB of data for NEWS on a
> new checkout.
We'd just stream the compressed records, or even create a new stream
with only the needed texts (assuming we decide the potential privacy
concern of sending extra texts is a problem). We can also check that
the extra texts are already held by the client (in-ancestry) and then
just emit them anyway. (NB: this only matters *at all* once the VFS
can be disabled for pull operations.)
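
Roughly the decision involved, as a throwaway sketch -- the names
(group_to_bytes, stream_for_request, client_has_key) and the flat
representation of a group are made up to show the shape, not the real
groupcompress or smart-server code:

import zlib

# Throwaway sketch: a "group" here is just an ordered list of
# (key, fulltext) pairs, compressed into one zlib blob when it goes
# over the wire.

def group_to_bytes(texts):
    return zlib.compress(
        ''.join(fulltext for _, fulltext in texts).encode('utf-8'))

def stream_for_request(group_texts, requested_keys, client_has_key):
    """Yield compressed bytes answering a fetch for requested_keys.

    If every extra text in the group is already in the client's
    ancestry, ship the group verbatim; otherwise rebuild a smaller
    group holding only the requested texts.
    """
    extra = [key for key, _ in group_texts
             if key not in requested_keys]
    if all(client_has_key(key) for key in extra):
        # The client already has the extra texts, so sending the
        # compressed group as-is reveals nothing new and avoids any
        # recompression work.
        yield group_to_bytes(group_texts)
    else:
        # Privacy-conscious path: re-pack only the requested texts.
        yield group_to_bytes([(key, text) for key, text in group_texts
                              if key in requested_keys])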
> 2) Going along the same lines, when you copy from one repository to
> another, you extract to fulltexts and then recompress back into a
> new GC object. I don't know how to make that work with the idea of
> garbage collection during branch/push/pull. We *could* always
> transmit the whole GC group. We *could* special case when all keys
> in a group are requested. Though we should at least do a minimal
> amount of error checking before we accept a hunk into the new
> repository. (We may not extract all the texts, but we could at least
> unzip the chunk.)
I *really* like the idea of checking the sha of everything every time.
It's been a long-standing (ergh) compromise that we don't.
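
For concreteness, the per-record check could be as dumb as the
following (sketch only -- bzrlib has its own sha helpers, and the
stream shape here is invented):

from hashlib import sha1

def checked_insert(stream, add_text):
    """Refuse to insert any record whose sha doesn't match.

    Invented interface: `stream` yields (key, expected_sha, fulltext)
    with fulltext as a byte string, and `add_text` is whatever
    actually writes into the target repository.  The point is just
    that nothing gets in without its sha being recomputed.
    """
    for key, expected_sha, fulltext in stream:
        actual_sha = sha1(fulltext).hexdigest()
        if actual_sha != expected_sha:
            raise ValueError("sha mismatch for %r: wanted %s, got %s"
                             % (key, expected_sha, actual_sha))
        add_text(key, fulltext)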
> 3) Out-of-order insertion also makes it hard to be selective during
> push/pull.
Not sure what this means.
> Anyway, I'm concerned that when this lands, remote dumb-transport
> branching will probably be faster, but local branching (because of CPU
> overhead) and bzr+ssh branching will be slower (because of either
> working in full-texts, or having to re-delta the transmission.)
Also autopack and pack operations are potentially impacted.
> I think we have something good here, but we need to be careful about
> how it is going to impact other pieces.
Indeed :)
> If we choose to copy whole GC objects around, then I would recommend
> making them a bit smaller. (10MB instead of 20MB, or something like
> that.)
So broadly, here is what we can cheaply do to a group (rough sketch
below):
- truncate it (take the first N texts inserted). Keeping the newest
texts at the front allows incremental pulls to be quite clever.
- extract N texts
- append texts
- send it as-is
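
Something like this, as a strawman -- GroupSketch and its methods are
made-up names, and a real group deltas texts against each other before
compressing rather than just concatenating them:

import zlib

class GroupSketch(object):
    """Strawman group: texts kept in insertion order, newest first."""

    def __init__(self, texts=None):
        # [(key, fulltext)], in the order they were inserted; the
        # inserter feeds the newest texts first.
        self._texts = list(texts or [])

    def append(self, key, fulltext):
        # Cheap: later (older) texts just go on the end.
        self._texts.append((key, fulltext))

    def truncate(self, n):
        # Cheap: keep the first N texts inserted -- the newest ones,
        # which is what an incremental pull mostly wants.
        self._texts = self._texts[:n]

    def extract(self, keys):
        # Pull N texts back out by key.
        wanted = set(keys)
        return [(key, text) for key, text in self._texts
                if key in wanted]

    def as_bytes(self):
        # "Send it as-is": one compressed blob for the whole group.
        return zlib.compress(
            ''.join(text for _, text in self._texts).encode('utf-8'))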
And the operations we need to perform on/with them:
- stream over the network
- fetch via a VFS
- push via a VFS
- combine packs to reduce storage
- extract texts:
- annotate - topological order (which will tend to read an
entire group on the first request, then walk closer to the front)
- build-tree - latest revision of the entire tree (probably randomish
access)
- merge - three revs of every file id altered in the merge range, or
annotate capped at the merge range for the same file ids.
Any that I've missed? I have some thoughts on how they all will work,
but let's try to sketch the boundaries out before dropping into
details.
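
As a first pass at those boundaries, the extraction cases above mostly
reduce to a read-planning question, roughly like this (invented names;
key_to_position stands in for whatever index we end up with, and I'm
assuming a text can only be rebuilt from the bytes in front of it in
its group):

def plan_reads(wanted_keys, key_to_position):
    """Return {group_id: how far into that group we must read}.

    key_to_position maps key -> (group_id, end_offset).  Because a
    text can only be rebuilt from the bytes in front of it, the
    deepest wanted text in a group sets how much of that group gets
    read.
    """
    plan = {}
    for key in wanted_keys:
        group_id, end_offset = key_to_position[key]
        plan[group_id] = max(plan.get(group_id, 0), end_offset)
    return plan

# - annotate wants every version of one file, oldest (deepest) first,
#   so its first request reads a whole group and later requests walk
#   back toward the front.
# - build-tree wants the newest text of every file, so it touches the
#   front of many groups.
# - merge wants roughly three versions per altered file id, which
#   lands somewhere in between.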
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.