Better compression

Sat Jul 26 05:50:33 BST 2008

On Fri, 2008-07-25 at 20:54 -0500, John Arbash Meinel wrote:
> 
> I also have some thoughts on some problems with GroupCompress, which I'd
> like your feedback on.
> 
> 1) get_record_stream() only works in full-texts. This is mostly because
> groupcompress doesn't have a delta that makes sense out of context, and
> the context is all texts inserted so far. You could cheat a little bit,
> but you at least need all the lines that the delta referenced.
> 
> This doesn't matter a lot for local or dumb transport access, but it
> doesn't give us something good to put over the wire for bzr+ssh access.
> I don't think we want to stream 400MB of data for NEWS on a new checkout.

We'd just stream the compressed records, or even create a new stream
with only the needed texts (assuming we decide that the potential
privacy concerns of giving out extra texts are a problem). We can also
check whether the extra texts are already held by the client
(in-ancestry) and, if so, just emit them anyway. (NB: This clearly only
matters *at all* when the VFS can be disabled for pull operations.)
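
To make that decision concrete, here is a rough sketch of the per-group
choice. Everything in it is hypothetical (plan_group, the key sets, the
three outcome names); it is not bzrlib API, just the logic described
above: send a group untouched when every extra text in it is already
held by the client, otherwise build a new stream of only the needed
texts.

```python
def plan_group(group_keys, requested, client_has):
    """Decide how to serve one compressed group (hypothetical helper).

    group_keys: keys of every text stored in the group
    requested:  keys the client asked for
    client_has: keys the client already holds (e.g. in its ancestry)
    """
    group_keys = set(group_keys)
    wanted = group_keys & set(requested)
    if not wanted:
        # Nothing in this group was asked for.
        return ("skip", [])
    extras = group_keys - wanted
    if extras <= set(client_has):
        # Any extra texts are ones the client already has, so sending the
        # compressed group unchanged leaks nothing new.
        return ("send_as_is", sorted(group_keys))
    # Otherwise build a new stream holding only the needed texts.
    return ("rebuild", sorted(wanted))
```

Whether the rebuild branch is worth its CPU cost is exactly the privacy
question raised above; if extra texts are not a concern, it collapses
into send_as_is.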

> 2) Going along the same lines, when you copy from one repository to
> another, you extract to fulltexts and then recompress back into a new GC
> object. I don't know how to make that work with the idea of garbage
> collection during branch/push/pull. We *could* always transmit the whole
> GC group. We *could* special case when all keys in a group are
> requested. Though we should at least do a minimal amount of error
> checking before we accept a hunk into the new repository. (We may not
> extract all the texts, but we could at least unzip the chunk.)

I *really* like the idea of checking the sha of everything every time.
It's been a long-standing ergh-compromise that we don't.
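
As a sketch of what "check everything" could look like on receipt of a
hunk, assume a deliberately simplified group: a zlib-compressed
concatenation of full texts, plus a manifest of (length, sha1) pairs in
the same order. The real groupcompress format stores deltas, so this
layout, verify_group and the manifest shape are all assumptions made
for illustration.

```python
import hashlib
import zlib


def verify_group(compressed, manifest):
    """Unzip a group and check every text against its expected SHA-1.

    manifest: list of (length, sha1_hexdigest), one entry per text, in
    the order the texts appear in the group (an assumed layout).
    Returns the list of verified texts; raises ValueError on mismatch.
    """
    try:
        raw = zlib.decompress(compressed)  # the minimal "unzip the chunk" check
    except zlib.error as exc:
        raise ValueError("corrupt group: %s" % exc)
    texts = []
    offset = 0
    for length, expected_sha in manifest:
        text = raw[offset:offset + length]
        if len(text) != length:
            raise ValueError("group is shorter than its manifest")
        if hashlib.sha1(text).hexdigest() != expected_sha:
            raise ValueError("sha mismatch at offset %d" % offset)
        texts.append(text)
        offset += length
    if offset != len(raw):
        raise ValueError("group has trailing bytes not in the manifest")
    return texts
```

The cost is one decompress plus one hash pass over the data, which is
the price to weigh against the compromise of not checking.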

> 3) Out-of-order insertion also makes it hard to be selective during
> push/pull.

Not sure what this means.

> Anyway, I'm concerned that when this lands, remote dumb-transport
> branching will probably be faster, but local branching (because of CPU
> overhead) and bzr+ssh branching will be slower (because of either
> working in full-texts, or having to re-delta the transmission.)

Also autopack and pack operations are potentially impacted.

> I think we have something good here, but we need to be careful about how
> it is going to impact other pieces.

Indeed :)

> If we chose to copy whole GC objects around, then I would recommend
> making them a bit smaller. (10MB instead of 20MB, or something like
> that.)

So broadly, here is what we can cheaply do to a group:
 - truncate it (take the N first texts inserted). Keeping the newest
texts at the front allows incremental pulls to be quite clever.
 - extract N texts
 - append texts
 - send it as-is
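
A toy model of those four operations, to show why they can be cheap.
It is not the groupcompress format: ToyGroup, its index layout and the
extract helper are invented here, and it uses one plain zlib stream
with a sync flush after each text instead of deltas. The flush means
any prefix of the compressed bytes that ends on a text boundary can be
decompressed on its own, so truncation is a byte slice with no
recompression.

```python
import zlib


class ToyGroup:
    """One zlib stream holding several texts, flushed after each one."""

    def __init__(self):
        self._compressor = zlib.compressobj()
        self._data = b""   # compressed bytes so far
        self._index = []   # (key, start, end, compressed_end) per text

    def append(self, key, text):
        """Append a (non-empty) text; later texts share the zlib window."""
        start = self._index[-1][2] if self._index else 0
        self._data += self._compressor.compress(text)
        # Z_SYNC_FLUSH makes everything written so far decodable.
        self._data += self._compressor.flush(zlib.Z_SYNC_FLUSH)
        self._index.append((key, start, start + len(text), len(self._data)))

    def send_as_is(self):
        """Hand over the compressed bytes and index untouched."""
        return self._data, list(self._index)

    def truncate(self, n):
        """Keep only the first n texts inserted, without recompressing."""
        if n <= 0:
            return b"", []
        kept = self._index[:n]
        return self._data[:kept[-1][3]], kept


def extract(data, index, keys):
    """Decompress a (possibly truncated) group and pull out some texts."""
    raw = zlib.decompressobj().decompress(data)
    wanted = set(keys)
    return {key: raw[start:end]
            for key, start, end, _ in index if key in wanted}
```

With the newest texts inserted first, truncate(n) returns exactly the
n newest texts, which is the property an incremental pull would want.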

And the operations we need to perform on/with them:
 - stream over the network
 - fetch via a VFS
 - push via a VFS
 - combine packs to reduce storage 
 - extract texts:
   - annotate - topological order (which will tend to read an
   entire group on the first request, then walk closer to the front)
   - build-tree - latest revision of the entire tree (probably randomish
   access)
   - merge - three revs of every file id altered in the merge range, or
   annotate capped at the merge range for the same file ids.

Any that are missed? I have some thoughts on how they all will work, but
let's try to sketch the boundaries out before dropping into details.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.