Better compression
John Arbash Meinel
john at arbash-meinel.com
Sun Jul 27 15:08:24 BST 2008
Robert Collins wrote:
| On Fri, 2008-07-25 at 20:54 -0500, John Arbash Meinel wrote:
|> I also have some thoughts on some problems with GroupCompress, which
|> I'd like your feedback on.
|>
|> 1) get_record_stream() only works in full-texts. This is mostly
|> because groupcompress doesn't have a delta that makes sense out of
|> context, and the context is all texts inserted so far. You could
|> cheat a little bit, but you at least need all the lines that the
|> delta referenced.
|>
|> This doesn't matter a lot for local or dumb transport access, but it
|> doesn't give us something good to put over the wire for bzr+ssh
|> access. I don't think we want to stream 400MB of data for NEWS on a
|> new checkout.
|
| We'd just stream the compressed records; or even just create a new
| stream with only the needed texts (assuming we decide the potential
| privacy concerns of giving extra texts are a problem). We can also check
| that extra texts are already held by the client (in-ancestry) and then
| just emit them anyway. (NB: This clearly only matters *at all* when the
| VFS is able to be disabled for pull operations).
Sure. All of this has me concerned about the CPU overhead of the new
format: extracting and recompressing is not particularly cheap. Though I
agree that inserting into a new repository makes for a good time to
check sha1-sums.
I wonder if we could use the trick I put together for Weaves.
Specifically, 'bzr check' on weaves would create N sha.new() objects and
then feed each line into whichever objects needed it. It seems like we
could do something similar for GC objects, so you could get single-pass
verification of all texts.
The downside is that you would have to *parse* everything first, so you
know which texts each line belongs to. So it could certainly be better
to just process each text as you go.
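Roughly, the single-pass idea would look something like this. This is
only a sketch, assuming we can walk a group once and get (key, chunk)
pairs out as they are decompressed; verify_group and content_iter are
made-up names for illustration, not existing bzrlib API:

from hashlib import sha1

def verify_group(expected, content_iter):
    """Single-pass check of every text in a group.

    expected: dict mapping key -> known sha1 hex digest.
    content_iter: yields (key, chunk_of_bytes) pairs as the group is
    decompressed, in whatever order they come out.
    Returns the set of keys whose reassembled content did not match.
    """
    hashers = dict((key, sha1()) for key in expected)
    for key, chunk in content_iter:
        hashers[key].update(chunk)
    return set(key for key, h in hashers.items()
               if h.hexdigest() != expected[key])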
John
=:->
|
|> 2) Going along the same lines, when you copy from one repository to
|> another, you extract to fulltexts and then recompress back into a new
|> GC object. I don't know how to make that work with the idea of
|> garbage collection during branch/push/pull. We *could* always
|> transmit the whole GC group. We *could* special case when all keys in
|> a group are requested. Though we should at least do a minimal amount
|> of error checking before we accept a hunk into the new repository.
|> (We may not extract all the texts, but we could at least unzip the
|> chunk.)
|
| I *really* like the idea of checking the sha of everything every time.
| It's been a long-standing ergh-compromise that we don't.
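The cheapest pre-acceptance check, as I mentioned above, would just be
making sure the group's compressed payload inflates cleanly before we
write it out. A minimal sketch, assuming the payload is a plain zlib
blob; cheap_group_check is a hypothetical helper, not existing code:

import zlib

def cheap_group_check(compressed_bytes):
    """Cheapest sanity check on an incoming group: make sure the
    compressed payload at least inflates cleanly.  This verifies no
    sha1s at all; it only rejects obviously corrupted hunks.
    """
    try:
        zlib.decompress(compressed_bytes)
    except zlib.error:
        return False
    return True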
|
|> 3) Out-of-order insertion also makes it hard to be selective during
|> push/pull.
|
| Not sure what this means.
|
|> Anyway, I'm concerned that when this lands, remote dumb-transport
|> branching will probably be faster, but local branching (because of CPU
|> overhead) and bzr+ssh branching will be slower (because of either
|> working in full-texts, or having to re-delta the transmission).
|
| Also autopack and pack operations are potentially impacted.
|
|> I think we have something good here, but we need to be careful about
|> how it is going to impact other pieces.
|
| Indeed :)
|
|> If we choose to copy whole GC objects around, then I would recommend
|> making them a bit smaller. (10MB instead of 20MB, or something like
|> that.)
|
| So broadly, here is what we can cheaply do to a group:
| - truncate it (take the first N texts inserted). Keeping the newest
| texts at the front allows incremental pulls to be quite clever.
| - extract N texts
| - append texts
| - send it as-is
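To make it concrete why those four are the cheap ones, here is a toy
model; this is *not* the real GroupCompress layout (where each text is
a delta against everything inserted before it), just an illustration:
every insert only appends to one shared buffer, so appends, prefix
truncation, offset-based extraction, and sending the raw bytes are all
trivial, while dropping or reordering earlier texts would invalidate
whatever follows them.

class ToyGroup(object):
    """Toy model of a group supporting only the cheap operations."""

    def __init__(self):
        self._offsets = []          # (start, end) per inserted text
        self._data = bytearray()

    def append(self, text_bytes):
        start = len(self._data)
        self._data += text_bytes
        self._offsets.append((start, len(self._data)))

    def extract(self, index):
        start, end = self._offsets[index]
        return bytes(self._data[start:end])

    def truncate(self, n):
        # Keep only the first n texts: the only cheap direction,
        # because everything after an entry depends on what precedes it.
        self._offsets = self._offsets[:n]
        end = self._offsets[-1][1] if self._offsets else 0
        self._data = self._data[:end]

    def as_bytes(self):
        # "Send it as-is": the whole buffer goes over the wire untouched.
        return bytes(self._data)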
|
| And the operations we need to perform on/with them:
| - stream over the network
| - fetch via a VFS
| - push via a VFS
| - combine packs to reduce storage
| - extract texts:
| - annotate - topological order (which will tend to read an
| entire group on the first request, then walk closer to the front)
| - build-tree - latest revision of the entire tree (probably randomish
| access)
| - merge - three revs of every file id altered in the merge range, or
| annotate capped at the merge range for the same file ids.
|
| Any that are missed? I have some thoughts on how they all will work, but
| let's try to sketch the boundaries out before dropping into details.
|
| -Rob
|