Better compression

John Arbash Meinel john at arbash-meinel.com
Tue Jul 29 04:31:16 BST 2008


Robert Collins wrote:
| On Fri, 2008-07-25 at 20:54 -0500, John Arbash Meinel wrote:
|> I also have some thoughts on some problems with GroupCompress, which
|> I'd
|> like your feedback on.
|>
|> 1) get_record_stream() only works in full-texts. This is mostly
|> because
|> groupcompress doesn't have a delta that makes sense out of context,
|> and
|> the context is all texts inserted so far. You could cheat a little
|> bit,
|> but you at least need all the lines that the delta referenced.
|>
|> This doesn't matter a lot for local or dumb transport access, but it
|> doesn't give us something good to put over the wire for bzr+ssh
|> access.
|> I don't think we want to stream 400MB of data for NEWS on a new
|> checkout.
|
| We'd just stream the compressed records; or even just create a new
| stream with only the needed texts (assuming we decide the potential
| privacy concerns of giving extra texts is a problem). We can also check
| that extra texts are already held by the client (in-ancestry) and then
| just emit them anyway. (NB: This clearly only matters *at all* when the
| VFS is able to be disabled for pull operations).

So I was talking to Martin about this, and I realized that we can
actually transform the delta without having to extract all of the full
texts. I'll outline it here. In the short term, we should know what the
CPU overhead is before we decide to spend the effort to implement this
fully. But here goes:

1) Group compress starts with a full-text entry

2) For each further text, delta against all existing lines, except for
lines which are control codes, and lines which would match as a copy
but where the insertion is shorter.

3) So in a given delta, we have "copy" instructions and "insert"
instructions. "copy" instructions are given as "start_bytes, num_bytes",
defined in terms of the "output" stream.
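
To make the context-dependence concrete, here is a rough sketch of how
a reader expands one entry back into a fulltext, taking the
instructions as already-parsed (op, payload) pairs. This is a
simplification with made-up names, not the real groupcompress reader:

def expand_entry(output_so_far, instructions):
    # output_so_far is every byte serialized before this point: labels,
    # sha1 lines, instruction lines and inserted content alike, because
    # copy offsets point into all of it.
    chunks = []
    for op, payload in instructions:
        if op == 'i':
            # An insert carries the literal bytes that follow it in the
            # stream.
            chunks.append(payload)
        elif op == 'c':
            start, length = payload
            # A copy only means something relative to the whole stream
            # written so far.
            chunks.append(output_so_far[start:start + length])
    return ''.join(chunks)

The catch is that output_so_far is "all texts inserted so far", which
is exactly what the receiver of a single record does not have.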

The reason this delta isn't portable is that the copy instructions are
based on the texts inserted into the output. However, if you know what
texts the target has, I think it can be easily converted into deltas
against those full texts. When you see "copy bytes 1000-2000", you can
determine what text those bytes came from, and just translate 1000 into
the appropriate offset within that text. The length of the range stays
the same.
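
A minimal sketch of that lookup, assuming we keep a sorted table of
which output-stream byte ranges hold which text's content (building
that table is sketched further down). The table format and names here
are mine, not anything in bzrlib:

import bisect

def translate_offset(regions, output_offset):
    # 'regions' is a sorted list of
    # (output_start, output_end, label, local_start) tuples, meaning
    # that output bytes [output_start, output_end) are the bytes of
    # 'label' starting at 'local_start' within that text.  Assumes the
    # copy starts inside one indexed region.
    starts = [region[0] for region in regions]
    idx = bisect.bisect_right(starts, output_offset) - 1
    output_start, output_end, label, local_start = regions[idx]
    if not (output_start <= output_offset < output_end):
        raise ValueError("offset %d is not in any indexed region"
                         % output_offset)
    return label, local_start + (output_offset - output_start)

With that, "copy bytes 1000-2000" becomes "copy 1000 bytes of <label>
starting at <local offset>", and the length carries over unchanged.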

Here is an example. Assume 3 input texts, "abcd", "bcdefg" and
"cdefghi" (with newlines, etc. in between). Group compress will output
something like:

['label: text1\n',
 'sha1: ...\n',
 'i,5\n',
 'a\n',
 'b\n',
 'c\n',
 'd\n',
 '\n',

 'label: text2\n',
 'sha1: ...\n',
 'c,66,6\n',   # Copy 6 bytes at offset 66
 'i,3\n',      # Insert 3 lines (this will change to a byte count)
 'e\n',
 'f\n',
 'g\n',
 'i,1\n',      # Insert a trailing newline; all texts end with an extra
 '\n',         # newline (so that we always strip 1 newline from every text)

 'label: text3\n',
 'sha1: ...\n',
 'c,68,4\n',   # Copy 4 bytes from offset 68
 'c,144,6\n',  # And 6 bytes from offset 144
 'i,2\n',      # Insert 2 lines
 'h\n',
 'i\n',
 'i,1\n',
 '\n',
]

So you can see that the 'c,144,6' doesn't have much meaning outside of
this output stream, as it depends on text1 and text2 having been
inserted (in that order).

However, a few observations:

1) We will only copy lines that follow an insertion (i.e. lines that
appear literally in the stream after an 'i' instruction).

2) Offset 144 is actually offset 7 in text2. We can compute that by
noticing that the 'c,66,6' instruction already contributed 6 bytes of
text2.

So we could convert "c,144,6" into "c,text2,7,6". The really good thing
is that this form is storage independent, and can be trivially applied
in the target.

Further, if we knew that the target repository didn't have 'text2', we
could just expand it at that point into an insert.
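
A sketch of that rewriting, folding in the expand-to-an-insert case.
Everything here is a placeholder ("c,label,offset,length" as the
portable form, plus the translate/target_has/get_fulltext callables),
not an existing bzrlib API:

def make_portable_delta(instructions, translate, target_has,
                        get_fulltext):
    # translate(output_start) returns (label, offset_in_text) as above,
    # target_has(label) says whether the receiver already has that
    # text, and get_fulltext(label) returns it so a copy can be
    # expanded into a plain insert when it does not.
    portable = []
    for op, payload in instructions:
        if op == 'i':
            portable.append(('i', payload))
        else:  # 'c', payload == (output_start, length)
            output_start, length = payload
            label, local_start = translate(output_start)
            if target_has(label):
                # Storage-independent copy: "c,label,offset,length".
                portable.append(('c', (label, local_start, length)))
            else:
                # The target lacks the referenced text, so inline the
                # bytes as an insert.
                text = get_fulltext(label)
                portable.append(
                    ('i', text[local_start:local_start + length]))
    return portable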

3) As we walk through the stream, we can index just the lines after
'i', so that a copy can be dereferenced into a label and offset. We
know that some lines won't ever be referenced (like the 'i,1\n\n'). We
could trivially change the serializer to use 'I' for those lines, to
tell the deserializer they aren't interesting. I'm not sure how
important that is, but it is a trivial change.
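
Building that index can be a single pass over the serialized group.
The sketch below assumes details this mail leaves open: that 'i'/'I'
payloads are byte counts (the example above notes line counts will
change to byte counts) and that content after 'I' is never referenced:

def index_inserted_regions(stream_lines):
    # Index only the content lines that follow an 'i' instruction,
    # returning (output_start, output_end, label, local_start) tuples
    # of the kind translate_offset() above expects.
    regions = []
    output_offset = 0  # position in the whole serialized stream
    label = None
    text_offset = 0    # position within the current text's fulltext
    pending = 0        # bytes of inserted content still to consume
    indexed = False    # whether the pending content should be indexed
    for line in stream_lines:
        if pending > 0:
            if indexed:
                regions.append((output_offset,
                                output_offset + len(line),
                                label, text_offset))
            text_offset += len(line)
            pending -= len(line)
        elif line.startswith('label: '):
            label = line[len('label: '):].rstrip('\n')
            text_offset = 0
        elif line.startswith('c,'):
            # A copy advances the fulltext but is not itself indexed.
            text_offset += int(line.rstrip('\n').split(',')[2])
        elif line.startswith(('i,', 'I,')):
            pending = int(line.rstrip('\n').split(',')[1])
            indexed = line.startswith('i,')
        output_offset += len(line)
    return regions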

4) I wonder about using a back-reference rather than an offset from 0.
Specifically, instead of 144, we could use something like -20. That
might start to be a bigger win when you are at offset 1000000. It may
be a bit harder to compute, though you could make it relative to the
end of the previous entry, rather than having to track how many bytes
you *will* add to the stream for this entry.
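
If we went that way, the arithmetic is trivial in either direction;
for example (names illustrative):

# Store the copy start relative to the end of the previous entry, so
# the writer never needs to know how many bytes the current entry will
# add to the stream.
def encode_copy_start(absolute_start, previous_entry_end):
    return absolute_start - previous_entry_end  # small negative number

def decode_copy_start(relative_start, previous_entry_end):
    return previous_entry_end + relative_start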

Anyway, it would mean that we can compute a grouping-independent delta
stream without having to re-run delta compression, so it should be
relatively cheap and could certainly be done by the smart server.

John
=:->

|
|> 2) Going along the same lines, when you copy from one repository to
|> another, you extract to fulltexts and then recompress back into a new
|> GC
|> object. I don't know how to make that work with the idea of garbage
|> collection during branch/push/pull. We *could* always transmit the
|> whole
|> GC group. We *could* special case when all keys in a group are
|> requested. Though we should at least do a minimal amount of error
|> checking before we accept a hunk into the new repository. (We may not
|> extract all the texts, but we could at least unzip the chunk.)
|
| I *really* like the idea of checking the sha of everything every time.
| It's been a long-standing ergh-compromise that we don't.
|
|> 3) Out-of-order insertion also makes it hard to be selective during
|> push/pull.
|
| Not sure what this means.
|
|> Anyway, I'm concerned that when this lands, remote dumb-transport
|> branching will probably be faster, but local branching (because of CPU
|> overhead) and bzr+ssh branching will be slower (because of either
|> working in full-texts, or having to re-delta the transmission.)
|
| Also autopack and pack operations are potentially impacted.
|
|> I think we have something good here, but we need to be careful about
|> how
|> it is going to impact other pieces.
|
| Indeed :)
|
|> If we chose to copy whole GC objects around, then I would recommend
|> making them a bit smaller. (10MB instead of 20MB, or something like
|> that.)
|
| So broadly, here is what we can cheaply do to a group:
|  - truncate it (take the N first texts inserted). Keeping the newest
| texts at the front allows incremental pulls to be quite clever.
|  - extract N texts
|  - append texts
|  - send it as-is
|
| And the operations we need to perform on/with them:
|  - stream over the network
|  - fetch via a VFS
|  - push via a VFS
|  - combine packs to reduce storage
|  - extract texts:
|    - annotate - topological order (which will tend to read an
|    entire group on the first request, then walk closer to the front)
|    - build-tree - latest revision of the entire tree (probably randomish
|    access)
|    - merge - three revs of every file id altered in the merge range, or
|    annotate capped at the merge range for the same file ids.
|
| Any that are missed? I have some thoughts on how they all will work, but
| lets try to sketch the boundaries out before dropping into details.
|
| -Rob
|
