Better compression

John Arbash Meinel john at arbash-meinel.com
Sat Jul 26 02:54:34 BST 2008


Robert Collins wrote:
| On Fri, 2008-07-25 at 16:17 -0500, John Arbash Meinel wrote:
...

|> The only problem is that I'm not really sure that zobj.decompress(out)
|> is going to return all the data up to that point. It would make *sense*
|> for it to, but it may not without calling flush.
|
| flush will drop the compression ratio; there is no guarantee that what
| has been output is enough (there could be a very long dict hit and its
| waiting to see what the next match will be). I think that adding 40K to
| the current zlib output position is safe; then clamp the offsets after
| the zlib is finished.

Interesting thought.
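
Something like this is roughly what I'm picturing (untested sketch,
names made up): record the bytes zlib has emitted so far plus ~40K of
slack for whatever it is still buffering, then clamp every offset once
the stream is finally flushed.

import zlib

WINDOW_SLACK = 40 * 1024  # zlib's 32KB window plus headroom (assumption)

class SlackTrackingCompressor(object):
    """Track approximate end offsets without flushing after every text."""

    def __init__(self):
        self._compressor = zlib.compressobj()
        self._out_len = 0        # compressed bytes actually emitted so far
        self._end_offsets = []   # provisional end offset for each text

    def add_text(self, bytes):
        out = self._compressor.compress(bytes)
        self._out_len += len(out)
        # Part of 'bytes' may still be buffered inside zlib, so pad the
        # recorded end offset instead of calling flush().
        self._end_offsets.append(self._out_len + WINDOW_SLACK)
        return out

    def finish(self):
        out = self._compressor.flush()
        self._out_len += len(out)
        # Now that the real compressed size is known, clamp the offsets.
        self._end_offsets = [min(end, self._out_len)
                             for end in self._end_offsets]
        return out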

|
|> |  - calculate the largest prefix needed when given a multiple-text
|> | request and satisfy the stream from the single group object
|>
|> That would be nice. I'm not sure how you do this in cases where they are
|> out-of-order. Like you want to return texts 1 & 2 from hunk A, 3 & 4
|> from hunk B, and then 5 & 6 back from hunk A again.
|>
|> If it was 'unordered' it would be obvious, but 'topological'... do you
|> cache it?
|
| topological will tend to be in the same groups because they are also
| grouped topologically on insert (reverse is still topological :)). We
| probably want a 'topological within group as much as possible' sort.

Yeah, I worked out that topo_sort does keep groups together, but for
non-contiguous ancestry (the first 10, skip 10, the last 10) it is
somewhat random whether all 20 will be kept together, because topo_sort
uses popitem() to start a sequence, returns all of that sequence, and
then pops the next.
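
As a toy illustration (this is not bzrlib's real topo_sort, just the
shape of the behaviour): parents come out before children, but each new
run starts from an arbitrary popitem(), so which disconnected range
comes first is up to the dict.

def toy_topo_sort(parent_map):
    # Toy illustration only: emit parents before children, starting each
    # run from an arbitrary popitem(), so two disconnected ranges of
    # ancestry come out in whatever order the dict hands them back.
    remaining = dict(parent_map)
    seen = set()
    result = []

    def emit(key):
        if key in seen or key not in parent_map:
            return
        seen.add(key)
        for parent in parent_map[key]:
            emit(parent)
        result.append(key)
        remaining.pop(key, None)

    while remaining:
        key, _ = remaining.popitem()   # arbitrary starting point
        emit(key)
    return result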

...

| OTOH 100 decompresses of very small objects should be able to be made
| very cheap - there is not much setup needed in zlib itself. So I think
| it might make sense to not batch up revision objects - just keep them
| 'loose'.

Sure. Right now we don't delta compress them at all, just gzip/zlib
them. It works fairly well.

|
|> |  - C decompressor (we do string splits on large texts that are not
|> | needed, and parsing is often slow in python)
|>
|> As near as I can tell, you only split_lines(plain[start:end]) for the
|> delta chunk. I haven't specifically profiled that portion very much.
|
| Right, but we have 10MB of bytes, then we pull out the delta. It's still
| a copy that isn't needed - a forward reading C decompressor can
| decompress in-place.

Sure.

|
|> At least for log --short -r-100..-1, we spend 400/470ms in
|> zlib.decompress().
|
| interesting
|
|> So doing the prefix thing might help, but not decompressing the same
|> hunks repeatedly helps more. (This is including my patch to cache the
|> last decompression.) We could play with it a bit more using an LRU cache
|> or something like that if we really wanted.
|
| True.

Except that I tried a full dictionary, and for those 100 items we only
*needed* 5 decompresses. So I would tend towards making the total size a
bit smaller.
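
As a sketch of what I mean (made-up names, not what my patch actually
does), something this small would already have covered those 5
decompresses:

import zlib

class DecompressedGroupCache(object):
    """Tiny LRU cache of decompressed groups (sketch, made-up names).

    Repeated record requests against the same group should not pay for
    zlib.decompress() every time; a handful of entries would already
    have covered the 5 decompresses the log --short run needed.
    """

    def __init__(self, max_entries=5):
        self._max_entries = max_entries
        self._plain = {}       # group_key -> decompressed bytes
        self._order = []       # least recently used first

    def get_plain(self, group_key, compressed_bytes):
        if group_key in self._plain:
            self._order.remove(group_key)
            self._order.append(group_key)
            return self._plain[group_key]
        plain = zlib.decompress(compressed_bytes)
        self._plain[group_key] = plain
        self._order.append(group_key)
        if len(self._order) > self._max_entries:
            oldest = self._order.pop(0)
            del self._plain[oldest]
        return plain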

|
|> Of the rest of the time, 36ms is in get_raw_records, and 10ms is in
|> parse. 16ms being in the get_record_stream function itself.
|>
|>
|> |  - 'strings'/'chunked' output mode to avoid string joining on
|> | decompression
|>
|> You actually do that, up until the point that you join them.
|> I think it is specifically "chunks" though you call it 'lines'. So it
|> would be good to change that variable name.
|
| Well my point is to not string join at all; let the caller string join
| if they need it.
|
| -Rob

Yeah, my point too. And then it also allows the old formats to just
return lines. (They're 'chunks' too.)
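
Something along these lines is the shape I have in mind (hand-waved
API, not real bzrlib code):

def iter_text_chunks(records):
    # Hand-waved sketch of a 'chunked' output mode: yield the list of
    # byte chunks we already hold rather than ''.join()ing them.  An old
    # format that stores lines can yield those directly -- a list of
    # lines is just another list of chunks.
    for record in records:
        yield record.key, record.chunks

def to_fulltext(chunks):
    # Only callers that genuinely need one big string pay for the join.
    return ''.join(chunks)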

I also have some thoughts on some problems with GroupCompress, which I'd
like your feedback on.

1) get_record_stream() only works with full-texts. This is mostly
because a groupcompress delta doesn't make sense out of context, and the
context is all of the texts inserted so far. You could cheat a little
bit, but you at least need all the lines that the delta referenced.

This doesn't matter a lot for local or dumb transport access, but it
doesn't give us something good to put over the wire for bzr+ssh access.
I don't think we want to stream 400MB of data for NEWS on a new checkout.
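
To make the problem concrete (hand-waved sketch; apply_delta is just a
stand-in for the real delta decoder, and the offsets are made up):
extracting any single text means decompressing at least the whole
prefix of the group that its delta can reference.

import zlib

def extract_fulltext(group_z_bytes, delta_start, delta_end, apply_delta):
    # The delta stored at [delta_start:delta_end] can copy from anything
    # inserted into the group before it, so the whole prefix has to be
    # decompressed before the delta means anything.
    plain = zlib.decompress(group_z_bytes)    # ideally only up to delta_end
    delta_bytes = plain[delta_start:delta_end]
    return apply_delta(plain[:delta_start], delta_bytes)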

2) Along the same lines, when you copy from one repository to another,
you extract to fulltexts and then recompress back into a new GC object.
I don't know how to make that work with the idea of garbage collection
during branch/push/pull. We *could* always transmit the whole GC group.
We *could* special-case when all keys in a group are requested. Though
we should at least do a minimal amount of error checking before we
accept a hunk into the new repository. (We may not extract all the
texts, but we could at least unzip the chunk.)
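
Roughly this kind of special case (made-up names, only to show the
branch between the two paths):

import zlib

def copy_group(group, requested_keys, target_repo):
    # If the fetch wants every key in the group, ship the raw compressed
    # block after at least proving it unzips; otherwise fall back to
    # extracting fulltexts and letting the target recompress them into a
    # new group.
    wanted = set(requested_keys)
    have = set(group.keys)
    if have <= wanted:
        zlib.decompress(group.z_bytes)          # minimal sanity check
        target_repo.insert_raw_group(group.keys, group.z_bytes)
    else:
        for key in wanted & have:
            target_repo.insert_fulltext(key, group.extract(key))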

3) Out-of-order insertion also makes it hard to be selective during
push/pull.

Anyway, I'm concerned that when this lands, remote dumb-transport
branching will probably be faster, but local branching will be slower
(because of CPU overhead) and bzr+ssh branching will be slower (because
we either work in full-texts or have to re-delta the transmission).

I think we have something good here, but we need to be careful about how
it is going to impact other pieces.

If we chose to copy whole GC objects around, then I would recommend
making them a bit smaller (10MB instead of 20MB, or something like that).

John
=:->
