Better compression

John Arbash Meinel john at arbash-meinel.com
Sat Jul 26 02:54:34 BST 2008


Robert Collins wrote:
| On Fri, 2008-07-25 at 16:17 -0500, John Arbash Meinel wrote:
...

|> The only problem is that I'm not really sure that zobj.decompress(out)
|> is going to return all the data up to that point. It would make *sense*
|> for it to, but it may not without calling flush.
|
| flush will drop the compression ratio; there is no guarantee that what
| has been output is enough (there could be a very long dict hit and its
| waiting to see what the next match will be). I think that adding 40K to
| the current zlib output position is safe; then clamp the offsets after
| the zlib is finished.

Interesting thought.
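
Something like this is roughly what I'm picturing (untested sketch,
names made up): record the bytes zlib has emitted so far plus ~40K of
slack for whatever it is still buffering, then clamp every offset once
the stream is finally flushed.

import zlib

WINDOW_SLACK = 40 * 1024  # zlib's 32KB window plus headroom (assumption)

class SlackTrackingCompressor(object):
    """Track approximate end offsets without flushing after every text."""

    def __init__(self):
        self._compressor = zlib.compressobj()
        self._out_len = 0        # compressed bytes actually emitted so far
        self._end_offsets = []   # provisional end offset for each text

    def add_text(self, bytes):
        out = self._compressor.compress(bytes)
        self._out_len += len(out)
        # Part of 'bytes' may still be buffered inside zlib, so pad the
        # recorded end offset instead of calling flush().
        self._end_offsets.append(self._out_len + WINDOW_SLACK)
        return out

    def finish(self):
        out = self._compressor.flush()
        self._out_len += len(out)
        # Now that the real compressed size is known, clamp the offsets.
        self._end_offsets = [min(end, self._out_len)
                             for end in self._end_offsets]
        return out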

|
|> |  - calculate the largest prefix needed when given a multiple-text
|> | request and satisfy the stream from the single group object
|>
|> That would be nice. I'm not sure how you do this in cases where they are
|> out-of-order. Like you want to return texts 1 & 2 from hunk A, 3 & 4
|> from hunk B, and then 5 & 6 back from hunk A again.
|>
|> If it was 'unordered' it would be obvious, but 'topological'... do you
|> cache it?
|
| topological will tend to be in the same groups because they are also
| grouped topologically on insert (reverse is still topological :)). We
| probably want a 'topological within group as much as possible' sort.

Yeah, I worked out that topo_sort does keep groups together, but for
non-contiguous ancestry (the first 10, skip 10, the last 10) it is
somewhat random whether all 20 will be kept together, because topo_sort
uses popitem() to start a sequence, returns all of that sequence, and
then pops the next.
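
As a toy illustration (this is not bzrlib's real topo_sort, just the
shape of the behaviour): parents come out before children, but each new
run starts from an arbitrary popitem(), so which disconnected range
comes first is up to the dict.

def toy_topo_sort(parent_map):
    # Toy illustration only: emit parents before children, starting each
    # run from an arbitrary popitem(), so two disconnected ranges of
    # ancestry come out in whatever order the dict hands them back.
    remaining = dict(parent_map)
    seen = set()
    result = []

    def emit(key):
        if key in seen or key not in parent_map:
            return
        seen.add(key)
        for parent in parent_map[key]:
            emit(parent)
        result.append(key)
        remaining.pop(key, None)

    while remaining:
        key, _ = remaining.popitem()   # arbitrary starting point
        emit(key)
    return result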

...

| OTOH 100 decompresses of very small objects should be able to be made
| very cheap - there is not much setup needed in zlib itself. So I think
| it might make sense to not batch up revision objects - just keep them
| 'loose'.

Sure. Right now we don't delta compress them at all, just gzip/zlib
them. It works fairly well.

|
|> |  - C decompressor (we do string splits on large texts that are not
|> | needed, and parsing is often slow in python)
|>
|> As near as I can tell, you only split_lines(plain[start:end]) for the
|> delta chunk. I haven't specifically profiled that portion very much.
|
| Right, but we have 10MB of bytes, then we pull out the delta. It's still
| a copy that isn't needed - a forward reading C decompressor can
| decompress in-place.

Sure.

|
|> At least for log --short -r-100..-1, we spend 400/470ms in
|> zlib.decompress().
|
| interesting
|
|> So doing the prefix thing might help, but not decompressing the same
|> hunks repeatedly helps more. (This is including my patch to cache the
|> last decompression.) We could play with it a bit more using an LRU cache
|> or something like that if we really wanted.
|
| True.

Except that I tried a full dictionary, and for those 100 items we only
*needed* 5 decompresses. So I would tend towards making the total size a
bit smaller.
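
As a sketch of what I mean (made-up names, not what my patch actually
does), something this small would already have covered those 5
decompresses:

import zlib

class DecompressedGroupCache(object):
    """Tiny LRU cache of decompressed groups (sketch, made-up names).

    Repeated record requests against the same group should not pay for
    zlib.decompress() every time; a handful of entries would already
    have covered the 5 decompresses the log --short run needed.
    """

    def __init__(self, max_entries=5):
        self._max_entries = max_entries
        self._plain = {}       # group_key -> decompressed bytes
        self._order = []       # least recently used first

    def get_plain(self, group_key, compressed_bytes):
        if group_key in self._plain:
            self._order.remove(group_key)
            self._order.append(group_key)
            return self._plain[group_key]
        plain = zlib.decompress(compressed_bytes)
        self._plain[group_key] = plain
        self._order.append(group_key)
        if len(self._order) > self._max_entries:
            oldest = self._order.pop(0)
            del self._plain[oldest]
        return plain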

|
|> Of the rest of the time, 36ms is in get_raw_records, and 10ms is in
|> parse. 16ms being in the get_record_stream function itself.
|>
|>
|> |  - 'strings'/'chunked' output mode to avoid string joining on
|> | decompression
|>
|> You actually do that, up until the point that you join them.
|> I think it is specifically "chunks" though you call it 'lines'. So it
|> would be good to change that variable name.
|
| Well my point is to not string join at all; let the caller string join
| if they need it.
|
| -Rob

Yeah, my point too. And then it also allows the old formats to just
return lines. (They're 'chunks' too.)
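
Something along these lines is the shape I have in mind (hand-waved
API, not real bzrlib code):

def iter_text_chunks(records):
    # Hand-waved sketch of a 'chunked' output mode: yield the list of
    # byte chunks we already hold rather than ''.join()ing them.  An old
    # format that stores lines can yield those directly -- a list of
    # lines is just another list of chunks.
    for record in records:
        yield record.key, record.chunks

def to_fulltext(chunks):
    # Only callers that genuinely need one big string pay for the join.
    return ''.join(chunks)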

I also have some thoughts on some problems with GroupCompress, which I'd
like your feedback on.

1) get_record_stream() only works with full-texts. This is mostly
because a groupcompress delta doesn't make sense out of context, and the
context is all of the texts inserted so far. You could cheat a little
bit, but you at least need all the lines that the delta referenced.

This doesn't matter a lot for local or dumb transport access, but it
doesn't give us something good to put over the wire for bzr+ssh access.
I don't think we want to stream 400MB of data for NEWS on a new checkout.
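
To make the problem concrete (hand-waved sketch; apply_delta is just a
stand-in for the real delta decoder, and the offsets are made up):
extracting any single text means decompressing at least the whole
prefix of the group that its delta can reference.

import zlib

def extract_fulltext(group_z_bytes, delta_start, delta_end, apply_delta):
    # The delta stored at [delta_start:delta_end] can copy from anything
    # inserted into the group before it, so the whole prefix has to be
    # decompressed before the delta means anything.
    plain = zlib.decompress(group_z_bytes)    # ideally only up to delta_end
    delta_bytes = plain[delta_start:delta_end]
    return apply_delta(plain[:delta_start], delta_bytes)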

2) Along the same lines, when you copy from one repository to another,
you extract to fulltexts and then recompress back into a new GC object.
I don't know how to make that work with the idea of garbage collection
during branch/push/pull. We *could* always transmit the whole GC group.
We *could* special-case when all keys in a group are requested. Though
we should at least do a minimal amount of error checking before we
accept a hunk into the new repository. (We may not extract all the
texts, but we could at least unzip the chunk.)
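
Roughly this kind of special case (made-up names, only to show the
branch between the two paths):

import zlib

def copy_group(group, requested_keys, target_repo):
    # If the fetch wants every key in the group, ship the raw compressed
    # block after at least proving it unzips; otherwise fall back to
    # extracting fulltexts and letting the target recompress them into a
    # new group.
    wanted = set(requested_keys)
    have = set(group.keys)
    if have <= wanted:
        zlib.decompress(group.z_bytes)          # minimal sanity check
        target_repo.insert_raw_group(group.keys, group.z_bytes)
    else:
        for key in wanted & have:
            target_repo.insert_fulltext(key, group.extract(key))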

3) Out-of-order insertion also makes it hard to be selective during
push/pull.

Anyway, I'm concerned that when this lands, remote dumb-transport
branching will probably be faster, but local branching will be slower
(because of CPU overhead) and bzr+ssh branching will be slower (because
we either work in full-texts or have to re-delta the transmission).

I think we have something good here, but we need to be careful about how
it is going to impact other pieces.

If we chose to copy whole GC objects around, then I would recommend
making them a bit smaller (10MB instead of 20MB, or something like that).

John
=:->
