Better compression

Robert Collins robertc at robertcollins.net
Fri Jul 25 23:06:41 BST 2008


On Fri, 2008-07-25 at 16:17 -0500, John Arbash Meinel wrote:

BTW, you might consider quoting just a tad less :)


> | Things to improve there that I know of:
> |  - only read the prefix of a group needed to access a given text
> |    (better performance on most-recent texts) (needs output to record
> |    a prefix length per text rather than just the zlib size)
> 
> Well, this does get into a bit of zlib stuff, right? Like being able to
> say you only need "*this*" much zlib to get your data out. I don't know
> if we could cleanly do that without using Z_SYNC_FLUSH. I suppose we
> might do something like:
> 
> out = zobj.compress(more_data)
> if out:
>     # We just got data back from the compressor, so readers only
>     # need to read to this point.
>     set_current_prefixes()
> 
> The only problem is that I'm not really sure that zobj.decompress(out)
> is going to return all the data up to that point. It would make *sense*
> for it to, but it may not without calling flush.

flush will drop the compression ratio; and without a flush there is no
guarantee that what has been output so far is enough (there could be a
very long dictionary match pending, with zlib waiting to see what the
next match will be). I think that adding 40K to the current zlib output
position is safe; then clamp the offsets once the zlib stream is
finished.
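Roughly this, as an untested sketch (the 40K slack is my guess at
zlib's worst-case buffered match, and the names are made up):

  import zlib

  SLACK = 40 * 1024  # assumed upper bound on data zlib may still buffer

  def compress_group(texts):
      # Record, per text, a compressed-prefix length that should be
      # enough to recover it without ever flushing the compressor.
      compressor = zlib.compressobj()
      chunks = []
      emitted = 0
      prefix_ends = []
      for text in texts:
          out = compressor.compress(text)
          emitted += len(out)
          chunks.append(out)
          prefix_ends.append(emitted + SLACK)
      tail = compressor.flush()
      emitted += len(tail)
      chunks.append(tail)
      # Clamp: no prefix can extend past the finished group.
      prefix_ends = [min(end, emitted) for end in prefix_ends]
      return b''.join(chunks), prefix_ends

A reader then feeds only the first prefix_ends[i] bytes of the group to
a zlib.decompressobj(), which should yield everything up to and
including text i without touching the rest of the group.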

> 
> |  - calculate the largest prefix needed when given a multiple-text
> | request and satisfy the stream from the single group object
> 
> That would be nice. I'm not sure how you do this in cases where they are
> out-of-order. Like you want to return texts 1 & 2 from hunk A, 3 & 4
> from hunk B, and then 5 & 6 back from hunk A again.
>
> If it was 'unordered' it would be obvious, but 'topological'... do you
> cache it?

Topological requests will tend to land in the same groups, because
texts are also grouped topologically on insert (a reversed topological
order is still topological :)). We probably want a 'topological within
group as much as possible' sort.
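A rough sketch of that sort (group_of and topo_position are
hypothetical helpers mapping a key to its group and to its position in
a plain topological sort):

  def sorted_within_groups(keys, group_of, topo_position):
      # Rank each group by the earliest topological position of any
      # requested key in it, then sort so group members stay adjacent
      # while remaining topological within the group.
      group_rank = {}
      for key in keys:
          pos = topo_position(key)
          group = group_of(key)
          if group not in group_rank or pos < group_rank[group]:
              group_rank[group] = pos
      return sorted(
          keys, key=lambda k: (group_rank[group_of(k)], topo_position(k)))

This leans on the groups themselves having been filled in topological
order; it is 'as much as possible', not a strict guarantee.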

> I *did* find that zlib.decompress(2MB) => 10MB was taking 100ms or so
> (10 calls, 828ms, so maybe 83ms each?)
> 
> There is always the memory capped question to answer.
> 
> I guess the data isn't as close together as I thought, as log -r -100..
> does 5 decompresses. But that is certainly better than 100.

OTOH 100 decompresses of very small objects should be very cheap -
there is not much per-call setup in zlib itself. So I think it might
make sense to not batch up revision objects - just keep them
'loose'.
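A quick way to sanity-check that claim:

  import timeit, zlib

  blob = zlib.compress(b'x' * 200)  # a small 'loose' revision object
  # 100 independent decompress calls; per-call setup should be tiny.
  print(timeit.timeit(lambda: zlib.decompress(blob), number=100))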

> |  - C decompressor (we do string splits on large texts that are not
> | needed, and parsing is often slow in python)
> 
> As near as I can tell, you only split_lines(plain[start:end]) for the
> delta chunk. I haven't specifically profiled that portion very much.

Right, but we have 10MB of bytes, and then we pull out the delta. It's
still a copy that isn't needed - a forward-reading C decompressor can
decompress in place.
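Even staying in Python, the intermediate copy can be avoided by handing
the parser a view rather than a slice (a sketch of the idea only; the
real win is doing this inside the decompressor):

  def delta_region(plain, start, end):
      # memoryview slicing references the existing 10MB buffer instead
      # of copying plain[start:end] into a new string.
      return memoryview(plain)[start:end]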

> At least for log --short -r-100..-1, we spend 400/470ms in
> zlib.decompress().

Interesting.

> So doing the prefix thing might help, but not decompressing the same
> hunks repeatedly helps more. (This is including my patch to cache the
> last decompression.) We could play with it a bit more using an LRU cache
> or something like that if we really wanted.

True.
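If we wanted to, something like this would do (a sketch; keyed however
the group reads are keyed, e.g. by (start, length)):

  from collections import OrderedDict

  class DecompressCache(object):
      # Tiny LRU cache so repeated extractions from the same group
      # skip zlib.decompress entirely.

      def __init__(self, max_items=4):
          self._max_items = max_items
          self._cache = OrderedDict()

      def get(self, key, decompress):
          # decompress is a callable producing the plain bytes on a miss.
          if key in self._cache:
              plain = self._cache.pop(key)
              self._cache[key] = plain  # re-insert: now most recently used
              return plain
          plain = decompress()
          self._cache[key] = plain
          if len(self._cache) > self._max_items:
              self._cache.popitem(last=False)  # evict least recently used
          return plain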

> Of the rest of the time, 36ms is in get_raw_records, and 10ms is in
> parse. 16ms being in the get_record_stream function itself.
> 
> 
> |  - 'strings'/'chunked' output mode to avoid string joining on
> | decompression
> 
> You actually do that, up until the point that you join them.
> I think it is specifically "chunks", though you call it 'lines'. So it
> would be good to change that variable name.

Well, my point is to not string join at all; let the caller string join
if they need it.
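i.e. something shaped like this (illustrative names, not the exact
bzrlib API):

  def get_text(self, storage_kind):
      if storage_kind == 'chunked':
          return self._chunks             # hand back the list, no join
      elif storage_kind == 'fulltext':
          return ''.join(self._chunks)    # pay for the join only on demand
      raise ValueError('unknown storage kind %r' % (storage_kind,))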

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.