Better compression
John Arbash Meinel
john at arbash-meinel.com
Fri Jul 25 22:17:33 BST 2008
Robert Collins wrote:
| On Fri, 2008-07-25 at 15:01 -0500, John Arbash Meinel wrote:
|> -----BEGIN PGP SIGNED MESSAGE-----
|> Hash: SHA1
|>
|> Ian Clatworthy wrote:
|> | Robert Collins wrote:
|> |> On Thu, 2008-07-17 at 16:00 +1000, Rob Weir wrote:
|> |>> On 17 Jul 2008, raindog at macrohmasheen.com wrote:
|> |>>> What level perf increase is seen with this? Do you have any
|> |>> comparisons? Sent from my Verizon Wireless BlackBerry
|> |>> With regards to space, I converted bzr.dev from pack-0.92 to
|> |>> btree+groupcompress:
|> |>>
|> |>> 85664 backup.bzr
|> |>> 40232 .bzr/
|> |>>
|> |>> a saving of over 50%.
|> |
|> | I saw a similar gain on Python last night: 346.5 MB -> 181.4 MB.
|> |
|> |> The compressor would have been getting texts in topological order
|> |> there. It does much better with reverse-topological order, which is
|> |> why I need a new InterRepository object or something... I tested a
|> |> bzr.dev repository and saw more like 30 MB with better ordering in
|> |> use, but it was a custom-hack rather than a reusable thing.
|> |
|> | So I'm wondering what that implies w.r.t. maximising how fast-import
|> | works. If we know that compressing in a forward direction will
|> | generally take more space than compressing going in the other
|> | direction, we'll probably always want to consume the stream *quickly*,
|> | then compress as much as we can during the final pack that fastimport
|> | does, yes?
|> |
|> | So I guess I'm requesting enough flexibility in our API to say
|> | "load quickly" so that fastimport doesn't take forever overly
|> | compressing stuff on a first pass which will only be thrown away.
|> |
|> | Ian C.
|> |
|> |
|>
|> Just to give the current state of it. Doing a plain "bzr branch" into a
|> gc-plain repository dropped the disk-size from 85MB => 41MB. This was
|> with forward topological sorting (instead of reverse).
|>
|> So I went ahead and hacked up some code to force reverse-order
|> insertion. There were a couple things I discovered:
|
| I have generic stuff for this partly done in a bzr branch; I'll publish
| that shortly.
|
|> 1) Changing the GenericRepoFetcher to use full-texts rather than using
|> an adapter was quite an improvement for small projects (like bzrtools).
|>
|> For specific results, doing a normal fetch using an adapter took 40s for
|> bzrtools. Using full-texts takes 6.8s. (That includes all the time to
|> compress the texts in the target.) Of course, branching into a pack-0.92
|> is only 1.8s.
|
| This is because knits are doing less string manipulation to get the
| fulltext out than extracting the fulltext from the in-progress
| compression group.
|
|> Trying to do the same with bzr.dev was *bad*. The specific issue is that
|> we request all the versions of all texts in a single request. And if you
|> set include_delta_closure=True, then it unpacks all the requested texts
|> into full-lines in memory, and then ''.join()s them one by one. (Of
|> course, on the other side I was caching the byte form, so it was
|> requiring 2xTotal Bytes of all versions of all files.) As an example,
|> NEWS by itself is 400MB, and then you double it, and then you do that
|> for all files....
|>
|> Now, KnitVersionedFile._get_content_maps *should* be smart enough to
|> re-use strings where possible, so this shouldn't be a disastrous memory
|> bloat. But converting it to raw bytes and then caching that is.
|>
|> It seems that just doing content_maps causes memory consumption to go up
|> to 446MB. Which is a bit, but not horrible.
|>
|> So I changed the get_bytes_as('fulltext') path to just return the lines,
|> rather than joining them only to split them on the other side.
|>
|> Oddly enough, at this point the memory went from 400MB => 600MB: it would
|> increase to almost 600MB, then drop back to 450MB or so, then increase, then
|> drop back. My best guess is just that we were creating a lot of temp
|> variables, and the memory drop was when the garbage collector kicked in.
|>
|> Either that or it was the GroupCompressor building up the next set of
|> lines before it flushed it out to disk.
|>
|> As for speed... last night when I did "bzr branch" it took 90+ minutes.
|> Now, that was starting with an unpacked repo (with lots of extra junk in
|> the repo.)
|>
|> When I did it today using a repo with only bzr.dev in it, it took 7m16s.
|> Or about 12x faster.
|
| We need to cap the memory usage of get_record_stream on knits. This
| addresses all the issues of bloat (but not those of the lack of
| 'chunked' or 'strings' or whatever we call it in the record stream
| apis). That would be a good thing to move forward.
|
|> 2) I didn't realize, but GroupCompress *does* do cross-text compression.
|> Specifically it is built on the VersionedFiles object, and we are
|> requesting various file ids at the same time. IMO, this is good and bad.
|> Good, in that when you have lots of small texts, you can pack them all
|> together. Bad in that incremental pulls, etc, will be packing your 170K
|> NEWS file in with your tiny foo.py file. We *do* at least topologically
|> sort, which means that by default things will be grouped around their
|> file_id (at least, that is how I understand the tsort.topo_sorted()
|> algorithm, I'd have to poke harder to know for sure.)
|
| :) This is one of the primary goals.
|
|> The other "bad" of this approach, is that it doesn't really have any
|> idea *which* files to mix together. And I'm pretty sure it does it
|> randomly based on dict.popitem(). So that would be a good place that we
|> could tweak for "similarity" between texts.
|
| It would be nice to have the basename used, so that COPYING in two
| different projects could be group compressed together.
|
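Something as simple as sorting the keys by basename before we hand them to
the compressor might get most of that. Rough sketch only; path_for_key() is
a made-up stand-in for however we would map a key back to its path:

import os

def basename_grouped(keys, path_for_key):
    # Sort primarily by basename so e.g. every COPYING lands next to the
    # other COPYING texts, and secondarily by the key itself so versions
    # of the same file stay adjacent within that.
    return sorted(
        keys,
        key=lambda key: (os.path.basename(path_for_key(key)), key))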
|> Oh, another "bad" is that if you have 2 large texts, in a row, it may
|> start putting your "best" tip revision for the second text into the
|> compressor for the previous text. Wasting a bit of space there. But more
|> importantly, if it starts a new compressor soon after, it might have
|> saved more space by flushing the first and starting over with the new
|> text. Again, more places for tweaking the algorithm.
|
| Actually reverse topological is most likely to keep many texts of the
| same fileid together (because it's 50% likely to land in the 50% most
| recent versions on the first probe into the dict). I intend to have knit
| versioned files do a group(fileid); sort(per-fileid), flatten operation
| though, when reverse topo is asked for, to avoid all these potential
| issues.
|
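Roughly what I'd picture for that group/sort/flatten step. This is only a
sketch, not bzrlib code: keys are assumed to be (file_id, revision_id)
tuples, and parent_map maps each key to its parent keys within the same
file.

from collections import defaultdict

def reverse_topo_per_file(keys, parent_map):
    by_file = defaultdict(list)
    for key in keys:
        by_file[key[0]].append(key)             # group(fileid)
    ordered = []
    for file_id in sorted(by_file):
        file_keys = set(by_file[file_id])
        forward = []
        seen = set()
        def visit(key):
            if key in seen:
                return
            seen.add(key)
            for parent in parent_map.get(key, ()):
                if parent in file_keys:         # stay within this file
                    visit(parent)
            forward.append(key)                 # parents before children
        for key in by_file[file_id]:
            visit(key)
        ordered.extend(reversed(forward))       # sort(per-fileid), newest first
    return ordered                              # flatten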
|> 3) Reverse topological is (again) a big win for space: 85MB (pack) =>
|> 41MB (gc forward) => 31MB (gc reverse).
|
| That's close to the size I was predicting in my early tests - thanks for
| hacking this in.
|
|> 4) Something is causing a plain "bzr log --short -r -10..-1" to be
|> slower with the new repository. Specifically for bzr.dev on a fully
|> packed repository on both sides:
|>
|> $ time bzr log dev-gc/bzr.dev
|> real 0m1.372s
|>
|> $ time bzr log dev-packs/bzr.dev
|> real 0m0.499s
|>
|> So we should probably investigate what is happening here.
|
| As you note, the current decompressor is very crude.
|
| Things to improve there that I know of:
| - only read the prefix of a group needed to access a given text (better
| performance on most-recent texts) (needs output to record a prefix
| length per text rather than just the zlib size)
Well, this does get into a bit of zlib stuff, right? Like being able to
say you only need "*this*" much zlib to get your data out. I don't know
if we could cleanly do that without using Z_SYNC_FLUSH. I suppose we
might do something like:
out = zobj.compress(more_data)
if out:
    # We just got data back from the compressor, so readers only need
    # to read up to this point.
    set_current_prefixes()
The only problem is that I'm not really sure that zobj.decompress(out)
is going to return all the data up to that point. It would make *sense*
for it to, but it may not without calling flush.
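Something along these lines might work though, if we Z_SYNC_FLUSH after
each text and record the compressed offset at that point. All the names
here are made up, this is not the real GroupCompress code, and each sync
flush costs a few extra bytes of output:

import zlib

def compress_group(texts):
    zobj = zlib.compressobj()
    out = []
    compressed_len = 0
    prefix_lengths = []
    for text in texts:
        # Z_SYNC_FLUSH guarantees everything fed in so far can be
        # recovered from the bytes emitted so far.
        chunk = zobj.compress(text) + zobj.flush(zlib.Z_SYNC_FLUSH)
        out.append(chunk)
        compressed_len += len(chunk)
        prefix_lengths.append(compressed_len)   # prefix needed for this text
    out.append(zobj.flush())                    # final Z_FINISH for the group
    return ''.join(out), prefix_lengths

def extract_prefix(compressed, prefix_length):
    # Gives back the concatenated plain bytes of every text up to that
    # point; you still need the plain start/end offsets to slice one out.
    return zlib.decompressobj().decompress(compressed[:prefix_length])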
| - calculate the largest prefix needed when given a multiple-text
| request and satisfy the stream from the single group object
That would be nice. I'm not sure how you do this in cases where they are
out-of-order. Like you want to return texts 1 & 2 from hunk A, 3 & 4
from hunk B, and then 5 & 6 back from hunk A again.
If it was 'unordered' it would be obvious, but 'topological'... do you
cache it?
I *did* find that zlib.decompress(2MB) => 10MB was taking 100ms or so
(10 calls, 828ms, so maybe 83ms each?)
There is always the memory-capping question to answer.
I guess the data isn't as close together as I thought, as log -r -100..
does 5 decompresses. But that is certainly better than 100.
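For the out-of-order case above, the simplest thing I can think of (sketch
only, made-up names) is to note which group each requested key lives in and
decompress each group at most once while the stream is being built:

def get_texts(keys, index, read_group):
    # index(key) is assumed to return (group_id, plain_start, plain_end);
    # read_group(group_id) is assumed to return the decompressed group
    # bytes. Both are stand-ins for the real index/storage layer.
    plain_cache = {}
    for key in keys:
        group_id, start, end = index(key)
        if group_id not in plain_cache:
            plain_cache[group_id] = read_group(group_id)
        yield key, plain_cache[group_id][start:end]

That keeps every touched group in memory for the life of the request, so it
answers the ordering question but not the memory-capping one.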
| - C decompressor (we do string splits on large texts that are not
| needed, and parsing is often slow in python)
As near as I can tell, you only split_lines(plain[start:end]) for the
delta chunk. I haven't specifically profiled that portion very much.
At least for log --short -r-100..-1, we spend 400/470ms in
zlib.decompress().
So doing the prefix thing might help, but not decompressing the same
hunks repeatedly helps more. (This is including my patch to cache the
last decompression.) We could play with it a bit more using an LRU cache
or something like that if we really wanted.
Of the rest of the time, 36ms is in get_raw_records, 10ms is in parse,
and 16ms is in the get_record_stream function itself.
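If we wanted something better than caching just the last decompression, a
tiny LRU of decompressed groups would be easy to try. Sketch only, and the
size is untuned:

class DecompressedGroupCache(object):
    """Keep the N most recently used decompressed groups in memory."""

    def __init__(self, max_items=4):
        self._max_items = max_items
        self._cache = {}        # group_key -> plain bytes
        self._order = []        # least recently used first

    def get(self, group_key, decompress):
        if group_key in self._cache:
            self._order.remove(group_key)
        else:
            if len(self._cache) >= self._max_items:
                oldest = self._order.pop(0)
                del self._cache[oldest]
            self._cache[group_key] = decompress()
        self._order.append(group_key)
        return self._cache[group_key]

get() takes a callable, so we only pay for the decompression on a miss.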
| - 'strings'/'chunked' output mode to avoid string joining on
| decompression
You actually do that, up until the point that you join them.
I think it is specifically "chunks" though you call it 'lines'. So it
would be good to change that variable name.
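i.e. something like this at the point where we currently join, so 'chunked'
callers never pay for the big string (illustrative only, not the final API):

def bytes_for(chunks, storage_kind):
    # 'chunked' hands the list straight back; 'fulltext' pays for one join.
    if storage_kind == 'chunked':
        return chunks
    elif storage_kind == 'fulltext':
        return ''.join(chunks)
    raise ValueError('unsupported storage_kind %r' % (storage_kind,))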
| - if we get serial accesses deeper into a single group, read the whole
| group and cache. or something.
|
| -Rob
|
John
=:->