Better compression
Robert Collins
robertc at robertcollins.net
Fri Jul 25 22:01:50 BST 2008
On Fri, 2008-07-25 at 15:01 -0500, John Arbash Meinel wrote:
> Ian Clatworthy wrote:
> | Robert Collins wrote:
> |> On Thu, 2008-07-17 at 16:00 +1000, Rob Weir wrote:
> |>> On 17 Jul 2008, raindog at macrohmasheen.com wrote:
> |>>> What level perf increase is seen with this? Do you have any
> |>>> comparisons? Sent from my Verizon Wireless BlackBerry
> |>> With regards to space, I converted bzr.dev from pack-0.92 to
> |>> btree+groupcompress:
> |>>
> |>> 85664 backup.bzr
> |>> 40232 .bzr/
> |>>
> |>> a saving of over 50%.
> |
> | I saw a similar gain on Python last night: 346.5 MB -> 181.4 MB.
> |
> |> The compressor would have been getting texts in topological order there.
> |> It does much better with reverse-topological order, which is why I need
> |> a new InterRepository object or something... I tested a bzr.dev
> |> repository and saw more like 30 MB with better ordering in use, but it
> |> was a custom-hack rather than a reusable thing.
> |
> | So I'm wondering what that implies w.r.t. optimising how fast-import
> | works. If we know that compressing in the forward direction will
> | generally take more space than compressing in the other direction, we'll
> | probably always want to consume the stream *quickly*, then compress as
> | much as we can during the final pack that fastimport does, yes?
> |
> | So I guess I'm requesting enough flexibility in our API to say
> | "load quickly", so that fastimport doesn't take forever over-compressing
> | stuff on a first pass whose output will only be thrown away anyway.
> |
> | Ian C.
> |
> |
>
> Just to give the current state of things: doing a plain "bzr branch" into a
> gc-plain repository dropped the disk size from 85MB => 41MB. This was
> with forward topological sorting (instead of reverse).
>
> So I went ahead and hacked up some code to force reverse-order
> insertion. There were a couple of things I discovered:
I have generic stuff for this partly done in a bzr branch; I'll publish
that shortly.
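For illustration, the core of forcing reverse-order insertion can be
sketched like this (self-contained toy code, not the bzrlib
implementation; the parent-map shape is an assumption): compute a
topological order from a parent map and reverse it, so the newest texts
hit the compressor first.

    def reverse_topo_order(parent_map):
        # parent_map maps each key to a tuple of its parent keys.
        order = []
        visited = set()
        def visit(key):
            if key in visited:
                return
            visited.add(key)
            for parent in parent_map.get(key, ()):
                if parent in parent_map:
                    visit(parent)
            order.append(key)          # parents first, child last
        for key in parent_map:
            visit(key)
        order.reverse()                # newest-first == reverse topological
        return order

    parent_map = {'rev-3': ('rev-2',), 'rev-2': ('rev-1',), 'rev-1': ()}
    print(reverse_topo_order(parent_map))  # ['rev-3', 'rev-2', 'rev-1']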
> 1) Changing the GenericRepoFetcher to use full-texts rather than using
> an adapter was quite an improvement for small projects (like bzrtools).
>
> For specific results, doing a normal fetch using an adapter took 40s for
> bzrtools. Using full-texts takes 6.8s. (That includes all the time to
> compress the texts in the target.) Of course, branching into a pack-0.92
> repository takes only 1.8s.
This is because knits do less string manipulation to get a fulltext out
than is needed to extract a fulltext from the in-progress compression
group.
> Trying to do the same with bzr.dev was *bad*. The specific issue is that
> we request all the versions of all texts in a single request. And if you
> set include_delta_closure=True, then it unpacks all the requested texts
> into full lines in memory, and then ''.join()s them one by one. (Of
> course, on the other side I was caching the byte form, so it was
> requiring 2x the total bytes of all versions of all files.) As an example,
> all the versions of NEWS by themselves come to 400MB, and then you double
> it, and then you do that for every file....
>
> Now, KnitVersionedFile._get_content_maps *should* be smart enough to
> re-use strings where possible, so this shouldn't be a disastrous memory
> bloat. But converting it to raw bytes and then caching that is.
>
> It seems that just building the content maps causes memory consumption to
> go up to 446MB. Which is a fair bit, but not horrible.
>
> So I changed the get_bytes_as('fulltext') path to just return the lines,
> rather than joining them only to split them on the other side.
>
> Oddly enough, at this point memory use went from 400MB => 600MB: it would
> increase to almost 600MB, then drop back to 450MB or so, then increase
> again, then drop back. My best guess is just that we were creating a lot
> of temporary variables, and the memory drop was when the garbage collector
> kicked in.
>
> Either that or it was the GroupCompressor building up the next set of
> lines before it flushed it out to disk.
>
> As for speed... last night when I did "bzr branch" it took 90+ minutes.
> Now, that was starting with an unpacked repo (with lots of extra junk in
> the repo.)
>
> When I did it today using a repo with only bzr.dev in it, it took 7m16s.
> Or about 12x faster.
We need to cap the memory usage of get_record_stream on knits. That
addresses all of the bloat issues (though not the lack of a 'chunked' or
'strings' mode, or whatever we end up calling it, in the record stream
APIs). That would be a good thing to move forward on.
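One way to cap that memory, sketched with a made-up helper (this is not
an existing bzrlib API): split one huge text request into batches whose
estimated uncompressed size stays under a budget, and only materialise
the delta closure for one batch at a time.

    def batched_keys(keys, size_of, max_bytes=50 * 1024 * 1024):
        # size_of(key) returns an estimated uncompressed size in bytes.
        batch, total = [], 0
        for key in keys:
            batch.append(key)
            total += size_of(key)
            if total >= max_bytes:
                yield batch
                batch, total = [], 0
        if batch:
            yield batch

    # Usage sketch (assuming something like the existing record stream):
    # for batch in batched_keys(all_text_keys, estimated_size):
    #     stream = source_texts.get_record_stream(batch, 'topological', True)
    #     ...insert the batch into the target, then let it be freed...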
> 2) I didn't realize, but GroupCompress *does* do cross-text compression.
> Specifically it is built on the VersionedFiles object, and we are
> requesting various file ids at the same time. IMO, this is good and bad.
> Good, in that when you have lots of small texts, you can pack them all
> together. Bad in that incremental pulls, etc., will be packing your 170K
> NEWS file in with your tiny foo.py file. We *do* at least topologically
> sort, which means that by default things will be grouped around their
> file_id (at least, that is how I understand the tsort.topo_sorted()
> algorithm; I'd have to poke harder to know for sure).
:) This is one of the primary goals.
> The other "bad" of this approach, is that it doesn't really have any
> idea *which* files to mix together. And I'm pretty sure it does it
> randomly based on dict.popitem(). So that would be a good place that we
> could tweak for "similarity" between texts.
It would be nice to have the basename used, so that COPYING in two
different projects could be group compressed together.
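For illustration, a basename-aware grouping key could look roughly like
this (the (file_id, revision_id) key shape and the id-to-path mapping
are assumptions, not existing code):

    import os

    def basename_grouping_key(key, id_to_path):
        file_id, revision_id = key
        path = id_to_path.get(file_id, file_id)
        return (os.path.basename(path), file_id, revision_id)

    keys = [('f-1', 'rev-9'), ('f-2', 'rev-3'), ('f-3', 'rev-7')]
    id_to_path = {'f-1': 'projectA/COPYING',
                  'f-2': 'projectB/COPYING',
                  'f-3': 'projectA/setup.py'}
    keys.sort(key=lambda k: basename_grouping_key(k, id_to_path))
    # Both COPYING texts now sort next to each other, so they can land
    # in the same compression group.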
> Oh, another "bad" is that if you have 2 large texts, in a row, it may
> start putting your "best" tip revision for the second text into the
> compressor for the previous text. Wasting a bit of space there. But more
> importantly, if it starts a new compressor soon after, it might have
> saved more space by flushing the first and starting over with the new
> text. Again, more places for tweaking the algorithm.
Actually, reverse topological is most likely to keep many texts of the
same fileid together (because it's 50% likely to land in the 50% most
recent versions on the first probe into the dict). When reverse topo is
asked for, though, I intend to have knit versioned files do a
group(fileid); sort(per-fileid); flatten operation to avoid all these
potential issues.
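A toy sketch of that group(fileid); sort(per-fileid); flatten step
(hypothetical (file_id, revision_id) keys; it assumes the incoming list
is already reverse-topological overall, so a stable regrouping preserves
newest-first order within each file):

    def group_by_fileid(reverse_topo_keys):
        buckets = {}
        order = []                        # file_ids in first-seen order
        for key in reverse_topo_keys:
            file_id = key[0]
            if file_id not in buckets:
                buckets[file_id] = []
                order.append(file_id)
            buckets[file_id].append(key)  # stays newest-first per file
        flattened = []
        for file_id in order:
            flattened.extend(buckets[file_id])
        return flattened

    keys = [('news', 'r3'), ('foo.py', 'r3'), ('news', 'r2'),
            ('foo.py', 'r1'), ('news', 'r1')]
    print(group_by_fileid(keys))
    # [('news', 'r3'), ('news', 'r2'), ('news', 'r1'),
    #  ('foo.py', 'r3'), ('foo.py', 'r1')]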
> 3) Reverse topological is (again) a big win for space: it goes from 85MB
> (pack) => 41MB (gc forward) => 31MB (gc reverse).
That's close to the size I was predicting in my early tests - thanks for
hacking this in.
> 4) Something is causing a plain "bzr log --short -r -10..-1" to be
> slower with the new repository. Specifically for bzr.dev on a fully
> packed repository on both sides:
>
> $ time bzr log dev-gc/bzr.dev
> real 0m1.372s
>
> $ time bzr log dev-packs/bzr.dev
> real 0m0.499s
>
> So we should probably investigate what is happening here.
As you note, the current decompressor is very crude.
Things to improve there that I know of:
- only read the prefix of a group needed to access a given text (better
  performance on most-recent texts) (needs the output to record a prefix
  length per text rather than just the zlib size; see the sketch after
  this list)
- calculate the largest prefix needed when given a multiple-text
request and satisfy the stream from the single group object
- C decompressor (we do string splits on large texts that are not
needed, and parsing is often slow in python)
- 'strings'/'chunked' output mode to avoid string joining on
decompression
- if we get serial accesses deeper into a single group, read the whole
  group and cache it, or something along those lines.
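For the first item, a rough sketch of prefix-only decompression using
zlib's incremental decompressor (it assumes the index records how many
uncompressed bytes each text needs, which the current format does not
yet do):

    import zlib

    def read_prefix(compressed_group, needed_bytes):
        # Stop feeding the decompressor once enough output has been produced.
        d = zlib.decompressobj()
        out, produced, pos = [], 0, 0
        chunk_size = 64 * 1024
        while produced < needed_bytes and pos < len(compressed_group):
            chunk = compressed_group[pos:pos + chunk_size]
            pos += chunk_size
            data = d.decompress(chunk, needed_bytes - produced)
            out.append(data)
            produced += len(data)
        return b''.join(out)[:needed_bytes]

    group = zlib.compress(b'text-one\n' * 1000 + b'text-two\n' * 1000)
    # The first text is the first 9000 uncompressed bytes; only that much
    # gets decompressed.
    print(len(read_prefix(group, 9000)))  # 9000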
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.