Better compression
John Arbash Meinel
john at arbash-meinel.com
Fri Jul 25 21:01:27 BST 2008
Ian Clatworthy wrote:
| Robert Collins wrote:
|> On Thu, 2008-07-17 at 16:00 +1000, Rob Weir wrote:
|>> On 17 Jul 2008, raindog at macrohmasheen.com wrote:
|>>> What level of perf increase is seen with this? Do you have any
|>>> comparisons? Sent from my Verizon Wireless BlackBerry
|>> With regards to space, I converted bzr.dev from pack-0.92 to
|>> btree+groupcompress:
|>>
|>> 85664 backup.bzr
|>> 40232 .bzr/
|>>
|>> a saving of over 50%.
|
| I saw a similar gain on Python last night: 346.5 MB -> 181.4 MB.
|
|> The compressor would have been getting texts in topological order there.
|> It does much better with reverse-topological order, which is why I need
|> a new InterRepository object or something... I tested a bzr.dev
|> repository and saw more like 30 MB with better ordering in use, but it
|> was a custom hack rather than a reusable thing.
|
| So I'm wondering what that implies w.r.t. maximising how fast-import
| works. If we know that compressing in a forward direction will generally
| take more space than compressing in the other direction, we'll
| probably always want to consume the stream *quickly*, then compress as
| much as we can during the final pack that fastimport does, yes?
|
| So I guess I'm requesting enough flexibility in our API to say
| "load quickly" so that fastimport doesn't take forever overly
| compressing stuff on a first pass which will only be thrown away.
|
| Ian C.
|
|
Just to give the current state of things: doing a plain "bzr branch"
into a gc-plain repository dropped the disk size from 85MB => 41MB.
This was with forward topological sorting (instead of reverse).
So I went ahead and hacked up some code to force reverse-order
insertion. There were a couple of things I discovered:
1) Changing the GenericRepoFetcher to use full-texts rather than an
adapter was quite an improvement for small projects (like bzrtools).
For specific numbers, a normal fetch using an adapter took 40s for
bzrtools, while using full-texts takes 6.8s. (That includes all the
time to compress the texts in the target.) Of course, branching into a
pack-0.92 repository takes only 1.8s.
Trying to do the same with bzr.dev was *bad*. The specific issue is
that we request all the versions of all texts in a single request, and
if you set include_delta_closure=True, it unpacks all the requested
texts into full lines in memory and then ''.join()s them one by one.
(Of course, on the other side I was caching the byte form, so it was
requiring 2x the total bytes of all versions of all files.) As an
example, all the versions of NEWS by themselves come to 400MB, and
then you double it, and then you do that for all files...
Now, KnitVersionedFile._get_content_maps *should* be smart enough to
re-use strings where possible, so the content maps themselves
shouldn't be a disastrous memory bloat. But converting them to raw
bytes and then caching those bytes is.
It seems that just building the content maps pushes memory consumption
up to 446MB, which is a fair amount, but not horrible.
So I changed the get_bytes_as('fulltext') path to just return the
lines, rather than joining them only to split them again on the other
side.
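To make the memory difference concrete, here is a toy illustration
(not the actual code path):

# Toy illustration of the before/after, not the bzrlib code.
def send_as_fulltext(lines):
    # ''.join() creates a second full copy of the text, and if we also
    # cache that byte string we keep both copies alive.
    return ''.join(lines)

def receive_fulltext(text):
    # The receiver immediately splits it back into lines: a third
    # copy, before the joined string can be freed.
    return text.splitlines(True)

def send_as_lines(lines):
    # Handing the lines over directly lets the receiver reuse the very
    # same string objects; nothing gets duplicated.
    return lines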
Oddly enough, at this point memory usage would climb from roughly
400MB to almost 600MB, drop back to 450MB or so, then climb and drop
again. My best guess is that we were creating a lot of temporary
variables, and the drops were when the garbage collector kicked in.
Either that, or it was the GroupCompressor building up the next set of
lines before flushing them out to disk.
As for speed... last night when I did "bzr branch" it took 90+
minutes. Now, that was starting with an unpacked repo (with lots of
extra junk in it). When I did it today using a repo with only bzr.dev
in it, it took 7m16s, or about 12x faster.
2) I didn't realize, but GroupCompress *does* do cross-text compression.
Specifically, it is built on the VersionedFiles object, and we are
requesting various file ids at the same time. IMO, this is good and
bad. Good, in that when you have lots of small texts, you can pack
them all together. Bad, in that incremental pulls, etc., will be
packing your 170K NEWS file in with your tiny foo.py file. We *do* at
least topologically sort, which means that by default things will be
grouped around their file_id (at least, that is how I understand the
tsort.topo_sorted() algorithm; I'd have to poke harder to know for
sure).
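To illustrate that grouping, a toy walk (not the real tsort code, and
exactly how the keys interleave depends on where the sort starts):

# Toy example, not bzrlib.tsort. Keys are (file_id, revision_id) and a
# key's parents are earlier revisions of the *same* file, so emitting
# any key first emits that file's whole ancestry, which is why texts
# of one file tend to come out next to each other.
graph = {
    ('NEWS', 'rev-3'): [('NEWS', 'rev-2')],
    ('NEWS', 'rev-2'): [('NEWS', 'rev-1')],
    ('NEWS', 'rev-1'): [],
    ('foo.py', 'rev-2'): [('foo.py', 'rev-1')],
    ('foo.py', 'rev-1'): [],
}

def toy_topo_sorted(graph):
    order, seen = [], set()
    def visit(key):
        if key in seen:
            return
        seen.add(key)
        for parent in graph.get(key, ()):
            visit(parent)
        order.append(key)
    for key in graph:
        visit(key)
    return order

print(toy_topo_sorted(graph))
# e.g. [('NEWS', 'rev-1'), ('NEWS', 'rev-2'), ('NEWS', 'rev-3'),
#       ('foo.py', 'rev-1'), ('foo.py', 'rev-2')]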
The other "bad" of this approach, is that it doesn't really have any
idea *which* files to mix together. And I'm pretty sure it does it
randomly based on dict.popitem(). So that would be a good place that we
could tweak for "similarity" between texts.
Oh, another "bad" is that if you have 2 large texts, in a row, it may
start putting your "best" tip revision for the second text into the
compressor for the previous text. Wasting a bit of space there. But more
importantly, if it starts a new compressor soon after, it might have
saved more space by flushing the first and starting over with the new
text. Again, more places for tweaking the algorithm.
3) Reverse topological order is (again) a big win for space: we go
from 85MB (pack) => 41MB (gc forward) => 31MB (gc reverse).
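Getting that order is basically just reversing the topological sort
before insertion (sketch only, reusing the toy topo sort above):

# Sketch only. Forward order inserts oldest-first, so each group leads
# with an old full text; reversing it puts the tip in first, and the
# older versions become deltas against newer, larger texts, which is
# presumably why it packs down better.
forward_order = toy_topo_sorted(graph)          # oldest -> newest
reverse_order = list(reversed(forward_order))   # newest -> oldest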
4) Something is causing a plain "bzr log --short -r -10..-1" to be
slower with the new repository. Specifically for bzr.dev on a fully
packed repository on both sides:
$ time bzr log dev-gc/bzr.dev
real 0m1.372s
$ time bzr log dev-packs/bzr.dev
real 0m0.499s
So we should probably investigate what is happening here.
John
=:->