groupcompress+chk some more permutations

John Arbash Meinel john at arbash-meinel.com
Thu Mar 5 20:03:13 GMT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I've been doing a few more permutations with groupcompress and chk
inventories, and I figured I'd try to sum up some of what I've found. I
have numbers to back up pretty much all of this, but the email is
already really long...

Summary

I think it is worth exploring lzma compression more. I think bigger leaf
pages are a net win, though it needs more benchmarking. I think getting
rid of the labels for CHK pages makes sense, but we can leave them in
the text pages, etc. With all of these updates (lzma, bigpage, nolabels)
I finally have a bzr repository that is smaller than the --window=200
git version (7.6MB versus 8.1MB). So at this point, there is still a bit
of exploration to do, but it mostly comes down to us deciding where we
want the size/speed tradeoffs to lie, as well as the cost of an external
dependency, etc.


Labels/No Labels

1) I've implemented the code to move the label/sha1/etc block to the
start of the group, and run a separate compression over that text.

2) The really nice part of this is that it makes the data fully
self-describing: you can take a compressed groupcompress blob and read
everything back out of it. (The index is fully redundant.)

3) The downside is that it adds a fairly large overhead to the stored
data, somewhere between 30 and 50 bytes per record. This isn't a big
deal for Texts, where the average record size is approx 1kB. However,
for Inventory pages the average compressed size is about 100 bytes, so
adding 30 bytes of overhead is 30% bloat. Even more significant, the chk
pages are addressed by a sha1, which means that they are already
self-describing.
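
Very roughly, the shape of what I'm doing looks like this (a toy sketch
in the spirit of the change, not the real on-disk format):

import zlib

def build_group(records):
    """records: list of (key, sha1_hex, text), all byte strings.

    The label block is compressed separately from the content block, so
    a group is readable without the index (which becomes redundant).
    """
    labels = []
    content = []
    offset = 0
    for key, sha1_hex, text in records:
        labels.append(b'%s %s %d %d\n' % (key, sha1_hex, offset, len(text)))
        content.append(text)
        offset += len(text)
    # A real group also needs a tiny header recording the two block
    # lengths, so a reader can split them apart again.
    return zlib.compress(b''.join(labels)), zlib.compress(b''.join(content))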


Alternate index

There is also a potential tradeoff between what we need to put in the
index versus what we keep in the group itself. CHK indexes don't
compress particularly well. If we assume optimal compression, a sha1 is
20 bytes, plus 4+ bytes for the offset, 4 bytes for the length, and then
4*2 bytes for the group offset and node length. If we store more of that
data inside the group header, we can get rid of the last two (group
offset and node length). So at a minimum, an index needs something like
32 bytes per record, which is, interestingly, approximately the same
overhead we get from putting that data into the group header.

I've also talked with Martin about the possibility of an index that
writes less than the full 20 bytes of a sha1 sum, and we just handle
collisions by resolving them in the target data. If you figure that sha1
sums are well distributed, then even just 4 bytes gives you a 2^32
keyspace. The equation I found for the chance of a collision (the
standard birthday-problem approximation) with N keys and a b-bit prefix
is:
  P = 1 - e^(-N^2 / 2^(b+1))

So for 32 bits and 1 million keys, the chance is very high (high enough
that it gets rounded to 1). Let's say, though, that we are okay with a 1%
chance of collision, since we have a fallback anyway:

  1 - e^(-N^2 / 2^(b+1)) = 0.01  =>  e^(-N^2 / 2^(b+1)) = 0.99
  =>  N^2 / 2^(b+1) = -ln(0.99) ~= 0.01
  =>  2^(b+1) ~= (10^6)^2 / 0.01 = 10^14
  =>  b ~= log2(10^14) - 1 ~= 46

So b ~= 46 bits: a 6-byte prefix keeps the chance of collision with 1M
keys under 1%. (A 5-byte prefix pushes it to roughly 1 in 3, which may
still be tolerable given the fallback.)

If we want to support 1 billion keys at the same 1% threshold, that
grows to ~66 bits; even an 8-byte (64-bit) prefix keeps the chance under
3%, which is still a lot shorter than the 20 bytes we would be using.
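
To double-check these numbers, the formula is easy to evaluate directly
(a throwaway sketch; the helper names are mine):

from math import ceil, exp, log

def collision_chance(num_keys, bits):
    # Birthday-problem approximation from above: 1 - e^(-N^2 / 2^(b+1))
    return 1 - exp(-(num_keys ** 2) / 2.0 ** (bits + 1))

def bits_needed(num_keys, max_chance=0.01):
    # Smallest b with 1 - e^(-N^2 / 2^(b+1)) <= max_chance
    return int(ceil(log(num_keys ** 2 / -log(1 - max_chance), 2) - 1))

print(collision_chance(10 ** 6, 40))   # ~0.37   (5-byte prefix, 1M keys)
print(collision_chance(10 ** 6, 48))   # ~0.0018 (6-byte prefix, 1M keys)
print(bits_needed(10 ** 6))            # 46
print(bits_needed(10 ** 9))            # 66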

To use a partial index like this, you would just allow a given prefix to
point to multiple locations in the file (i.e. be repeated), and then
verify that the sha1 matches when you extract (or try each candidate and
use the one that matches).
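
The lookup side of that is simple enough; something like this (sketch
only, ignoring group decompression, and the names are made up rather
than any real index API):

import hashlib

def get_record(index, prefix_len, raw_bytes, key_sha1):
    # index: dict mapping a short sha1 prefix -> list of (start, length)
    # key_sha1: the full 20-byte digest we were asked for
    for start, length in index.get(key_sha1[:prefix_len], []):
        text = raw_bytes[start:start + length]
        # Several records may share the short prefix; the full sha1 of
        # the extracted text tells us which one we really wanted.
        if hashlib.sha1(text).digest() == key_sha1:
            return text
    raise KeyError(key_sha1)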

If you assume fixed-width fields, you could do 8 bytes for the sha1
prefix, 8 bytes for an offset, 4 bytes for the group length, and 4 bytes
for the start in the group, or 24 bytes per entry. That is about 2/3 of
our current 35 bytes per entry, which means a 24MB index becomes 16.4MB,
still pretty big in the end. If we made some of these fields
variable-length, we could get back a lot of that space for small
indexes, and we could certainly cap the max size of a group at 16MB, so
we could use a 3-byte group length and a 3-byte offset in the group
(roughly another 10% savings).

If we want to go as small as is 'reasonable': 5 bytes for the sha1
prefix, 5 bytes for the offset (allows 1TB pack files), a 3-byte group
length, and a 3-byte offset in the group, or 16 bytes per entry. At that
size the (as converted) mysql .cix index would be 10.9MB.
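
For reference, packing that 16-byte layout is trivial (sketch, assuming
big-endian fixed-width fields):

import struct

def pack_entry(sha1_digest, pack_offset, group_length, offset_in_group):
    # 5-byte sha1 prefix, 5-byte pack offset (up to 1TB), 3-byte group
    # length and 3-byte offset in the group (both capped at 16MB).
    if pack_offset >= 1 << 40 or group_length >= 1 << 24 \
            or offset_in_group >= 1 << 24:
        raise ValueError('field out of range')
    return (sha1_digest[:5]
            + struct.pack('>Q', pack_offset)[3:]       # low-order 5 bytes
            + struct.pack('>I', group_length)[1:]      # low-order 3 bytes
            + struct.pack('>I', offset_in_group)[1:])  # low-order 3 bytes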

That 10.9MB is okay, but not great. In my head, ideally an index would
be only about 10% of the actual data, and at ~100 bytes per chk entry
that means 10 bytes per index entry.


Big Page

I also played around with changing the size of a LeafNode from a 4k page
to a 64k page. The idea is that in a 4k page we fit about 14 inventory
records, so at 64k we should be close to 255 records (similar size to
the internal nodes), which should help to decrease the 'height' of the tree.

It did so beautifully for my 1k rev mysql conversion (which has ~3k
files at the end).

This brought the total number of inventory records from 12k down to 8k
(for 1k revisions, and 7.7k files). And saved approximately 300kB from
the inventory records, and another 100kB from the .cix. (Note that
removing the labels also removes approx 300kB from the inventory
records, and these savings 'stack'.)

I think the primary space saving comes from removing the extra internal
node, which costs us a minimum of 20 sha bytes per record, plus some
other overhead. So changing a file in a leaf node costs a file_id,
sha1sum, size, and revision_id, plus a 20-byte sha sum for every
internal page back to the root.
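
Back-of-the-envelope, using a ~200-byte leaf entry and 20-byte sha
references (my rough numbers, just to illustrate):

def change_cost(internal_levels, leaf_entry=200, sha_ref=20):
    # One rewritten leaf entry, plus one new sha reference in each
    # internal page on the path back to the root (root included).
    return leaf_entry + internal_levels * sha_ref

print(change_cost(2))  # 240: root -> internal -> leaf (4k leaves)
print(change_cost(1))  # 220: root -> leaf (64k leaves, one level fewer)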

I'm seriously considering whether we want a more fixed structure, with a
single root node and then leaf pages to fit. I think it could simplify a
lot of the internal code and give us good compression. It also makes it
easier to come up with a canonical representation, and it still gives us
'log' scaling of the effort to change the tree. I guess the biggest
problem is the 'rebalancing' effect: we don't want going from 10
internal references to 11 to force rebalancing all 11 leaf nodes.

We could handle some of that by having fixed thresholds: say a 16-way
fan-out, a 256-way fan-out, and then a 4096-way fan-out.
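
Something like the following would do it (entirely illustrative; the
thresholds are just the numbers above):

def fan_out_for(num_entries, entries_per_leaf=255):
    # Keep the root a single page: 16-way fan-out for small trees,
    # 256-way for medium ones, 4096-way for very large ones.
    leaves_needed = -(-num_entries // entries_per_leaf)  # ceiling divide
    for fan_out in (16, 256, 4096):
        if leaves_needed <= fan_out:
            return fan_out
    return 4096

print(fan_out_for(3000))     # 16
print(fan_out_for(50000))    # 256
print(fan_out_for(1000000))  # 4096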

An internal page is potentially a 20-byte sha plus 1-2 bytes for the
'prefix', and any decent byte-based delta can take that down to the
20-byte sha 99% of the time. (Our current deltas should be cutting it
down to the 40-byte hex sha without real difficulty.)


The nice thing is that in a big tree, the *compressed* size for changing
a single file would still be the 20-byte sha in the internal node, plus
however many bytes change in that one leaf node, though the uncompressed
content change is certainly significantly different.

We could try to balance the leaves versus the root based on uncompressed
size. If a reference is 25 bytes and a leaf entry is 200 bytes, then you
would want roughly 1/10th as many entries in a leaf as you have in the
internal node. We are actually fairly close to that with 4k pages, but
only when the leaves are close to full. Because we allow extra
splitting, we are often closer to 1 record per leaf, which is nice for
the uncompressed size of updates, but is actually worse for compressed
size.

Ultimately, we'd need some benchmarks to see what it would do to
'commit' time, 'update', etc. But having smaller structures on disk
leads to better network performance...


LZMA compression

I tried just replacing 'zlib.compress()' with 'pylzma.compress()', with
some mixed results.
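
For reference, the experiment is basically just this sort of swap
(sketch; the helper names are mine, not the real bzr code, and pylzma is
of course an external dependency):

import zlib

try:
    import pylzma
    _compress, _decompress = pylzma.compress, pylzma.decompress
except ImportError:
    # Fall back to zlib when pylzma is not installed.
    _compress, _decompress = zlib.compress, zlib.decompress

def compress_block(chunks):
    # chunks: the raw byte chunks making up one groupcompress block.
    return _compress(b''.join(chunks))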

1) The final packed size is very good, and potentially we could do even
better if we start tweaking the group size based on using lzma. My
best-case no-labels, big-page zlib compression was:

$ gc-no-labels big-page
9.7M    nolabel_gc255_big_slow/
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       636 KiB   6%     1043
Inventories:    34762 KiB   3%       947 KiB  10%     8578
Texts:         882272 KiB  95%      7727 KiB  82%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         921025 KiB 100%      9310 KiB 100%    16847

Compared with the same no-label big-page lzma compression:

$ nolabel_mini_len_lzma_gc255_big
7.6M    nolabel_mini_len_lzma_gc255_big/
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       481 KiB   6%     1043
Inventories:    34762 KiB   3%       748 KiB  10%     8578
Texts:         882272 KiB  95%      5885 KiB  82%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         921025 KiB 100%      7115 KiB 100%    16847

2) Extraction time is also quite decent. 8s versus 6.5s. So we pay ~25%
slower extraction for 30% better compression.

3) Adding labels allows us to avoid some SHA1 computations (extract in
5.2s), but costs us 8.5% compressed size:
$ time wbzr repository-details label_mini_len_lzma_gc255_big
Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       518 KiB   6%     1043
Inventories:    34762 KiB   3%       998 KiB  12%     8578
Texts:         882272 KiB  95%      6202 KiB  80%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         921025 KiB 100%      7718 KiB 100%    16847

(on my machine, the time to compute the sha1 of 921MB is close to 3s).

4) The time to 'bzr pack' was reasonably close, I think it was 32s
versus 45s. So here we pay roughly 40% more time during pack.

5) The biggest downside was the 'non-optimal' times when converting:
approx 14min to convert versus 3min with zlib. I'm guessing the main
issue is that conversion copies things in 'topological' and/or
'unordered' order, and AFAICT doesn't even try to group file copies by
file-id. That means our group-compress delta code is doing a poor job,
and lzma spends a lot of time trying to pick up the slack.

We may be able to tweak something like this by detecting what sort of
operation is going on, and decreasing the lzma effort during
non-'gc-optimal' fetches.
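
Something along these lines, say (hypothetical sketch; knowing whether a
fetch is 'gc-optimal' would have to be plumbed through from the fetch
code):

import zlib

def compressor_for_fetch(is_gc_optimal):
    # When the stream is not in gc-optimal order the delta compressor
    # does a poor job anyway, so don't spend lzma-level effort on it;
    # a later 'bzr pack' can recompress optimally.
    if is_gc_optimal:
        try:
            import pylzma
            return pylzma.compress
        except ImportError:
            pass
    return zlib.compress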

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmwMAEACgkQJdeBCYSNAAP6UwCfcGH++cuN4wivvKX7EdUWvQ1Z
9x4AoIpiEwb4+3ndZhFIzQdPp5nWzU/T
=6Yo/
-----END PGP SIGNATURE-----


