groupcompress extraction 10x faster

John Arbash Meinel <john@arbash-meinel.com>
Thu Feb 19 20:52:37 GMT 2009


John Arbash Meinel wrote:
...

> As it is now, because of the semi-random ordering, gc actually ends up a
> net loss for my test of "mysql-5.1 -r525".
> 
> gc+chk255:
> Commits: 1043
>                       Raw    %    Compressed    %  Objects
> Revisions:       3990 KiB   0%       826 KiB   2%     1043
> Inventories:    31012 KiB   3%     15328 KiB  38%    12090
> Texts:         882272 KiB  96%     23565 KiB  59%     7226
> Signatures:         0 KiB   0%         0 KiB   0%        0
> Total:         917275 KiB 100%     39720 KiB 100%    20359
> 
> 

...

> and versus the original knit repo:
> Commits: 1043
>                       Raw    %    Compressed    %  Objects
> Revisions:       3949 KiB   0%      1201 KiB   8%     1043
> Inventories:   842115 KiB  48%      1840 KiB  12%     1043
> Texts:         882272 KiB  51%     11174 KiB  78%     7226
> Signatures:         0 KiB   0%         0 KiB   0%        0
> Total:        1728337 KiB 100%     14216 KiB 100%     9312
> 
> 
> Something still isn't right with the gc+chk255 repository, as we
> certainly should be getting *some* compression for inventories, better
> than the chk255 repository. I mostly just wanted to point out that
> without proper ordering the gc compressed texts actually double in size.
> 

So I tried to change the code to do a "gc-optimal" ordering. However, it
turns out not to work as well as we would like:
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       828 KiB   2%     1043
Inventories:    31012 KiB   3%     12747 KiB  42%    12090
Texts:         882272 KiB  96%     16465 KiB  54%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         917275 KiB 100%     30040 KiB 100%    20359
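
By "gc-optimal" I mean, roughly: bucket the texts by file_id, and within
each file emit the texts newest-first (reverse-topological order), so
that the newest text goes in first and older versions become deltas
against it. Point (3) further down describes the same thing in words.
A rough sketch of that ordering, with made-up names rather than the
real sort code:

# Sketch only: text keys are assumed to be (file_id, revision_id)
# tuples, and parent_map maps each key to its parent keys for the same
# file.

def gc_optimal_order(keys, parent_map):
    # Bucket by file_id so all texts of one file land next to each other.
    by_file = {}
    for key in keys:
        by_file.setdefault(key[0], []).append(key)
    result = []
    for file_id in sorted(by_file):
        # A topological sort gives oldest-first (parents before
        # children); reverse it so the newest text comes first.
        ordered = _topo_sort(by_file[file_id], parent_map)
        ordered.reverse()
        result.extend(ordered)
    return result

def _topo_sort(keys, parent_map):
    # Minimal depth-first topological sort restricted to 'keys'.
    pending = set(keys)
    emitted = set()
    result = []
    def visit(key):
        if key in emitted or key not in pending:
            return
        emitted.add(key)
        for parent in parent_map.get(key, ()):
            visit(parent)
        result.append(key)
    for key in keys:
        visit(key)
    return result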

When I investigate the raw compression, I see stuff that looks like this:

['label:
sp1f-table.cc-19700101030959-nsxtem2adyqzwe6nz4cgrpcmts3o54v7\x00sp1r-monty@donna.mysql.com-20010131024725-37689\n',
 'sha1: 160494ab404377b2a15be1e91c0cad43f45f9b91\n',
 'c,172,71\n',
 'i,1\n',
 '   \n',
 'c,244,252\n',
 'i,1\n',
 '   \n',
 'c,497,246\n',
 'i,1\n',
 '   \n',
 'c,744,214\n',
 'c,2067874,36\n',
 'c,1078,25\n',
 'c,1652729,19\n',
 'c,1127,21\n',
 'i,2\n',
 '\n',
 '\n',
 'c,2067953,38\n',
 'i,1\n',
 '\n',
 'c,2068000,248\n',
 'i,2\n',
 '\n',
 '\n',
 'c,2068258,103\n',
 'i,1\n',
 '{\n',
 'c,2068371,79\n',
 'i,2\n',
 '}\n',
 '\n',
...

(In total, this record comprises 711 of these lines).

Basically, it is copying lines from all over the place, and 99% of the
insert statements are for the 'trivial' case of '/*\n' where the
instruction to insert is shorter than the copy instruction.  (Actually,
for this specific text, I don't see *any* genuine insertion. My guess is
that it is a merge revision, so the only thing it provides is a unique
combination of copies.)
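
To make the instruction stream concrete: roughly, a 'c,<start>,<length>'
line copies <length> bytes starting at offset <start> in the group's
accumulated content, and an 'i,<n>' line means the next <n> lines are
inserted verbatim. A simplified toy decoder along those lines (just for
illustration, not the real extraction code):

def expand_delta(delta_lines, group_content):
    out = []
    pos = 0
    # Skip the 'label:' / 'sha1:' header lines.
    while pos < len(delta_lines) and not (
            delta_lines[pos].startswith('c,')
            or delta_lines[pos].startswith('i,')):
        pos += 1
    while pos < len(delta_lines):
        line = delta_lines[pos]
        if line.startswith('c,'):
            # Copy instruction: pull bytes out of the group content.
            _, start, length = line.rstrip('\n').split(',')
            out.append(group_content[int(start):int(start) + int(length)])
            pos += 1
        elif line.startswith('i,'):
            # Insert instruction: the next <count> lines are literal text.
            count = int(line.rstrip('\n').split(',')[1])
            out.extend(delta_lines[pos + 1:pos + 1 + count])
            pos += 1 + count
        else:
            raise ValueError('unrecognized instruction: %r' % (line,))
    return ''.join(out)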

My guess is that the problem is combining similar-but-not-exact texts
in the same file. With that structure, I'm guessing it is having a hard
time finding nice contiguous regions to combine: it finds lots of
similar sub-chunks, but that keeps it from finding large contiguous
regions.

I don't have a great answer here, but I have some thoughts:

1) The first time a given key prefix (file_id) is seen in a group
compressor, insert all lines.

2) Insert all lines whenever the 'compressed' form exceeds a certain
number of 'operations' or 'total length' (certainly if
len(comp_bytes) > len(raw_lines), but perhaps even if it is more than
0.5x the raw size). A rough sketch of (1) and (2) follows the list.

3) My 'optimal' ordering is sorting by file-id, and then doing
reverse-topological sorting from there (the ordering sketched above).
When I look closely at it, it seems that this ends up inserting at
least 20 completely empty texts as the first few texts. I don't quite
understand that, but it is certainly odd, and would cause bad matches.

4) The "insert if copy is longer" code could probably be improved. I've
seen cases where we have a regular insert, followed by one of these
pseudo-inserts. The problem with a pseudo-insert is that it is not added
to the index of lines to match against.

5) Maybe the insertion of lines could be triggered if there is a series
of copy instructions with no insertions, as that could be indicative of
a code block that is going to be repeated often in the future.
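
For (1) and (2), the kind of check I have in mind is something like the
following. This is only a sketch with made-up names and thresholds (the
real group compressor code is structured differently); the point is
just "fall back to inserting the raw lines when the delta is not buying
us much, or when this is the first text we have seen for a file_id":

def should_insert_raw(file_id, raw_bytes, delta_ops, delta_bytes,
                      seen_file_ids, max_ops=500, max_ratio=0.5):
    # (1) First text seen for this file_id in the current group: store
    # it as a plain insert so later versions have one long contiguous
    # base to copy from.
    if file_id not in seen_file_ids:
        seen_file_ids.add(file_id)
        return True
    # (2) The "compressed" form is not buying us much: too many
    # instructions, or the delta is a large fraction of the raw size
    # (and certainly if it is bigger than the raw text itself).
    if delta_ops > max_ops:
        return True
    if delta_bytes >= len(raw_bytes):
        return True
    if delta_bytes > max_ratio * len(raw_bytes):
        return True
    return False

The hope with (1) is that the first table.cc in a group goes in whole,
so every later table.cc can copy one long contiguous region instead of
stitching together sub-chunks.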



I went ahead and tried a quick implementation of (1). And for the most
part, it worked. The compression time was quite a bit slower (10m =>
24m), probably because there were more regions to check for the
longest match, which I assume is exacerbated by something like a
common copyright header. However, the total compression was better:

Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       826 KiB   3%     1043
Inventories:    31012 KiB   3%     12617 KiB  50%    12090
Texts:         882272 KiB  96%     11659 KiB  46%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         917275 KiB 100%     25102 KiB 100%    20359

Though it is still overall *worse* than the knit compression :(.

I'm still playing with it, tweaking some of the parameters (like
avoiding all matches against empty texts, and starting new compression
groups earlier when we run into a new file_id.)
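
Concretely, those tweaks are along these lines (again just a sketch;
the names and the soft limit are made up):

def want_new_group(current_file_id, next_file_id, group_raw_bytes,
                   soft_limit=2 * 1024 * 1024):
    # Break the group early when we move to a new file_id and the
    # current group already holds a decent amount of raw content,
    # rather than waiting for a hard size limit.
    return (next_file_id != current_file_id
            and group_raw_bytes > soft_limit)

def worth_indexing(lines):
    # Do not add completely empty texts to the match index at all; they
    # only create useless match candidates.
    return bool(lines)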

John
=:->


