groupcompress extraction 10x faster
John Arbash Meinel
john at arbash-meinel.com
Thu Feb 19 22:04:16 GMT 2009
Robert Collins wrote:
> On Thu, 2009-02-19 at 14:52 -0600, John Arbash Meinel wrote:
>> I don't have a great answer here, but I have some thoughts:
>>
>> 1) The first time a given key prefix (file_id) is seen in a group
>> compressor, insert all lines.
>
> I don't think this will help that much, and in fact I think it would be
> useful to be able to compress things like Makefile across these
> boundaries. (because they may well be very similar).
>
Well, it turned out to go from 16MB => 11MB, so it indeed had a very
positive effect.
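For reference, the shape of (1) is roughly this -- purely illustrative, not
the real GroupCompressor code, and the function names are made up:

def compress_texts(keyed_texts, delta_fn):
    """Illustrative only.

    keyed_texts yields ((file_id, revision_id), lines) pairs;
    delta_fn(lines) returns a delta against the group built so far.
    """
    seen_file_ids = set()
    entries = []
    for key, lines in keyed_texts:
        file_id = key[0]
        if file_id not in seen_file_ids:
            # First text for this file_id in the group: insert every line,
            # so later versions of the same file have a good base to copy from.
            seen_file_ids.add(file_id)
            entries.append((key, 'fulltext', lines))
        else:
            entries.append((key, 'delta', delta_fn(lines)))
    return entries
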
>> 2) Insert all lines whenever the 'compressed' form exceeds a certain
>> number of 'operations' or 'total length'. (certainly if
>> len(comp_bytes) > len(raw_lines), but perhaps even if it is > 0.5 the
>> raw lines.)
>
> This is a more interesting metric - its one I've been considering using
> using for triggering 'time for a new group'. (perhaps with a counter
> e.g. when the last 5 texts have been more complex). And perhaps even
> remove them from the old group and put into a fresh one.
>
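To make that concrete, the trigger could be as small as something like this
(both thresholds are invented, not measured):

COMPLEX_RATIO = 0.5  # invented: delta bigger than half the raw text is "complex"
COMPLEX_RUN = 5      # invented: this many complex texts in a row => new group

def choose_encoding(raw_len, delta_len, complex_count):
    """Return (use_fulltext, new_complex_count, start_new_group)."""
    if delta_len > COMPLEX_RATIO * raw_len:
        complex_count += 1
        return True, complex_count, complex_count >= COMPLEX_RUN
    return False, 0, False
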
>> 3) My 'optimal' ordering is sorting by file-id, and then doing
>> reverse-topological sorting from there. When I look closely at it, it
>> seems that this ends up inserting at least 20 completely empty texts as
>> the first few texts. I don't quite understand that, but it is certainly
>> odd, and would certainly cause bad matches.
>
> so that's 'newest first'?
Essentially.
>
> The ordering we should aim for (because we know it performs well and is
> a good starting point) is something like basename, size; which fileid,
> newest-first is a decent proxy for. So that sort should behave ok.
I'm not convinced that "basename, size" is optimal. I know it is roughly what
git uses. In my past testing, xdelta against multiple parents gave one of
the best compressions for builtins.py. (The "best" forms were about half the
size of the "ok" ones.)
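For comparison, the two orderings could be sketched like this (illustrative
helpers, not bzrlib API; I'm assuming keys are (file_id, revision_id) tuples
and that we have a topologically sorted revision list, oldest first):

import posixpath

def basename_size_key(path, size):
    # git-style grouping: texts with the same basename sort together,
    # largest first, so e.g. every Makefile ends up adjacent.
    return (posixpath.basename(path), -size)

def fileid_newest_first(keys, topo_revisions):
    # Sort by file_id, newest revision first within each file.
    rank = {rev: idx for idx, rev in enumerate(topo_revisions)}
    return sorted(keys, key=lambda key: (key[0], -rank.get(key[1], 0)))
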
>
>> 4) The "insert if copy is longer" code could probably be improved. I've
>> seen cases where we have a regular insert, followed by one of these
>> pseudo-inserts. The problem with a pseudo-insert is that it is not
>> added to the index of lines to match against.
>
> Yes, we probably need to do that.
>
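The fix might be as small as making sure the pseudo-insert path also feeds
the line index, e.g. (hypothetical index layout, not the real one):

def record_insert(line_index, output_lines, lines):
    """Append lines to the output and index each one, whether it came from
    a regular insert or a pseudo-insert (a copy emitted as literal text)."""
    for line in lines:
        line_index.setdefault(line, []).append(len(output_lines))
        output_lines.append(line)
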
>> 5) Maybe the insertion of lines could be triggered if there is a series
>> of copy instructions with no insertions, as that could be indicative of
>> a code block that is going to be repeated often in the future.
>
> Possibly. Alternatively we can look at starting new groups.
>
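Either way, detecting the situation is cheap; roughly something like this
(the instruction tuples and the threshold are made up):

COPY_RUN_LIMIT = 10  # made-up threshold

def long_copy_run(instructions):
    """instructions: iterable of ('copy'|'insert', payload) tuples.

    True if there is a run of COPY_RUN_LIMIT copies with no insert between.
    """
    run = 0
    for kind, _ in instructions:
        if kind == 'copy':
            run += 1
            if run >= COPY_RUN_LIMIT:
                return True
        else:
            run = 0
    return False
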
>> I went ahead and tried a quick implementation of (1). And for the most
>> part, it worked. The compression time was quite a bit slower (10m =>
>> 24m), probably because there were more regions it had to check for the
>> longest match,
>
> Likely. Note that while we don't aim for unique lines, by being fairly
> sensitive we shouldn't end up with many collisions in the current
> algorithm (and its collisions that will drive up the inner loop time
> sorting out which of the multiple possible matches is really valid).
>
Sure. I do think we still have some issues with very common lines like
"\n". (I think we may have avoided '\n' specifically, but now we run
into things like '\n\n\n'.)
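One cheap guard (again just a sketch, with an invented cap) would be to stop
indexing a line once it already has too many entries:

MAX_LINE_OCCURRENCES = 64  # invented cap

def maybe_index(line_index, line, offset):
    # Blank-ish lines like '\n\n\n' show up thousands of times; once a line
    # has this many entries, more matches just slow down the inner loop.
    offsets = line_index.setdefault(line, [])
    if len(offsets) < MAX_LINE_OCCURRENCES:
        offsets.append(offset)
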
>> I would assume exacerbated by something like a common copyright header.
>> However, the total compression was better:
>>
>> Commits: 1043
>>                   Raw    %   Compressed    %   Objects
>> Revisions:    3990 KiB   0%      826 KiB   3%      1043
>> Inventories: 31012 KiB   3%    12617 KiB  50%     12090
>> Texts:      882272 KiB  96%    11659 KiB  46%      7226
>> Signatures:      0 KiB   0%        0 KiB   0%         0
>> Total:      917275 KiB 100%    25102 KiB 100%     20359
>>
>> Though it is still overall *worse* than the knit compression :(.
>>
>> I'm still playing with it, tweaking some of the parameters (like
>> avoiding all matches against empty texts, and starting new compression
>> groups earlier when we run into a new file_id.)
>
> One easy knob is to lower the max raw group size. I think we should be
> wary of tying compression choices to fileids or other higher layer data
> - the compression results should dictate the low level behaviour, while
> higher layers (the GCVF) should structure the input to make good use of
> the lower level.
I sort of agree; the problem is that the compressor knows when things
will fit, while the GCVF knows where the file boundaries are.
>
> Another separate but useful tool would be to get some stats - a basic
> histogram or scatter plot - of the encoded size, position in file,
> #copies, #inserts, bytes-in-output-from-copies, and
> bytes-in-output-from-inserts. We could use that to spot common
> pathologies, get ratios, etc.
>
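Agreed, that would be easy to collect while compressing; roughly this
(assuming each delta is available as ('copy'|'insert', bytes-contributed)
pairs, which is not the real encoding):

from collections import Counter

def delta_stats(position, encoded_size, instructions):
    """instructions: iterable of (kind, out_bytes), kind in ('copy', 'insert'),
    where out_bytes is what that instruction contributes to the expanded text."""
    counts, out = Counter(), Counter()
    for kind, out_bytes in instructions:
        counts[kind] += 1
        out[kind] += out_bytes
    return {'position': position,
            'encoded_size': encoded_size,
            'copies': counts['copy'],
            'inserts': counts['insert'],
            'bytes_from_copies': out['copy'],
            'bytes_from_inserts': out['insert']}
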
> What we should be expecting to see is that long files that are older
> than tip are encoded in many copy instructions, with a few insertions
> (to add deleted content). So 711 copy instructions isn't actually bad,
> unless the file is only 711 lines long :).
>
> -Rob
Looking at the raw bits, I can often see a lot of "copy 1 line, copy 2
lines, copy 1 line, copy 1 line, copy 10 lines, etc". It might be
reasonable to collapse only the "copy < 2 lines" instructions together
into an "insert".
John
=:->