groupcompress extraction 10x faster
John Arbash Meinel
john at arbash-meinel.com
Thu Feb 19 22:04:16 GMT 2009
Robert Collins wrote:
> On Thu, 2009-02-19 at 14:52 -0600, John Arbash Meinel wrote:
>> I don't have a great answer here, but I have some thoughts:
>>
>> 1) The first time a given key prefix (file_id) is seen in a group
>> compressor, insert all lines.
>
> I don't think this will help that much, and in fact I think it would be
> useful to be able to compress things like Makefile across these
> boundaries. (because they may well be very similar).
>
Well, it turned out to go from 16MB => 11MB, so it indeed had a very
positive effect.
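For reference, the shape of (1) is roughly this -- purely illustrative, not
the real GroupCompressor code, and the function names are made up:

def compress_texts(keyed_texts, delta_fn):
    """Illustrative only.

    keyed_texts yields ((file_id, revision_id), lines) pairs;
    delta_fn(lines) returns a delta against the group built so far.
    """
    seen_file_ids = set()
    entries = []
    for key, lines in keyed_texts:
        file_id = key[0]
        if file_id not in seen_file_ids:
            # First text for this file_id in the group: insert every line,
            # so later versions of the same file have a good base to copy from.
            seen_file_ids.add(file_id)
            entries.append((key, 'fulltext', lines))
        else:
            entries.append((key, 'delta', delta_fn(lines)))
    return entries
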
>> 2) Insert all lines whenever the 'compressed' form exceeds a certain
>> number of 'operations' or 'total length'. (certainly if
>> len(comp_bytes) > len(raw_lines), but perhaps even if it is > 0.5 the
>> raw lines.)
>
> This is a more interesting metric - its one I've been considering using
> using for triggering 'time for a new group'. (perhaps with a counter
> e.g. when the last 5 texts have been more complex). And perhaps even
> remove them from the old group and put into a fresh one.
>
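To make that concrete, the trigger could be as small as something like this
(both thresholds are invented, not measured):

COMPLEX_RATIO = 0.5  # invented: delta bigger than half the raw text is "complex"
COMPLEX_RUN = 5      # invented: this many complex texts in a row => new group

def choose_encoding(raw_len, delta_len, complex_count):
    """Return (use_fulltext, new_complex_count, start_new_group)."""
    if delta_len > COMPLEX_RATIO * raw_len:
        complex_count += 1
        return True, complex_count, complex_count >= COMPLEX_RUN
    return False, 0, False
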
>> 3) My 'optimal' ordering is sorting by file-id, and then doing
>> reverse-topological sorting from there. When I look closely at it, it
>> seems that this ends up inserting at least 20 completely empty texts as
>> the first few texts. I don't quite understand that, but it is certainly
>> odd, and would certainly cause bad matches.
>
> so that's 'newest first'?
Essentially.
>
> The ordering we should aim for (because we know it performs well and is
> a good starting point) is something like basename, size; which fileid,
> newest-first is a decent proxy for. So that sort should behave ok.
I'm not convinced that "basename, size" is optimal. I know it is roughly what
git uses. In my past testing, xdelta against multiple parents gave one of
the best compressions for builtins.py. (The "best" forms were about half the
size of the "ok" ones.)
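For comparison, the two orderings could be sketched like this (illustrative
helpers, not bzrlib API; I'm assuming keys are (file_id, revision_id) tuples
and that we have a topologically sorted revision list, oldest first):

import posixpath

def basename_size_key(path, size):
    # git-style grouping: texts with the same basename sort together,
    # largest first, so e.g. every Makefile ends up adjacent.
    return (posixpath.basename(path), -size)

def fileid_newest_first(keys, topo_revisions):
    # Sort by file_id, newest revision first within each file.
    rank = {rev: idx for idx, rev in enumerate(topo_revisions)}
    return sorted(keys, key=lambda key: (key[0], -rank.get(key[1], 0)))
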
>
>> 4) The "insert if copy is longer" code could probably be improved. I've
>> seen cases where we have a regular insert, followed by one of these
>> pseudo-inserts. The problem with a pseudo-insert is that it is not
>> added to the index of lines to match against.
>
> Yes, we probably need to do that.
>
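The fix might be as small as making sure the pseudo-insert path also feeds
the line index, e.g. (hypothetical index layout, not the real one):

def record_insert(line_index, output_lines, lines):
    """Append lines to the output and index each one, whether it came from
    a regular insert or a pseudo-insert (a copy emitted as literal text)."""
    for line in lines:
        line_index.setdefault(line, []).append(len(output_lines))
        output_lines.append(line)
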
>> 5) Maybe the insertion of lines could be triggered if there is a series
>> of copy instructions with no insertions, as that could be indicative of
>> a code block that is going to be repeated often in the future.
>
> Possibly. Alternatively we can look at starting new groups.
>
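Either way, detecting the situation is cheap; roughly something like this
(the instruction tuples and the threshold are made up):

COPY_RUN_LIMIT = 10  # made-up threshold

def long_copy_run(instructions):
    """instructions: iterable of ('copy'|'insert', payload) tuples.

    True if there is a run of COPY_RUN_LIMIT copies with no insert between.
    """
    run = 0
    for kind, _ in instructions:
        if kind == 'copy':
            run += 1
            if run >= COPY_RUN_LIMIT:
                return True
        else:
            run = 0
    return False
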
>> I went ahead and tried a quick implementation of (1). And for the most
>> part, it worked. The compression time was quite a bit slower (10m =>
>> 24m), probably because there were more regions it had to check for the
>> longest match,
>
> Likely. Note that while we don't aim for unique lines, by being fairly
> sensitive we shouldn't end up with many collisions in the current
> algorithm (and its collisions that will drive up the inner loop time
> sorting out which of the multiple possible matches is really valid).
>
Sure. I do think we still have some issues with very common lines like
"\n". (I think we may have avoided '\n' specifically, but now we run
into things like '\n\n\n'.)
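One cheap guard (again just a sketch, with an invented cap) would be to stop
indexing a line once it already has too many entries:

MAX_LINE_OCCURRENCES = 64  # invented cap

def maybe_index(line_index, line, offset):
    # Blank-ish lines like '\n\n\n' show up thousands of times; once a line
    # has this many entries, more matches just slow down the inner loop.
    offsets = line_index.setdefault(line, [])
    if len(offsets) < MAX_LINE_OCCURRENCES:
        offsets.append(offset)
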
>> I would assume exacerbated by something like a common copyright header.
>> However, the total compression was better:
>>
>> Commits: 1043
>>                   Raw    %   Compressed    %   Objects
>> Revisions:    3990 KiB   0%      826 KiB   3%      1043
>> Inventories: 31012 KiB   3%    12617 KiB  50%     12090
>> Texts:      882272 KiB  96%    11659 KiB  46%      7226
>> Signatures:      0 KiB   0%        0 KiB   0%         0
>> Total:      917275 KiB 100%    25102 KiB 100%     20359
>>
>> Though it is still overall *worse* than the knit compression :(.
>>
>> I'm still playing with it, tweaking some of the parameters (like
>> avoiding all matches against empty texts, and starting new compression
>> groups earlier when we run into a new file_id.)
>
> One easy knob is to lower the max raw group size. I think we should be
> wary of tying compression choices to fileids or other higher layer data
> - the compression results should dictate the low level behaviour, while
> higher layers (the GCVF) should structure the input to make good use of
> the lower level.
I sort of agree; the problem is that the compressor knows when things
will fit, while the GCVF knows where the file boundaries are.
>
> Another separate but useful tool would be to get some stats - a basic
> histogram or scatter plot - of the encoded size, position in file,
> #copies, #inserts, bytes-in-output-from-copies, and
> bytes-in-output-from-inserts. We could use that to spot common
> pathologies, get ratios, etc.
>
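Agreed, that would be easy to collect while compressing; roughly this
(assuming each delta is available as ('copy'|'insert', bytes-contributed)
pairs, which is not the real encoding):

from collections import Counter

def delta_stats(position, encoded_size, instructions):
    """instructions: iterable of (kind, out_bytes), kind in ('copy', 'insert'),
    where out_bytes is what that instruction contributes to the expanded text."""
    counts, out = Counter(), Counter()
    for kind, out_bytes in instructions:
        counts[kind] += 1
        out[kind] += out_bytes
    return {'position': position,
            'encoded_size': encoded_size,
            'copies': counts['copy'],
            'inserts': counts['insert'],
            'bytes_from_copies': out['copy'],
            'bytes_from_inserts': out['insert']}
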
> What we should be expecting to see is that long files that are older
> than tip are encoded in many copy instructions, with a few insertions
> (to add deleted content). So 711 copy instructions isn't actually bad,
> unless the file is only 711 lines long :).
>
> -Rob
Looking at the raw bits, I can often see a lot of "copy 1 line, copy 2
lines, copy 1 line, copy 1 line, copy 10 lines, etc". It might be
reasonable to collapse only the "copy < 2 lines" instructions together
into an "insert".
John
=:->