groupcompress extraction 10x faster

Thu Feb 19 22:39:05 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Thu, 2009-02-19 at 16:04 -0600, John Arbash Meinel wrote:
>> Looking at the raw bits I'm seeing. I can often see a lot of "copy 1
>> line, copy 2 lines, copy 1 line, copy 1 line, copy 10 lines, etc". It
>> might be reasonable to only collapse the "copy <2 lines" together into
>> an "insert".
> 
> If you think about this like knit compression - inserting full texts to
> stop really deep chains - what you're proposing is inserting full texts
> here and there. Another way is just to start a new group when the weight
> of these things starts to be too much. But seeing these runs of copies
> is precisely the goal of groupcompress, so its not actually concerning
> me *that its like that*. What concerns me is more what actual
> overhead/benefit we're paying.
> 
> Consider a file that goes
> ACEG
> ABCEG
> ABCDEG
> ABCDEFG
> 
> we'll output
> ABCDEFG
> then
> c1,5 c7,1
> then
> c1,3 c5,1 c7,1
> then
> c1,1 c3,1 c5,1 c7,1
> 
> So Block copying a reduced region to allow later texts to reference it
> more pithily is a decent idea; its hard though to know where the
> tradeoff will sit - I don't have any quick suggestions :(.
> 
> -Rob
> 

Right, and if you compare that with what knits or xdelta would output,
which would be something like:

d6
d4
d2

My first pass at implementing this did not actually do anything
beneficial. The final result was ~ the same size as the original, which
is about 50% larger that inserting a fulltext when we get a new file_id.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmd34kACgkQJdeBCYSNAAOevwCgjPHlhp/cEExCABU6Czawy1DW
j0EAn3+mCsOlOrcW0s9auKt5QnrEDZxg
=v8cN
-----END PGP SIGNATURE-----