Better compression

John Arbash Meinel john at arbash-meinel.com
Wed Jul 30 16:38:12 BST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
| On Mon, 2008-07-28 at 22:31 -0500, John Arbash Meinel wrote:
|> -----BEGIN PGP SIGNED MESSAGE-----
|> Hash: SHA1
|
|
|> So I was talking to Martin about this, and I realize that we can
|> actually transform the delta without having to extract all of the full
|> texts. I'll outline it here. In the short term, we should know what the
|> CPU overhead is, before we decide to spend the full work to implement
|> this. But here goes:
|
| Paraphrasing: reuse the matching line data by translating the byte-reuse
| information into lines-in-texts information, and then applying that to
| the output stream for the texts that this references, followed by
| insertion statements for texts that are being omitted.
|
| I actually considered some variations of this when Aaron raised the
| issue of getting efficient 'matching_blocks' information about texts
| stored in this format.

Yeah, except 'matching-blocks' probably wants linearly increasing
values, which group-compress doesn't give us. (On the plus side, it
handle moving code around quite graciously.)

|
| My feeling about this was that its probably cheaper just to diff on the
| fly. I'd certainly want to hold off on this until we've done other less
| complex things first. I could be wrong - the complexity of this
| transform is O(text-length * texts in stream before this text), wereas
| re-diffing can be O(text-length ^ 2) [though I haven't rigorously
| analysed this, possibly its lower)
|
| -Rob

I think for translating 1 text into delta, maybe it is cheaper. However,
you could translate all of the deltas in the GC object in 1 pass. And
save the known information as you go.

And don't forget, to do an O(N^2) diff algorithm, you first have to
extract the full text, which in itself has some CPU overhead. The nice
thing about translating the deltas is that you don't have to deal in
full-texts along the way.

I wasn't planning on writing it yet. We need to understand what we are
really running into first. But I do think it is an interesting way to go.

John
=:->


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiQiuQACgkQJdeBCYSNAAPiGgCeM61b2iuLshwtZblKOAT0GYzT
g4kAn0bkc0Sb4GxUj4ZoX/OCyJDZGP6l
=ld5i
-----END PGP SIGNATURE-----



More information about the bazaar mailing list