Recompressing knits

John Arbash Meinel john at arbash-meinel.com
Sat Aug 26 19:33:50 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Matthieu Moy wrote:
>>> Hi,
>>>

...

> 
> We don't usually ungzip the entire file.  We open the knit index, seek
> to the piece we want, and ungzip just that.  This should be faster than
> ungzipping the whole file.  gzipping the whole file as one piece would
> likely slow things down, and would break existing clients.
> 
> Aaron

Yeah, it would break things as they stand. But I wonder if there
wouldn't be potential in being able to "archive" old revisions. Then
they could be combined into maybe even a bzip2 hunk. I did a check once,
of how much space we could save if we collapsed everything together.

Here are the numbers for my bzr repository:

Just cat the current contents into one big hunk
find . -type f -name \*.knit -print0 | xargs -0 cat >\
	conglomerate.knit

# Uncompress it into one big text file
zcat conglomerate.knit > conglomerate

# Compress this with different methods
bzip2 -k conglomerate
7za a conglomerate.7za conglomerate
cat conglomerate | gzip > conglomerate.gz

Now how big are all of these:

183774438 conglomerate
 14105406 conglomerate.7za
 18492542 conglomerate.bz2
 31286005 conglomerate.gz
 43807111 conglomerate.knit

Or in human terms:
176M conglomerate
 14M conglomerate.7za
 18M conglomerate.bz2
 30M conglomerate.gz
 42M conglomerate.knit

Now, this is sort of a 'best case' scenario, considering the algorithms
can exploit redundancies between files (and since each line is annotated
with a revision id, there is lots of redundancy between files, as well
as within files)

But what does it tell us:

1) Our default format, of using compressed chunks is actually quite
good. We are getting 176/42 ~= 4.2:1 compression with plain knit files.
If we switched to collapsing that, the ration only goes up to 176/30 =
5.8:1.
So we can get another small bump in compression, but not 2x or anything
like that.

2) Using a different algorithm, we can get as good as 176/14 = 12.5:1
compression. I *really* wish that there was a nice LZMA package, because
I think it embodies some really good tradeoffs for bzr. (It takes longer
to compress than bzip2, but it can decompress almost as fast as gzip).

I think it would be really interesting for us to explore some sort of
packed-delta (possibly skip delta, possibly???) format for archiving
really old data. 7za is more like '.zip' than '.gz', in that it is a
file container. It has some interesting properties, like 'solid', which
is packing more than one file into the compression hunk, but still
having an index so that you don't have to extract *all* hunks, like you
would with a tar.bz2 file.

I don't know the physical properties of a 7za file all that well. Like
are the indexes something that would let you download just part of the
file, etc. I have a feeling it is something like .zip (where you have an
index at the end of the file, which points to earlier indexes, etc,
etc). And we would want to optimize for a format that doesn't require
round trips. So ultimately we probably want to keep a separate index file.

But I really think we could come up with a 'packed' format, that bundles
up a bunch of revisions, and all of the associated texts, etc, that
could just be used as a big hunk. So we keep something nice and clean
for individual knits, like we have now. But we have a way to archive old
history into a pack (to go along with knit and weave: crochet?, afghan?)

Anyway, it is definitely something that I think we've thought about, and
is worth exploring. (The possibility of another 3:1 compression, and
reduction of round trips by having everything in one big file. Since we
already have nice readv support ...)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE8JPnJdeBCYSNAAMRAtReAJ98xDMyqdVwUqhpyBK5gQOOzffV2gCfeKqY
DpVgxribhdtfK7hxQsXviQ8=
=GEla
-----END PGP SIGNATURE-----




More information about the bazaar mailing list