groupcompress+chk some more permutations

John Arbash Meinel john at
Thu Mar 5 20:17:35 GMT 2009

John Arbash Meinel wrote:
> So I've been doing a few more permutations with groupcompress and chk
> inventories, and I figured I'd try to sum up some of what I've found. I
> have numbers to back up pretty much all of this, but the email is
> already really long...
>
> Summary
>
> I think it is worth exploring lzma compression more. I think bigger leaf
> pages are a net win, though it needs more benchmarking. I think getting
> rid of the labels for CHK pages makes sense, but we can leave them in
> the text pages, etc. With all of these updates (lzma, bigpage, nolabels)
> I finally have a bzr repository that is smaller than the --window=200
> git version (7.6MB versus 8.1MB). So at this point, there is still a bit
> of exploration to do, but it mostly comes down to us deciding where we
> want the size/speed tradeoffs to lie, as well as the cost of an external
> dependency, etc.

I also think I've found one of the reasons why I've had as much
difficulty with this as I have. Of the 7,226 (file_id, revision_id) texts
present in the repository, only 6,586 have unique sha1s, which is a
difference of 640 texts, or roughly 10%. The delta code would do a pretty
good job of minimizing the cost of those duplicates, but I certainly
don't think it would eliminate it. (If you got lucky, and one of the
original texts was the first entry, you would have some very efficient
deltas, but if you were unlucky and the dupe was further back...)
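To illustrate how such a count can be made, here is a minimal sketch in
Python. It is not bzr code: the function name duplicate_stats and the
plain dict of texts are assumptions made up for this example, and the
sample data is invented. It only shows the idea of comparing the number
of (file_id, revision_id) keys against the number of distinct sha1s.

```python
import hashlib


def duplicate_stats(texts):
    """Return (total, unique, duplicates) for a mapping of texts.

    `texts` is a hypothetical dict mapping (file_id, revision_id)
    tuples to the full text as bytes.
    """
    # Collapse every text to its sha1; identical bytes collapse together.
    sha1s = {hashlib.sha1(content).hexdigest() for content in texts.values()}
    total = len(texts)
    unique = len(sha1s)
    return total, unique, total - unique


# Invented sample data, not taken from any real repository.
texts = {
    ("file-a", "rev-1"): b"hello\n",
    ("file-a", "rev-2"): b"hello world\n",
    ("file-a", "rev-3"): b"hello\n",  # content reverted to rev-1
    ("file-b", "rev-1"): b"hello\n",  # same bytes under another file_id
}
total, unique, dupes = duplicate_stats(texts)
print(total, unique, dupes)
```

Note that duplicates can come both from a file returning to earlier
content and from different file_ids holding the same bytes.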

We might want to consider changing our text storage so that the actual
bytes of text are referenced by sha1, but the meta-info like the per-file
graph, etc. is still keyed by (file_id, revision_id).
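A rough sketch of what that split could look like, purely as an
illustration: the class Sha1TextStore and its methods are hypothetical
names, not an existing bzr API, and everything is held in in-memory
dicts. The bytes are stored once per sha1, while the index and the
per-file graph stay keyed by (file_id, revision_id).

```python
import hashlib


class Sha1TextStore:
    """Hypothetical store: content keyed by sha1, metadata by key."""

    def __init__(self):
        self._blobs = {}    # sha1 hex digest -> text bytes
        self._index = {}    # (file_id, revision_id) -> sha1 hex digest
        self._parents = {}  # (file_id, revision_id) -> tuple of parent keys

    def add(self, file_id, revision_id, content, parents=()):
        digest = hashlib.sha1(content).hexdigest()
        # Identical bytes are stored only once, whatever key they came from.
        self._blobs.setdefault(digest, content)
        key = (file_id, revision_id)
        self._index[key] = digest
        self._parents[key] = tuple(parents)
        return digest

    def get(self, file_id, revision_id):
        return self._blobs[self._index[(file_id, revision_id)]]

    def parents(self, file_id, revision_id):
        return self._parents[(file_id, revision_id)]

    def stats(self):
        """Return (number of keys, number of stored blobs)."""
        return len(self._index), len(self._blobs)


store = Sha1TextStore()
store.add("file-a", "rev-1", b"hello\n")
store.add("file-a", "rev-2", b"hello world\n", parents=[("file-a", "rev-1")])
store.add("file-a", "rev-3", b"hello\n", parents=[("file-a", "rev-2")])
print(store.stats())
```

In this sketch rev-3 costs only an index entry, because its bytes are
already present under the sha1 recorded for rev-1; the per-file graph is
unaffected by the sharing.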
