[MERGE][RFC] Add simple revision serializer based on RIO.
John Arbash Meinel
john at arbash-meinel.com
Mon May 11 14:50:04 BST 2009
Martin Pool wrote:
> 2009/5/11 Matt Nordhoff <mnordhoff at mattnordhoff.com>:
>> Martin Pool wrote:
>> [snip]
>>
>>> However, before moving to RIO for future formats (and I say this
>>> having added the code) I would think hard about whether it should use
>>> bencode instead, which has the advantage of being able to represent
>>> somewhat more complex nesting (like dicts inside dicts) without
>>> needing a separate layer of encoding on top. Revisions are pretty
>>> simple but even there it may be useful. I'm not sure about the
>>> relative performance.
>> If I understand correctly, RIO is line-based but bencode is not. Is the
>> delta format still line-based? If so, using bencode would be more difficult.
>
> We don't do line-by-line compression on revisions because generally
> speaking there's not much in common between them. zlib compression in
> groupcompress will pick out common strings like committer names. Good
> question though.
>
We do "line-by-line" delta compression in --dev6 because I found that
this assumption was incorrect. We get 2:1 compression improvements by
doing delta compression: the only difference for the 'Revisions'
numbers is that --dev6 puts them in groups with delta compression.
Quoting from:
http://bazaar-vcs.org/Roadmap/BrisbaneCore/Details
$ wbzr repository-details mysql-5.1-test
Commits: 56363
            Raw          %    Compressed   %    Objects
Revisions:  108754 KiB   0%   42960 KiB    8%   56363
vs
$ time wbzr repository-details mysql-gc255big
Commits: 56363
            Raw          %    Compressed   %    Objects
Revisions:  110860 KiB   0%   19198 KiB    13%  56363
So that is 42960 KiB => 19198 KiB compressed, or about 2.2:1.
Now, MySQL is a bit of a special case, because they have a lot of
per-file commit info. However, python is just a bzr-svn conversion, and
they get:
$ time wbzr repository-details python
Commits: 46161
            Raw          %    Compressed   %    Objects
Revisions:  24169 KiB    0%   17007 KiB    7%   46161
down to:
$ time wbzr repository-details python-gc255big
Commits: 46161
            Raw          %    Compressed   %    Objects
Revisions:  24246 KiB    0%   5152 KiB     6%   46161
Or 3.3:1 (17007 KiB => 5152 KiB). I don't remember the bzr.dev numbers offhand.
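For anyone checking the arithmetic, the two ratios can be recomputed from the "Compressed" column of the repository-details output quoted above. This is only a small sketch; the dictionary keys and the function name are made up for illustration.

```python
# Compressed "Revisions" sizes in KiB, copied from the output quoted above:
# (old format, --dev6 groupcompress format).
compressed_kib = {
    "mysql": (42960, 19198),
    "python": (17007, 5152),
}


def compression_ratio(before_kib, after_kib):
    """Return how many times smaller the new format is than the old one."""
    return before_kib / after_kib


ratios = {
    name: compression_ratio(before, after)
    for name, (before, after) in compressed_kib.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}:1")
```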
I just wanted to point out that the truth behind delta compressing
revision texts changed. Now maybe it is only because GC is sub-line
based, and able to do multiple texts. But certainly in the new format,
we *do* delta compression for Revision texts.
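A rough way to see why grouping helps, using nothing but zlib from the standard library: compress a set of revision-like texts one at a time, then compress them concatenated. This is not the groupcompress algorithm (which does its own sub-line delta matching before zlib), and the text layout in make_revision() is invented for the example rather than being the real revision serialization.

```python
import zlib


def make_revision(i):
    # Hypothetical revision-like text; NOT the real bzr revision format.
    return (
        "committer: Jane Doe <jane@example.com>\n"
        "branch-nick: trunk\n"
        f"revision-id: jane@example.com-2009-{i:08d}\n"
        f"message: fix bug {i}\n"
    ).encode("utf-8")


texts = [make_revision(i) for i in range(200)]

# Each text compressed on its own: shared strings are stored again every time.
individual = sum(len(zlib.compress(t)) for t in texts)

# All texts compressed as one group: shared strings become back-references.
grouped = len(zlib.compress(b"".join(texts)))

print(f"individual: {individual} bytes, grouped: {grouped} bytes")
```

The grouped figure should come out well below the individual one, since the committer and nick lines repeat in every text.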
John
=:->