[MERGE][RFC] Add simple revision serializer based on RIO.

John Arbash Meinel john at arbash-meinel.com
Mon May 11 14:50:04 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> 2009/5/11 Matt Nordhoff <mnordhoff at mattnordhoff.com>:
>> Martin Pool wrote:
>> [snip]
>>
>>> However, before moving to RIO for future formats (and I say this
>>> having added the code) I would think hard about whether it should use
>>> bencode instead, which has the advantage of being able to represent
>>> somewhat more complex nesting (like dicts inside dicts) without
>>> needing a separate layer of encoding on top.  Revisions are pretty
>>> simple but even there it may be useful.  I'm not sure about the
>>> relative performance.
>> If I understand correctly, RIO is line-based but bencode is not. Is the
>> delta format still line-based? If so, using bencode would be more difficult.
> 
> We don't do line-by-line compression on revisions because generally
> speaking there's not much in common between them.  zlib compression in
> groupcompress will pick out common strings like committer names.  Good
> question though.
> 

We do "line-by-line" delta compression in --dev6 because I found that
this assumption was incorrect. We get 2:1 compression improvements by
doing delta compression. The numbers below isolate that effect, since
the only difference for 'revisions' is that --dev6 puts them in groups
w/ delta compression.

Quoting from:
http://bazaar-vcs.org/Roadmap/BrisbaneCore/Details

$ wbzr repository-details mysql-5.1-test
  Commits: 56363
                        Raw    %    Compressed    %  Objects
  Revisions:     108754 KiB   0%     42960 KiB   8%    56363

vs

$ time wbzr repository-details mysql-gc255big
  Commits: 56363
                        Raw    %    Compressed    %  Objects
  Revisions:     110860 KiB   0%     19198 KiB  13%    56363

So the compressed size goes from 42960 KiB => 19198 KiB, or about 2.2:1.

Now, MySQL is a bit of a special case, because they have a lot of
per-file commit info. However, python is just a bzr-svn conversion, and
they get:
$ time wbzr repository-details python
  Commits: 46161
                        Raw    %    Compressed    %  Objects
  Revisions:      24169 KiB   0%     17007 KiB   7%    46161

down to:
$ time wbzr repository-details python-gc255big
  Commits: 46161
                        Raw    %    Compressed    %  Objects
  Revisions:      24246 KiB   0%      5152 KiB   6%    46161

Or 3.3:1 (17007 KiB => 5152 KiB). I don't remember the bzr.dev numbers
offhand.
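
For anyone who wants to re-check the arithmetic, here is a minimal sketch
that recomputes both ratios from the compressed "Revisions" sizes quoted
above. The dictionary layout and the ratio() helper are only illustrative
names, not anything in bzr.

```python
# Compressed "Revisions" sizes in KiB, copied from the repository-details
# output quoted above: (old format, gc255big format).
compressed_kib = {
    "mysql": (42960, 19198),   # mysql-5.1-test vs mysql-gc255big
    "python": (17007, 5152),   # python vs python-gc255big
}


def ratio(before, after):
    """Return how many times smaller `after` is than `before`."""
    return before / after


# Round to one decimal place, matching the precision used in the text.
ratios = {name: round(ratio(b, a), 1) for name, (b, a) in compressed_kib.items()}

for name, r in ratios.items():
    print(f"{name}: {r}:1")
```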


I just wanted to point out that the facts about delta compressing
revision texts have changed. Now maybe it is only because GC is sub-line
based, and able to delta against multiple texts. But certainly in the
new format, we *do* delta compression for Revision texts.
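
As a rough illustration of why grouping helps, here is a small sketch
that compresses a set of similar texts one at a time and then as a single
group. This is *not* groupcompress: plain zlib over the concatenated
texts stands in for it, and the revision fields and values are made up
for the example.

```python
import zlib


def make_revision(n):
    """Build a hypothetical revision text as 'key: value' lines."""
    return (
        f"revision-id: jane@example.com-20090511-{n:08d}\n"
        "committer: Jane Doe <jane@example.com>\n"
        "timezone: 3600\n"
        f"message: fix bug {n}\n"
    ).encode("utf-8")


revisions = [make_revision(n) for n in range(200)]

# Each text compressed on its own: the repeated committer line cannot be
# shared between texts, and every text pays the zlib framing overhead.
separate = sum(len(zlib.compress(r)) for r in revisions)

# All texts compressed as one group: strings common to many revisions
# are stored once and referenced afterwards.
grouped = len(zlib.compress(b"".join(revisions)))

print(f"separate: {separate} bytes, grouped: {grouped} bytes")
```

How big the gap is depends entirely on how much the texts share, so the
sizes printed here say nothing about real repositories.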

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoILQwACgkQJdeBCYSNAAPhKQCaA7PdZTGiH9/EBwFqukLfWZ0s
h/IAoKNccDghuTXoK9nARt9/+ZvO52ac
=EqwG
-----END PGP SIGNATURE-----


