[MERGE] BEncode Revision Serializer

John Arbash Meinel john at arbash-meinel.com
Wed Jun 3 21:05:52 BST 2009


Aaron Bentley wrote:
> Jelmer Vernooij wrote:
>> Thanks for the review, the attached patch should fix the issues you've
>> raised. I'll do some more performance testing tomorrow.
> bb:approve
> I think this is good as a dev format, but I wonder whether bencode is
> too forgiving.  Shouldn't a serializer complain if the schema is
> violated (like if rev.revision_id is an int)?
> Aaron

So I went ahead and updated Jelmer's patch in a couple of ways.

1) Changed from using a dict() to using a list of tuples, so we could
control the byte stream.
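To illustrate the dict-vs-tuples point: a bencoded dict must emit its keys
in sorted order, so with a dict the byte layout is fixed, while a list of
(key, value) pairs serializes in whatever order we choose. A minimal
encoder sketch (bzrlib ships its own bencode implementation; this is just
for illustration):

```python
# Minimal bencode encoder -- a sketch, not bzrlib's implementation.
def bencode(obj):
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, (list, tuple)):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # bencode requires dict keys in sorted order, so the byte
        # stream cannot be reordered for better compression.
        return (b"d"
                + b"".join(bencode(k) + bencode(obj[k])
                           for k in sorted(obj))
                + b"e")
    raise TypeError("cannot bencode %r" % (obj,))

# A dict always serializes with sorted keys...
as_dict = bencode({b"timestamp": 1244059552, b"committer": b"jam"})
# ...while a list of (key, value) pairs keeps the order we pick,
# which is what lets us tweak the layout for delta/zlib behavior.
as_pairs = bencode([(b"committer", b"jam"), (b"timestamp", 1244059552)])
print(as_dict)
print(as_pairs)
```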
I'm still inspecting the gc blocks a bit, to figure out the effects.
Many things I expected to be better turned out quite different than I
anticipated.
So far, both Jelmer's form and my 'optimized' form are actually still
slightly larger post-compression than the XML texts. However, this new
layout lets us tweak it as much as we want, to see if we end up
somewhere better.

I'm guessing the loss is mostly the extra bytes for length prefixes.
Though you still see unexpected effects, like the compressor copying the
@work.mysql.com portion of the committer field and only inserting the
"user" string. (Which then breaks some of the other copies I expected,
because we don't insert the whole user string.)

2) Added a "_schema" that provides more rigorous type checking and
decoding for the various attributes. It also adds a 'format' string,
similar to what we have in all of our XML texts.
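This also speaks to Aaron's point about bencode being too forgiving: with
a schema in place, an int where a byte string is required raises instead
of passing silently through. A sketch of the idea, with hypothetical
field names and a hypothetical check function (the real _schema in the
patch differs):

```python
# Hypothetical schema: field name -> (expected type, allow_none).
REVISION_SCHEMA = {
    b"format": (bytes, False),
    b"revision-id": (bytes, False),
    b"committer": (bytes, False),
    b"timestamp": (float, False),
    b"message": (bytes, False),
}

def check_revision_pairs(pairs):
    """Validate decoded (key, value) pairs against the schema.

    This is where bencode's forgiving nature gets tightened: a
    schema violation (e.g. an int revision-id) now raises rather
    than being serialized without complaint.
    """
    for key, value in pairs:
        try:
            expected, allow_none = REVISION_SCHEMA[key]
        except KeyError:
            raise ValueError("unknown revision field %r" % (key,))
        if value is None and allow_none:
            continue
        if not isinstance(value, expected):
            raise TypeError("field %r: expected %s, got %r"
                            % (key, expected.__name__, value))

check_revision_pairs([(b"revision-id", b"rev-1"),
                      (b"timestamp", 1244059552.0)])
try:
    check_revision_pairs([(b"revision-id", 5)])   # int, not bytes
except TypeError:
    print("schema violation caught")
```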

3) I then converted all of bzr.dev to dev6, and then converted that to
dev7. It took 104min to go XML => Dev6, but only 20min to go Dev6 =>
Dev7. We could make that *much* faster by copying across everything but
the Revision texts directly, but for now, it is nice to see that
converting *from* the dev format is, indeed, much faster.

4) To benchmark the time for decoding Revisions, I did:

$ time PYTHONPATH=../bzr/work python -m timeit -s "from bzrlib import branch;
b = branch.Branch.open('bzr-dev7/bzr.dev');
keys = b.repository.revisions.keys();
stream = b.repository.revisions.get_record_stream(keys, 'unordered', True)
texts = [r.get_bytes_as('fulltext') for r in stream]
serializer = b.repository._serializer
read = serializer.read_revision_from_string
" "revs = [read(t) for t in texts]"

Basically, first extract all the raw texts, and then TIMEIT the actual
string => Revision time. This focuses on the serializer. Though honestly
the time to extract the texts is also somewhat important.
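The same pattern can be reproduced with the timeit module directly: do
the expensive extraction once in setup, then time only the decode step.
Here a stand-in decoder replaces read_revision_from_string so the sketch
runs without bzrlib (the real measurement above decodes bencoded
Revision fulltexts instead):

```python
import timeit

# Pretend these are the revision fulltexts already pulled from the
# record stream; extraction happens once, outside the timed statement.
texts = [b"d9:committer3:jame"] * 1000

def read(text):
    # Stand-in for serializer.read_revision_from_string.
    return text.decode("ascii")

t = timeit.Timer("revs = [read(t) for t in texts]",
                 globals={"read": read, "texts": texts})
per_loop = min(t.repeat(repeat=3, number=10)) / 10
print("%.6f sec per loop" % per_loop)
```

Keeping the extraction out of the timed statement is what isolates the
string => Revision cost from the text-extraction cost mentioned above.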

With that in mind,

dev6	1.51 sec per loop
dev7	1.34 sec per loop

I didn't test RIO for this, though I think I would like to. On my simple
tests, it was actually considerably faster than XML or Bencode, which
surprised me.

To test the 'whole stack' I then did:

time bzr log --no-aliases --long -n0 >/dev/null
which gave (best of 3):

dev6	7.285s
dev7	7.660s

So surprisingly, even though the new format supposedly decodes 0.1s
faster, it takes 0.3s longer to do 'log' over everything. My guess is
that the slightly worse compression means the compression complexity is
greater, and thus we are spending slightly more time in the text
extraction portion.

Note that "time bzr log --short -n1" was:
dev6	1.747s
dev7	1.779s

So the difference there is barely perceptible.

In the end, bencode seems 'as good as' XML for decoding performance,
with the main benefit of giving us more control over the bytes-on-the-wire.


-------------- next part --------------
Name: jam_bencode_serializer.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090603/b01f15c3/attachment-0001.diff 
