[MERGE] BEncode Revision Serializer

John Arbash Meinel john at arbash-meinel.com
Thu Jun 4 21:28:38 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> John Arbash Meinel wrote:
> 
>>> $ time PYTHONPATH=../bzr/work TIMEIT -s "from bzrlib import branch;
>>> b = branch.Branch.open('bzr-dev7/bzr.dev');
>>> b.lock_read();
>>> keys = b.repository.revisions.keys();
>>> stream = b.repository.revisions.get_record_stream(keys, 'unordered', True)
>>> texts = [r.get_bytes_as('fulltext') for r in stream]
>>> serializer = b.repository._serializer
>>> read = serializer.read_revision_from_string
>>> b.unlock()
>>> " "revs = [read(t) for t in texts]"
>>> Basically, first extract all the raw texts, and then TIMEIT the actual
>>> string => Revision time. This focuses on the serializer. Though honestly
>>> the time to extract the texts is also somewhat important.
> 
> So, I spent some time tweaking the Pyrex code a bit. The main changes
> are to use PyList_Append in _decode_list, change self.decode_object into
> a cdef function, and use a macro for _update_tail. Though the last one
> doesn't have a very large impact.
> 
> If you have a 'def foo()' function in pyrex, it has to go through the
> python GetAttr code, because a child class could override it. With a
> 'cdef' function, only other Pyrex code can override it, so it can use a
> struct pointer.
> 
> It also simplifies the Python validation of properties a bit. With this
> change, I now have
> 
> dev6	  1.540 sec
> dev7-rio  0.969 sec
> dev7-benc 1.010 sec
> 
> However, if I remove the couple of type checking steps I get
> dev7-benc 0.934 sec

As a sort of 'bare metal' comparison, I implemented a deserializer for
just Revision objects in bencoded form.
  lp:///~jameinel/bzr/revision_bencode_decoder

dev7-bare  0.525 sec

...
> 
> dev6	  7.379 sec
> dev7-rio  7.690 sec
> dev7-benc 7.379 sec

dev7-bare 6.583 sec

It isn't really tested well, so I'm not proposing it for merge.

What is extra interesting is the 'extra' impact of implementing the
deserializer in Pyrex. If you notice, the pure decode time is 0.5s
versus 1.0s. But the time saved for 'bzr log' is 0.8s.

My guess is that this is memory fragmentation/garbage collection
improvements. TIMEIT is known to disable garbage collection while it
runs. I'm guessing that not allocation a buffer for every 'key' of all
of the '3:key5:value' pairs is probably creating knock-on effects.
(fewer runs of the garbage collector, because of fewer mallocs?), though
PyString isn't a GC object, the list objects used for bdecode *would*
be, though they would also be thrown away rather quickly...

Anyway, it is parsing 11.3MiB of revision text in 0.5s. Which seems
decent enough. Though if we had a truly custom format we could use
fixed-width prefixes, and then avoid all of the strtol calls.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkooLnYACgkQJdeBCYSNAAPTCgCfbK3JcIiCQySNV/FD5nFR8SDN
NIoAn0BbhIm6FeOB352POcOKu8ukeCGr
=WM/B
-----END PGP SIGNATURE-----



More information about the bazaar mailing list