[RFC] Pure C bencode decoder [was: Re: [MERGE] BEncode Revision Serializer]

Andrew Bennetts andrew.bennetts at canonical.com
Wed Aug 19 14:53:36 BST 2009


On Mon, Jun 01, 2009 at 08:32:30PM -0500, John Arbash Meinel wrote:
[...]
> However, even with that change, if I grab the largest revision string I
> can find ('sp1r-monty at donna.mysql.com-20000830194457-58362',) 902585
> (yes, 902kB for a single revision entry), I get:
[...]
> Now, if I take that revision, and then call
> "bdecode(rev.properties['file-info'].encode('utf-8')", then I start to
> see a difference between python and pyrex:
> 100 loops, best of 3: 12.6 msec per loop
> vs
> 10 loops, best of 3: 47.2 msec per loop

FWIW, to keep my C skills vaguely fresh I thought I'd experiment with
writing a pure C bencode decoder in my spare time.  When I bdecode that
'file-info' property from that revision, I get:

 bzrlib.util._bencode_py: 68.6
 bzrlib._bencode_pyx:      6.02
 bencode_c:                4.71

Times are in ms.
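For anyone unfamiliar with the format: bencode represents integers as
i<digits>e, strings as <length>:<bytes>, lists as l<items>e, and dicts as
d<key><value>...e.  A minimal recursive-descent decoder in Python might look
roughly like this (just a sketch to illustrate the format -- it is not
bzrlib's _bencode_py, which differs in detail):

```python
def bdecode(data, pos=0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    c = data[pos]
    if c == 'i':                        # integer: i<digits>e
        end = data.index('e', pos)
        return int(data[pos + 1:end]), end + 1
    if c == 'l':                        # list: l<item><item>...e
        items, pos = [], pos + 1
        while data[pos] != 'e':
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if c == 'd':                        # dict: d<key><value>...e
        result, pos = {}, pos + 1
        while data[pos] != 'e':
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    # otherwise a string: <length>:<bytes>
    colon = data.index(':', pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length
```

The real implementations also validate their input (leading zeroes, negative
string lengths, trailing garbage, etc.), which this sketch skips.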

My code is at <lp:~spiv/+junk/bencode-c>.

For convenience, the timeit invocation I used is:

python -m timeit \
  -s "from bzrlib.bzrdir import BzrDir; bd = BzrDir.open('.'); repo = bd.open_branch().repository; repo.lock_read()" \
  -s "file_info = repo.get_revision('sp1r-monty@donna.mysql.com-20000830194457-58362').properties['file-info'].encode('utf-8')" \
  -s "from bzrlib._bencode_pyx import bdecode" \
  "bdecode(file_info)"

On some other microbenchmarks of my _bencode_pyx vs. bencode_c I get (these
times in μs):

'i0e': 1.42 vs. 0.503
'l'*300 + 'e'*300: 86.8 vs. 69.1
'l' + 'de'*100 + 'e': 21.7 vs. 14.9
'l' + ('99:' + 'x'*99) * 100 + 'e': 31.5 vs. 28.1
'd1:Ai0e1:Bi1e'...'1:zi51ee': 41.4 vs. 31.3
'l' + 'le' * 1000 + 'e': 166 vs. 136
 
So, the improvement varies, but on these quick microbenchmarks it typically
cuts roughly 10-30% off the times.  (Although there's clearly a large win
for the i0e case, it's also the least interesting, because we aren't really
interested in how long it takes to bdecode single small ints.)
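Microbenchmarks like these are easy to reproduce with timeit; here's a
sketch of a harness (the bench helper and the case labels are my own
invention, not from either branch -- pass in whichever bdecode you want to
measure):

```python
import timeit

def bench(bdecode, cases, number=1000):
    """Time bdecode on each input; return mean seconds per call, by label.

    bench is a hypothetical helper, not part of bzrlib; bdecode can be
    e.g. bzrlib._bencode_pyx.bdecode or the C module's bdecode.
    """
    results = {}
    for label, data in cases:
        t = timeit.timeit(lambda: bdecode(data), number=number)
        results[label] = t / number
    return results

cases = [
    ('small int',    'i0e'),
    ('nested lists', 'l' * 300 + 'e' * 300),
    ('empty dicts',  'l' + 'de' * 100 + 'e'),
    ('99-byte strs', 'l' + ('99:' + 'x' * 99) * 100 + 'e'),
]
```

Then compare bench(pyx_bdecode, cases) against bench(c_bdecode, cases).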

Possibly some of the improvement is due to my slightly different
implementation strategy rather than to using C instead of Pyrex.  The code
size is coincidentally about the same!  That surprised me a great deal:

390 bencode_c.c
392 _bencode_pyx.pyx

...although bencode_c has no licence boilerplate yet.  I think C's line
count is helped by needing just “#include <Python.h>”, whereas Pyrex needs
more than one line per C function imported.  Some of my lines are >80
columns, though...

For anyone concerned about installation size, the .so is about a quarter of
the size (24k vs 92k), presumably because it doesn't have the fancy
tracebacks that Pyrex provides.

If I substitute my C code for the Pyrex code the test_bencode tests still
pass.

In practice, the performance improvement this offers bzr is probably
pretty minimal; we don't (yet?) do a lot of bencode decoding.  So I'm not at
all convinced it's worth adding to bzrlib.  But I wrote the code, so I may
as well share it :)

-Andrew.



