bzr-sax actually uses xml.sax interface

John A Meinel john at arbash-meinel.com
Fri Oct 21 06:47:00 BST 2005


I just updated my bzr-sax branch again. Available at:
http://bzr.arbash-meinel.com/branches/bzr/sax/

The latest update is to actually switch from using cElementTree, to
using python's native xml.sax interface, which defines a set of
callbacks, etc.

Basically I created a class which handles each element type based on a
dictionary to function mapping, and I install and remove different
elements on the fly. That way we only have to parse valid xml.

Also, since we switched over to using a format="5" tag, it would be
pretty simple to install a different set of handlers, in case we change
the format (and keep the same basic structure of a <revision> tag, and
an <inventory> tag).

Unfortunately, the results are mixed. On one hand, we no longer depend
on ElementTree or cElementTree. On the other hand, it seems like doing
the callbacks into python code is what is so expensive.

These are the --profile's that I got, first for bzr.dev with cElementTree:
 $ python -O ./bzr --profile revstore2weave --regenerate -r 200
          58447 function calls (57250 primitive calls) in 0.772 CPU
seconds

    Ordered by: cumulative time
    List reduced from 180 to 20 due to restriction <20>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.772    0.772 commands.py:192(run_argv)
         1    0.004    0.004    0.771    0.771 __init__.py:16(run)
         1    0.008    0.008    0.759    0.759 store2weave.py:50(convert)
       400    0.011    0.000    0.599    0.001 branch.py:713(get_revision)
       401    0.009    0.000    0.543    0.001 branch.py:73(decorated)
         1    0.002    0.002    0.364    0.364
revision.py:172(revision_graph)
       200    0.009    0.000    0.306    0.002 store2weave.py:17(get_text)
       401    0.021    0.000    0.231    0.001 branch.py:351(lock_read)
       400    0.003    0.000    0.218    0.001
branch.py:696(get_revision_xml_file)
       400    0.006    0.000    0.215    0.001 __init__.py:189(get)
       400    0.004    0.000    0.188    0.000 text.py:45(_get)
       402    0.176    0.000    0.184    0.000 local.py:80(get)
       804    0.012    0.000    0.123    0.000 __init__.py:906(debug)
       804    0.012    0.000    0.097    0.000 __init__.py:1026(_log)
       401    0.016    0.000    0.084    0.000 branch.py:366(unlock)
       401    0.017    0.000    0.077    0.000 local.py:214(lock_read)
       200    0.002    0.000    0.060    0.000 xml.py:57(write_revision)
       200    0.003    0.000    0.058    0.000 xml.py:69(_write_element)
       200    0.003    0.000    0.054    0.000 ElementTree.py:652(write)
   798/200    0.018    0.000    0.051    0.000 ElementTree.py:662(_write)

And now bzr.dev using ElementTree (the python version)
 $ python -O ./bzr --profile revstore2weave --regenerate -r 200
          125453 function calls (124256 primitive calls) in 1.148 CPU
seconds

    Ordered by: cumulative time
    List reduced from 225 to 20 due to restriction <20>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    1.148    1.148 commands.py:192(run_argv)
         1    0.003    0.003    1.147    1.147 __init__.py:16(run)
         1    0.009    0.009    1.137    1.137 store2weave.py:50(convert)
       400    0.008    0.000    0.921    0.002 branch.py:713(get_revision)
       200    0.011    0.000    0.605    0.003 store2weave.py:17(get_text)
       400    0.008    0.000    0.568    0.001 xml.py:63(read_revision)
       400    0.008    0.000    0.460    0.001 xml.py:73(_read_element)
       400    0.015    0.000    0.451    0.001 ElementTree.py:574(parse)
         1    0.001    0.001    0.448    0.448
revision.py:172(revision_graph)
       400    0.063    0.000    0.393    0.001 ElementTree.py:1241(feed)
       401    0.008    0.000    0.350    0.001 branch.py:73(decorated)
      1596    0.008    0.000    0.258    0.000 pyexpat.c:582(StartElement)
      1596    0.045    0.000    0.250    0.000
ElementTree.py:1172(_start_list)
       401    0.017    0.000    0.201    0.001 branch.py:351(lock_read)
      5974    0.111    0.000    0.160    0.000 ElementTree.py:1153(_fixname)
       804    0.017    0.000    0.122    0.000 __init__.py:906(debug)
       200    0.004    0.000    0.119    0.001 xml.py:57(write_revision)
       400    0.056    0.000    0.100    0.000 xml5.py:169(_unpack_revision)
       804    0.009    0.000    0.090    0.000 __init__.py:1026(_log)
       200    0.003    0.000    0.079    0.000 xml.py:69(_write_element)

And then for bzr-sax:

 $ python -O ../bzr-sax/bzr --profile revstore2weave --regenerate -r 200
          79584 function calls (79180 primitive calls) in 1.192 CPU
seconds

    Ordered by: cumulative time
    List reduced from 228 to 20 due to restriction <20>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.001    0.001    1.192    1.192 commands.py:192(run_argv)
         1    0.027    0.027    1.190    1.190 __init__.py:16(run)
         1    0.011    0.011    1.157    1.157 store2weave.py:50(convert)
       400    0.008    0.000    1.004    0.003 branch.py:713(get_revision)
         1    0.004    0.004    0.595    0.595
revision.py:172(revision_graph)
       400    0.012    0.000    0.553    0.001 xml5.py:354(read_revision)
       400    0.007    0.000    0.525    0.001 __init__.py:29(parse)
       200    0.015    0.000    0.458    0.002 store2weave.py:17(get_text)
       401    0.013    0.000    0.447    0.001 branch.py:73(decorated)
       400    0.011    0.000    0.436    0.001 expatreader.py:100(parse)
       400    0.032    0.000    0.387    0.001 xmlreader.py:115(parse)
       800    0.082    0.000    0.324    0.000 expatreader.py:196(feed)
       401    0.021    0.000    0.236    0.001 branch.py:351(lock_read)
      1596    0.014    0.000    0.156    0.000 pyexpat.c:582(StartElement)
       804    0.012    0.000    0.143    0.000 __init__.py:906(debug)
      1596    0.023    0.000    0.142    0.000
expatreader.py:299(start_element)
      1596    0.020    0.000    0.115    0.000 xml5.py:43(startElement)
       804    0.012    0.000    0.114    0.000 __init__.py:1026(_log)
       401    0.023    0.000    0.102    0.000 branch.py:366(unlock)
       400    0.004    0.000    0.096    0.000
branch.py:696(get_revision_xml_file)


So the sax interface might be *slightly* faster than the ElementTree
version, but cElementTree really is leaps and bounds faster.

Now, I have 1 valid defense. That if you go for more than 200 revisions,
the time it takes to read and write the xml is dwarfed by the time it
takes to create the weave file.

Using the same code bases, I developed this plot:
http://bzr.arbash-meinel.com/other/timing-2/TimeRevstore2Weave.png

In it, you can see that the "real sax" is negligably slower than the
iterparse (using cElementTree) version. While bzr.dev is much slower.
This is mostly because I've reformatted the revision xml to make it
create better weaves (as my other posts indicated).

So I would have to say our #1 priority is to get the Weave objects faster.

Is there interest in having some compiled C extensions, for handling the
performance critical portions? Python distutils has decent support for
that sort of thing, it means people would need to do "python setup
install" before using bzr, rather than just having all python code. But
it seems like we have a couple of very obvious performance areas. Though
perhaps the append-only weave would fix some of it, I'm not convinced
that it would.
Mostly because for a lot of texts, (especially inventory) you are pretty
likely to have lines that cross lots of revisions. So when you extract
them, you are still going to have to go way back in the file, and
rebuild the current text from a lot of previous chunks.

Anyway, basically more performance testing, and another branch that
people can play around with.

It would be nice if other people could run some of this stuff to verify
my results.

John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 253 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051021/93e19259/attachment.pgp 


More information about the bazaar mailing list