[RFC/MERGE] Implement DirState parsing in Pyrex
John Arbash Meinel
john at arbash-meinel.com
Tue May 8 00:37:59 BST 2007
Attached is some work that I've done to move some of the core DirState
parsing routines into a Pyrex implementation. It adds a few benchmarks
(which show the advantage of doing things in C), and a few specific
functions.
There are 3 primary functions that I focused on:
1) cmp_by_dirs()
This is actually a new function I wrote. We often need to tell
whether one path sorts before or after another. However, the paths
aren't sorted as raw strings; instead they are sorted by directory
chunks ('a-b' comes after 'a/b', even though it sorts before it as a
plain string).
The fastest pure-python version we have been able to come up with is:
    path1.split('/') < path2.split('/')
However, that has to create a lot of temporary objects (a list, plus a
string for every path section).
The pyrex implementation just iterates over the characters of the
string, and knows how to treat '/' as a special character. And it is
much faster.
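For reference, here is a rough pure-python sketch of the idea (the name
and the exact code are mine for illustration; the real pyrex version
works on the raw character buffers):

    def cmp_by_dirs_sketch(path1, path2):
        # Compare characters directly, but treat '/' as sorting before
        # every other character.  That gives the same ordering as
        # comparing path1.split('/') to path2.split('/'), without ever
        # building the lists.
        for c1, c2 in zip(path1, path2):
            if c1 == c2:
                continue
            if c1 == '/':
                return -1
            if c2 == '/':
                return 1
            if c1 < c2:
                return -1
            return 1
        # One path is a prefix of the other (or they are equal):
        # the shorter path sorts first.
        if len(path1) == len(path2):
            return 0
        if len(path1) < len(path2):
            return -1
        return 1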
This is a new benchmark which measures the time to compare ~330 paths
against each other (>100k comparisons):
test_cmp_by_dirs_c OK 67ms/ 70ms
test_cmp_by_dirs_py OK 489ms/ 492ms
(not quite 10x faster, but getting close).
2) bisect_dirblocks()
We use this function to find specific entries in memory. It gets
called any time we need to do a lookup by path.
The biggest savings here is to switch the comparison function to the
raw C cmp_by_dirs. The old code had some caching to speed things up
when we are bisecting the same blocks over and over again. Now that
cmp_by_dirs doesn't need the intermediate values, we don't need the
cache either. (It is left as a member for API compatibility.)
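The search itself is plain bisection; roughly (a simplified sketch with
made-up names, reusing cmp_by_dirs_sketch from above, not the actual
bzrlib code):

    def bisect_dirblocks_sketch(dirblocks, dirname):
        # dirblocks is a list of (dirname, entry_list) pairs, kept sorted
        # in cmp_by_dirs order.  Return the index where 'dirname' lives,
        # or where it would be inserted.  With a cheap comparison function
        # there is nothing left worth caching between calls.
        lo, hi = 0, len(dirblocks)
        while lo < hi:
            mid = (lo + hi) // 2
            if cmp_by_dirs_sketch(dirblocks[mid][0], dirname) < 0:
                lo = mid + 1
            else:
                hi = mid
        return lo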
This benchmark creates a ~10k entry tree (almost all directories), and
then searches for every path in it, plus a neighboring directory which
isn't present (so ~20k lookups):
test_bisect_dirblock_c OK 48ms/ 3019ms
test_bisect_dirblock_cached_py OK 982ms/ 3867ms
test_bisect_dirblock_py OK 976ms/ 3939ms
Now we are in the ~20x speedup range.
The biggest user of this function is DirState.add(), which is actually
kind of slow at adding 20k entries. (It has to do path normalization,
and look up both the current record and the parent record.) Using this
compiled form makes bisect_* a non-factor in add time; the next big
win is to get rid of the unicode normalization code.
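For context, the normalization in question looks roughly like this (an
illustration only, with a made-up name; the real bzrlib code is more
involved):

    import unicodedata

    def normalized_filename_sketch(path):
        # Each added path gets unicode-normalized so that equivalent
        # spellings map onto the same dirstate entry.  Doing this for 20k
        # paths is a noticeable chunk of DirState.add() time.
        norm = unicodedata.normalize('NFC', unicode(path))
        return norm, norm == path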
3) _read_dirblocks()
I factored out the code from _read_dirblocks_if_needed() into a
helper function _read_dirblocks, which has a python implementation,
and now a pyrex implementation.
The new implementation gets a few speed boosts, primarily by not
creating as many objects. Specifically:
a) Compare the current directory to the previous one, to know if we
are in the same directory block. We can do this at a raw char *
level, without having to create a PyString. And if it matches, we
can just re-use the last PyString. This also has the nice property
that all of the strings in a dirblock are the same object, which
should also save memory (and any future comparison checks).
b) The size and is_executable fields both need to be converted from
a string to something else (an integer and a boolean). We can do the
conversion directly in C and use the final value, rather than
creating an intermediate string and then converting that to an int.
c) Use the PyList_Append() C function rather than foo.append().
foo.append() has to look up the 'append' member, create a tuple to
pass the argument in, and then call PyObject_Call(). PyList_Append()
doesn't have those lookup costs.
d) Iterate over a string, rather than splitting into a list of
strings. str.split('\0') is actually very efficient, and it makes
python code fast if you can use it versus parsing a string
yourself. However, we can do a bit better if we use (a) and (b),
and just keep a pointer which passes through the string object.
This prevents us from creating as many temporary objects, and we
don't create a temporary list at all.
e) Thanks to the consistency of dirblock records, we can handle any
number of trees with a simple for loop. In python, unrolling the
loop by hand had a real performance benefit, as did working on
sub-slices (the old code would do fields[pos:pos+x] so that the
inner function could work on fixed offsets). In C, a plain loop is
cheap, so the code actually gets simpler because it doesn't carry
those optimization hacks. A rough sketch of the record loop follows
below.
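To make the record loop concrete, here is a toy pure-python sketch of
points (d) and (e). The field layout here is invented for illustration
and is NOT the real dirstate on-disk format; the compiled version also
avoids the split() and the temporary list entirely by walking a char
pointer through the buffer:

    def parse_entries_sketch(text, num_trees):
        # Toy layout: each entry is dirname, basename, file_id, then for
        # each tree: minikind, fingerprint, size, is_executable,
        # packed_stat, all '\0'-separated.
        fields = text.split('\0')              # (d) split is fast in python
        per_entry = 3 + 5 * num_trees
        entries = []
        pos = 0
        while pos + per_entry <= len(fields):
            dirname, basename, file_id = fields[pos:pos + 3]
            pos += 3
            trees = []
            for _ in range(num_trees):         # (e) one loop, any tree count
                minikind, fingerprint, size, is_exec, packed_stat = \
                    fields[pos:pos + 5]
                trees.append((minikind, fingerprint,
                              int(size),       # (b) straight to an int
                              is_exec == 'y',
                              packed_stat))
                pos += 5
            entries.append(((dirname, basename, file_id), trees))
        return entries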
What does all this get us? About 2x faster:
test__read_dirblocks_20k_tree_0_parents_c OK 124ms/ 2562ms
test__read_dirblocks_20k_tree_0_parents_py OK 236ms/ 2593ms
test__read_dirblocks_20k_tree_1_parent_c OK 177ms/ 2776ms
test__read_dirblocks_20k_tree_1_parent_py OK 356ms/ 2994ms
test__read_dirblocks_20k_tree_2_parents_c OK 260ms/ 2980ms
test__read_dirblocks_20k_tree_2_parents_py OK 510ms/ 3225ms
On the mozilla tree _read_dirblocks_if_needed:
bzr.dev 10 loops, best of 3: 586 msec per loop
pyrex 10 loops, best of 3: 259 msec per loop
On the Launchpad tree:
bzr.dev 10 loops, best of 3: 53.1 msec per loop
pyrex 10 loops, best of 3: 23.5 msec per loop
At this point, I don't know that it is worth bisecting on disk rather
than just reading into memory. We might go there when we have to, but
for most tree sizes being able to read it in <200ms means we don't
have to really clutter our current code (yet).
The main reason this is an RFC is that it introduces a (potential)
dependency on compiled code. I tried to make sure it was genuinely
worth it before advocating it. But I think our disk format is fairly
stable (and the parser is actually easier to modify in pyrex because of
the lack of hacks).
Originally, I know Robert was thinking of versioning the .c files
along with the .pyx files. However, that would cause a lot of commit
churn, because when I build them the generated .c embeds my local
paths (/home/jameinel/...foo/bar), and when PQM builds them it will
embed different paths.
What I would rather do is change our *release* process so that we
include them in the final tar file.
That would mean that people who are using bzr.dev directly would either:
1) Use the python implementations
2) Need to have pyrex installed (and gcc, etc)
People who have release versions would either:
1) Use the python implementations
2) Need to have python-dev and gcc (etc) installed.
Since a lot of people use "apt-get install bzr", having a compile step
isn't as terrible as it used to be. And as long as we follow the rule
"it will always work on plain python, just be faster if you have a
compiler", I think I'm happy.
Ultimately, I think we want to move pyrex up the stack into
_iter_changes. Some care will be necessary, because pyrex doesn't
support generators. But I think we can write some of the _process_entry
code in pyrex, which should help to speed up 'bzr status', etc.
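The rough shape I have in mind (names made up for illustration) is to
keep the generator itself in python and push the per-entry work into a
plain function that can later be compiled:

    def _iter_changes_sketch(entries):
        # The generator stays in plain python, since pyrex can't produce
        # generators itself...
        for entry in entries:
            result = _process_entry_sketch(entry)
            if result is not None:
                yield result

    def _process_entry_sketch(entry):
        # ...while the per-entry decision logic lives in an ordinary
        # function, which is the part that could move to pyrex.
        return entry  # placeholder for the real comparison logic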
One other possibility is to have a 'bzr.dev-pyrex' branch, which would
let us add things like this on the side, and only merge them once we
have decided they are worthwhile.
Also, for now I've gone with writing the functions such that both the
python code and the pyrex code are tested by ./bzr selftest, rather
than having 'make check' run once without them and once with them.
Oh, a few other things...
I was thinking to put all of the extension code in one place (I picked
bzrlib/compiled/*). Part of the reason is that we could theoretically
have a "bzr --no-compiled" flag. It might also make it easier to have
CompiledFeature in bzrlib/compiled/__init__.py.
I also made the matching tests bzrlib/tests/compiled/*
This means that I use module-level variables to select the
implementation (bzrlib/dirstate.py has a '_read_dirblocks' name which
points to either _read_dirblocks_py or _read_dirblocks_c).
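The selection itself is just the usual optional-import dance, roughly
(a simplified sketch; the module path in the import is a guess for
illustration, not the exact layout in the patch):

    def _read_dirblocks_py(state):
        # pure-python fallback (body omitted here)
        pass

    try:
        # Prefer the compiled version when the extension built cleanly.
        from bzrlib.compiled.dirstate_helpers import _read_dirblocks_c
        _read_dirblocks = _read_dirblocks_c
    except ImportError:
        _read_dirblocks = _read_dirblocks_py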
I know Robert mentioned having the .pyx right next to the .py, so that
the compiled .so would have the same name. And then 'import bzrlib.foo'
would pick between foo.so or foo.py depending on if it had been compiled.
This also gets my feet a bit wet on writing (fast) pyrex code. One
interesting part is that the intermediate C code allows you to see how
your code is being compiled. Sort of like reading the assembly output
for C code, only I don't have to read assembly :).
It has helped in a few cases where I didn't realize the overhead of one
type of function.
Thoughts?
John
=:->
Attachment: dirstate_pyrex.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20070507/78925c1a/attachment-0001.diff