[RFC/MERGE] Implement DirState parsing in Pyrex

John Arbash Meinel john at arbash-meinel.com
Tue May 8 00:37:59 BST 2007


Attached is some work that I've done to move some of the core DirState
parsing routines into a Pyrex implementation. It adds a few benchmarks
(which show the advantage of doing things in C), and a few specific
functions.

There are 3 primary functions that I focused on:

1) cmp_by_dirs()

   This is actually a new function I wrote. We often need to tell
   whether one path sorts before or after another. However, paths
   aren't sorted by raw string comparison; they are sorted by
   directory chunks ('a-b' comes after 'a/b').

   The fastest we have been able to make a python function was
     path1.split('/') < path2.split('/')
   However, that has to create a lot of temporary objects (a list, and a
   string for every path section).

   The pyrex implementation just iterates over the characters of the
   string and knows to treat '/' as a special character, which is much
   faster. (A rough sketch follows at the end of this item.)

   This is a new benchmark which measures the time to compare ~330
   paths against each other (>100k comparisons):

   test_cmp_by_dirs_c                           OK       67ms/      70ms
   test_cmp_by_dirs_py                          OK      489ms/     492ms

   (about 7x faster; not quite 10x, but getting close).
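
   To make that concrete, here is a rough sketch of the
   character-walking comparison in pyrex. This is illustrative, not
   the code from the patch (the names and the wrapper details are
   mine):

     cdef extern from "Python.h":
         char *PyString_AsString(object p)
         int PyString_Size(object p)

     cdef int _cmp_by_dirs(char *path1, int size1,
                           char *path2, int size2):
         # Compare as if both paths had been split('/'): a '/' ends a
         # directory chunk, so it sorts before any other byte.
         cdef char *cur1
         cdef char *cur2
         cdef char *end1
         cdef char *end2
         cur1 = path1
         cur2 = path2
         end1 = path1 + size1
         end2 = path2 + size2
         while cur1 < end1 and cur2 < end2:
             if cur1[0] == cur2[0]:
                 cur1 = cur1 + 1
                 cur2 = cur2 + 1
             elif cur1[0] == c'/':
                 # path1's chunk is a strict prefix of path2's
                 return -1
             elif cur2[0] == c'/':
                 # path2's chunk is a strict prefix of path1's
                 return 1
             elif cur1[0] < cur2[0]:
                 return -1
             else:
                 return 1
         # One path ran out, so it is a prefix of the other;
         # the shorter path sorts first.
         if size1 < size2:
             return -1
         elif size1 > size2:
             return 1
         return 0

     def cmp_by_dirs(path1, path2):
         return _cmp_by_dirs(PyString_AsString(path1), PyString_Size(path1),
                             PyString_AsString(path2), PyString_Size(path2))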


2) bisect_dirblock()

   We use this function to find specific entries in memory. It gets
   called any time we need to do a lookup by path.
   The biggest saving here comes from switching the comparison
   function to the raw C cmp_by_dirs. The old code cached the split
   paths to speed things up when we are bisecting the same blocks over
   and over again. Now that cmp_by_dirs doesn't need those
   intermediate values, we don't need the cache either. (It is left as
   a member for API compatibility.) A sketch of the python form, cache
   and all, follows at the end of this item.

   This benchmark creates a ~10k entry tree (almost all directories),
   then searches for every path in it plus a neighboring directory
   which isn't present (so 20k lookups):

   test_bisect_dirblock_c                       OK       48ms/   3019ms
   test_bisect_dirblock_cached_py               OK      982ms/   3867ms
   test_bisect_dirblock_py                      OK      976ms/   3939ms

   Now we are in the ~20x speedup range.

   The biggest user of this function is DirState.add(), which is
   actually kind of slow at adding 20k entries (it has to do path
   normalization, and look up both the current record and the parent
   record). Using the compiled form makes bisect_* a non-factor in add
   time; the next big win is to get rid of the unicode normalization
   code.
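
   For reference, the python implementation is essentially a
   hand-rolled bisect with the split comparison (and the cache)
   inlined. This is a simplified sketch, not the exact code from
   dirstate.py:

     def bisect_dirblock_py(dirblocks, dirname, lo=0, hi=None, cache=None):
         """Return the index in dirblocks where dirname's block belongs.

         dirblocks is a list of (dirname, entries) tuples, sorted by
         directory chunks rather than by raw string comparison.
         """
         if hi is None:
             hi = len(dirblocks)
         if cache is None:
             cache = {}
         dirname_split = dirname.split('/')
         while lo < hi:
             mid = (lo + hi) // 2
             cur = dirblocks[mid][0]
             try:
                 cur_split = cache[cur]
             except KeyError:
                 cur_split = cache[cur] = cur.split('/')
             if cur_split < dirname_split:
                 lo = mid + 1
             else:
                 hi = mid
         return lo

   The compiled version has the same shape, but calls the C
   cmp_by_dirs directly and drops the cache.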

3) _read_dirblocks()

   I factored the code out of _read_dirblocks_if_needed() into a
   helper function, _read_dirblocks, which now has both a python
   implementation and a pyrex implementation.

   The new implementation gets a few speed boosts, primarily by not
   creating as many objects. Specifically (a combined sketch of (a)
   through (d) follows after this list):

   a) Compare the current directory to the previous one, to know if we
      are in the same directory block. We can do this at the raw
      char * level, without having to create a PyString. And if it
      matches, we can just re-use the last PyString. This also has the
      nice property that all of the dirname strings in a dirblock are
      the same object, which should save memory and turns any future
      comparisons into cheap identity checks.

   b) Size and is_executable both need to be converted from raw bytes
      to something else (an integer and a boolean). We can do the
      conversion in C and use the final value directly, rather than
      creating an intermediate PyString and then converting that to an
      int.

   c) Use PyList_Append() functions rather than foo.append().
      foo.append() has to look up the 'append' member, create a tuple to
      pass it, and then call PyObject_Call(). PyList_Append doesn't have
      those lookup costs.

   d) Iterate over the string directly, rather than splitting it into
      a list of strings. str.split('\0') is actually very efficient,
      and using it makes python code fast compared to parsing the
      string yourself. However, we can do a bit better by combining
      (a) and (b) with a plain pointer that walks through the string's
      buffer. This avoids creating most of the temporary objects, and
      we don't create a temporary list at all.

   e) Thanks to the consistency of dirblock records, we can handle any
      number of parent trees with a simple for loop. In python,
      unrolling the loop by hand had a real performance benefit, as
      did taking sub-slices (the old code did fields[pos:pos+x] so
      that the inner function could work on fixed offsets). In C the
      plain loop costs nothing.

      This actually makes the code simpler, because it doesn't have as
      many optimization hacks in it.
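
   Here is a compilable sketch of (a) through (d) working together.
   The two-field record layout is invented for the example; the real
   parser walks the actual dirstate fields and builds the full entry
   tuples:

     cdef extern from "Python.h":
         char *PyString_AS_STRING(object s)
         int PyString_GET_SIZE(object s)
         object PyString_FromStringAndSize(char *s, int size)
         int PyList_Append(object lst, object item) except -1

     cdef extern from "string.h":
         int strncmp(char *s1, char *s2, int n)

     cdef extern from "stdlib.h":
         long atol(char *nptr)

     cdef char *_to_null(char *cur, char *end):
         # (d) walk a raw pointer to the next '\0' separator; no
         # str.split, no temporary field list.
         while cur < end and cur[0] != c'\0':
             cur = cur + 1
         return cur

     def parse_name_size_records(text):
         """Parse alternating name/size fields from a '\\0'-separated blob."""
         cdef char *cur
         cdef char *end
         cdef char *next
         cdef int last_len
         cur = PyString_AS_STRING(text)
         end = cur + PyString_GET_SIZE(text)
         result = []
         last_name = None
         last_len = 0
         while cur < end:
             next = _to_null(cur, end)
             if (last_name is not None and next - cur == last_len
                     and strncmp(cur, PyString_AS_STRING(last_name),
                                 last_len) == 0):
                 # (a) same bytes as the previous name: re-use the object
                 name = last_name
             else:
                 name = PyString_FromStringAndSize(cur, next - cur)
                 last_name = name
                 last_len = next - cur
             cur = next + 1
             next = _to_null(cur, end)
             # (b) atol() converts straight from the buffer (the field
             # is '\0'-terminated in place), no intermediate PyString.
             # (c) PyList_Append skips the .append lookup and the
             # argument-tuple packing of a normal method call.
             PyList_Append(result, (name, atol(cur)))
             cur = next + 1
         return result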

  What does all this get us? About 2x faster:

  test__read_dirblocks_20k_tree_0_parents_c    OK      124ms/    2562ms
  test__read_dirblocks_20k_tree_0_parents_py   OK      236ms/    2593ms
  test__read_dirblocks_20k_tree_1_parent_c     OK      177ms/    2776ms
  test__read_dirblocks_20k_tree_1_parent_py    OK      356ms/    2994ms
  test__read_dirblocks_20k_tree_2_parents_c    OK      260ms/    2980ms
  test__read_dirblocks_20k_tree_2_parents_py   OK      510ms/    3225ms

  On the mozilla tree, _read_dirblocks_if_needed:
        bzr.dev 10 loops, best of 3: 586 msec per loop
        pyrex   10 loops, best of 3: 259 msec per loop

  On the Launchpad tree:
        bzr.dev 10 loops, best of 3: 53.1 msec per loop
        pyrex   10 loops, best of 3: 23.5 msec per loop

  At this point, I don't know that it is worth bisecting on disk rather
  than just reading into memory. We might go there when we have to, but
  for most tree sizes being able to read it in <200ms means we don't
  have to really clutter our current code (yet).


The main reason this is RFC is because it is introducing a (potential)
dependency on compiled code. I tried to make sure it was genuinely worth
it before I advocated it. But I think our disk format is fairly stable
(and the parser is actually easier to modify in pyrex because of the
lack of hacks).

Originally, I know Robert was thinking to version the .c files along
with the .pyx files. However, that would cause a lot of commit churn:
when I build them, they embed my local paths
(/home/jameinel/...foo/bar), and when the PQM builds them, it will
embed different paths.

What I would rather do is change our *release* process so that we
include the generated .c files in the final tar file.

That would mean that people who are using bzr.dev directly would either:

1) Use the python implementations
2) Need to have pyrex installed (and gcc, etc)

People who have release versions would either:

1) Use the python implementations
2) Need to have python-dev and gcc (etc) installed.

Since a lot of people use "apt-get install bzr", having a compile step
isn't as terrible as it used to be. And as long as we follow the rule
that "it will always work on plain python, just be faster if you have
a compiler", I think I'm happy.


Ultimately, I think we want to move pyrex up the stack into
_iter_changes. Some care will be necessary, because pyrex doesn't
support generators. But I think we can write some of the _process_entry
code in pyrex, which should help to speed up 'bzr status', etc.


One other possibility is to have a 'bzr.dev-pyrex' branch, which would
let us add things like this on the side, and only merge them once we
have decided they are worthwhile.

Also, for now I've gone with writing the functions such that both the
python code and the pyrex code are tested by ./bzr selftest (see the
sketch below), rather than having 'make check' run once without them
and once with them.
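
Something like this is what I have in mind. The module and function
names here are illustrative rather than the exact ones from the patch,
and a real version would skip (rather than fail) when the extension
isn't built, via the CompiledFeature mentioned below:

  from bzrlib import tests


  class TestCmpByDirs(tests.TestCase):
      """Exercise the python implementation."""

      def get_cmp_by_dirs(self):
          # name follows the _py/_c convention described below
          from bzrlib.dirstate import cmp_by_dirs_py
          return cmp_by_dirs_py

      def test_chunk_order_beats_raw_order(self):
          cmp_by_dirs = self.get_cmp_by_dirs()
          # 'a/b' sorts before 'a-b' by directory chunks, even though
          # the raw strings compare the other way around.
          self.assertTrue(cmp_by_dirs('a/b', 'a-b') < 0)
          self.assertTrue(cmp_by_dirs('a-b', 'a/b') > 0)
          self.assertEqual(0, cmp_by_dirs('a/b', 'a/b'))


  class TestCmpByDirsC(TestCmpByDirs):
      """Run the same assertions against the compiled implementation."""

      def get_cmp_by_dirs(self):
          # hypothetical compiled module path under bzrlib/compiled/
          from bzrlib.compiled.dirstate_helpers import cmp_by_dirs_c
          return cmp_by_dirs_c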


Oh, a few other things...

I was thinking to put all of the extension code in one place (I picked
bzrlib/compiled/*). Part of the reason is that we could theoretically
have a "bzr --no-compiled" flag. It might also make it easier to have
CompiledFeature in bzrlib/compiled/__init__.py.

I also put the matching tests in bzrlib/tests/compiled/*.

This means that I use local variables to select the implementation:
bzrlib/dirstate.py has a '_read_dirblocks' function which points to
either _read_dirblocks_py or _read_dirblocks_c (see the sketch below).
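
Concretely, the selection looks something like this (the compiled
module's name is illustrative, whatever we settle on under
bzrlib/compiled/; the try/except fallback is the important part):

  # In bzrlib/dirstate.py, after _read_dirblocks_py is defined.
  try:
      from bzrlib.compiled.dirstate_helpers import _read_dirblocks_c
      _read_dirblocks = _read_dirblocks_c
  except ImportError:
      _read_dirblocks = _read_dirblocks_py

Callers only ever touch _read_dirblocks, so they never need to know
which implementation is live.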

I know Robert mentioned having the .pyx right next to the .py, so that
the compiled .so would have the same name, and then 'import bzrlib.foo'
would pick between foo.so and foo.py depending on whether it had been
compiled.


This also gets my feet a bit wet writing (fast) pyrex code. One
interesting part is that the intermediate C code lets you see how your
code is being compiled. It's sort of like reading the assembly output
for C code, only I don't have to read assembly :). It has helped in a
few cases where I hadn't realized the overhead of a particular kind of
call.

Thoughts?

John
=:->


-------------- next part --------------
Attachment: dirstate_pyrex.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20070507/78925c1a/attachment-0001.diff

