[MERGE] faster iter_changes

John Arbash Meinel john at arbash-meinel.com
Tue Sep 23 00:21:01 BST 2008


John Arbash Meinel has voted comment.
Status is now: Semi-approved
Comment:
I think this is going to take a few rounds of review of the different 
pieces. I think I like where it is going, but it isn't ready to land for 
'tweak' (doesn't work on win32). But I don't really want it out of the 
queue via 'resubmit'. So for now I'm just voting 'comment'.


=== modified file 'bzrlib/_dirstate_helpers_c.pyx'

On win32 we don't have:

+cdef extern from "arpa/inet.h":
+    unsigned long htonl(unsigned long)
+
+

or

+cdef extern from 'sys/stat.h':
+    int S_ISDIR(int mode)
+    int S_ISREG(int mode)
+    int S_ISLNK(int mode)
+    int S_IXUSR

This may depend on the compiler tools. I'm currently using mingw32, and 
it
seems to be fine with "S_ISDIR and S_ISREG" but fails trying to find 
S_ISLNK.
More exploration is needed to know what VS 7/8/9 would have available. 
(They
tend to put POSIX stuff into _foo, like _open and _stat sort of things.)

Perhaps we just need a compatibility header, that on win32 S_ISLINK is 
defined
as: "#define S_ISLNK(mode) (0)"

Though I think at one point Alexander had done some interesting trickery 
to get
"symlink" support on win32. By setting certain flags, and then using 
more
invasive checks when those flags were set. So I think it would be good 
to be a
little bit flexible here, but we can put the onus of supporting that 
onto
someone who is actually going to be implementing symlink support on 
win32,
rather than someone who doesn't have a good win32 platform for testing. 
:)


And I don't think you meant to keep:
+cdef extern from "stdio.h":
+    void printf(char *format, ...)
+


+from bzrlib import cache_utf8, errors, osutils
+from bzrlib.dirstate import DirState, pack_stat
+from bzrlib.osutils import pathjoin, splitpath

^- you don't actually use 'pack_stat' here. Also, I think it would be 
better to
do "pathjoin = osutils.pathjoin" at the appropriate times, rather than
importing them again here.


+#cdef object _encode
+_encode = binascii.b2a_base64
+

^- I don't think cdef object gets you anything here, but commenting it 
out gets
you even less.

...

+
+from struct import pack
+cdef _pack_stat(stat_value):


^- This seems like a place, that if you could determine you had
_walkdirs_win32._Win32Stat or _readdir_pyx._Stat you could access the 
values as
C types, rather than converting into python and back down to C.

I would also be *really* curious to poke a sprintf("%8X.%8X.%8X") in 
there, instead of
the base64.encode() and see what we get.


...
+cdef _update_entry(self, entry, abspath, stat_value):
+    """Update the entry based on what is actually on disk.
+
+    :param entry: This is the dirblock entry for the file in question.
+    :param abspath: The path on disk for this file.
+    :param stat_value: (optional) if we already have done a stat on the
+        file, re-use it.
+
+    :return: The sha1 hexdigest of the file (40 bytes) or link target 
of a
+            symlink.
+    """

^- 'stat_value' isn't optional anymore. We moved that code up the stack, 
and
just forgot to update the docstring. We need to update the "def 
update_entry"
docstring as well.


+    # TODO - require pyrex 0.8, then use a pyd file to define access to 
the _st
+    # mode of the compiled stat objects.

^- Do you really mean pyrex 0.8? That seems like a perfectly reasonable
request. If you mean 0.9.8 that might be more difficult.

+    cdef int minikind, saved_minikind
+    cdef void * details
+    # pyrex 0.9.7 would allow cdef list details_list, and direct access 
rather
+    # than PyList_GetItem_void_void below

I didn't realize this. Is that pyrex 0.9.7, or cython?

Regardless, 'details' is a Tuple, not a List object.

Also "cdef _update_entry(self, ...)" since _update_entry is no longer a 
member
of DirState, I think it might be more obvious to use 
"_update_entry(state,
...)".

+
+cdef char _minikind_from_string(object string):
+    """Convert a python string to a char."""
+    return PyString_AsString(string)[0]
+

^- This is a little unsafe in the case that 'string' is an empty string.
Perhaps a quick "if PyString_Size(string) > 0" ?

...

+cdef object _minikind_to_kind(char minikind):
+    """Create a string kind for minikind."""
+    cdef char _minikind[1]
+    if minikind == c'f':
+        return _kind_file
+    elif minikind == c'd':
+        return _kind_directory
+    elif minikind == c'a':
+        return _kind_absent
+    elif minikind == c'r':
+        return _kind_relocated
+    elif minikind == c'l':
+        return _kind_symlink
+    elif minikind == c't':
+        return _kind_tree_reference
+    _minikind[0] = minikind
+    raise KeyError(PyString_FromStringAndSize(_minikind, 1))

^- Why do you use the "_minikind[0]" workaround, rather than &minikind?


I'm skipping the big ProcessEntryC code block for now, I'll try to get 
back to
it later.

=== modified file 'bzrlib/delta.py'
--- bzrlib/delta.py     2008-04-24 07:22:53 +0000
+++ bzrlib/delta.py     2008-09-15 03:23:53 +0000
@@ -241,7 +241,7 @@
                                    (executable[0] != executable[1])))
          elif kind[0] != kind[1]:
              delta.kind_changed.append((path[1], file_id, kind[0], 
kind[1]))
-        elif content_change is True or executable[0] != executable[1]:
+        elif content_change or executable[0] != executable[1]:
              delta.modified.append((path[1], file_id, kind[1],
                                     content_change,
                                     (executable[0] != executable[1])))


^- any comment on why you had to do this?

...

=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py  2008-07-30 08:55:10 +0000
+++ bzrlib/dirstate.py  2008-09-18 06:36:28 +0000
@@ -211,6 +211,7 @@
  import time
  import zlib

+import bzrlib
  from bzrlib import (

^- As near as I can tell, you do the import here, so that later you can 
do:

     def iter_changes(self):
         """Iterate over the changes."""
         utf8_decode = cache_utf8._utf8_decode
==>     cmp_by_dirs = bzrlib.dirstate.cmp_by_dirs
         _process_entry = self._process_entry

It seems like it would be a lot cleaner to just use a different variable 
name.
Like "_cmp_by_dirs = cmp_by_dirs".


...
+from bzrlib.osutils import pathjoin, splitpath

Similarly, I find it better to use "pathjoin = osutils.pathjoin" when 
you want
it as a simple function rather than a module lookup.

...

@@ -2774,6 +2738,709 @@
              raise errors.ObjectNotLocked(self)


+def py_update_entry(self, entry, abspath, stat_value,
+                 _stat_to_minikind=DirState._stat_to_minikind,
+                 _pack_stat=pack_stat):

^- Shouldn't this be moved into _dirstate_helpers_py.py ?

Similarly for:
+class ProcessEntryPython(object):


=== modified file 'bzrlib/tests/test__dirstate_helpers.py'

...

+
+class TestUpdateEntry(test_dirstate.TestCaseWithDirState):


I'm assuming this is mostly just a Cut & Paste from the test_dirstate 
tests?


I'll also note that you don't seem to have direct tests for the other 
helpers
like pack_stat and ProcessEntry[C/Py]. I don't know that we have to, but 
I
usually feel better having implementation tests to make sure that the
Pyrex code performs identically to the Python code.

I suppose you are testing it indirectly with iter_changes and:

+        if optimiser is bzrlib.workingtree_4.InterDirStateTree:
+            # Its a little ugly to be conditional here, but less so 
than having
+            # the optimiser listed twice.
+            # Add once, compiled version
+            test_intertree_permutations.append(
+                (optimiser.__name__ + "(C)",
+                 optimiser,
+                 optimiser._matching_from_tree_format,
+                 optimiser._matching_to_tree_format,
+                 optimiser.make_source_parent_tree_compiled_dirstate))
+            # python version
+            test_intertree_permutations.append(
+                (optimiser.__name__ + "(PY)",
+                 optimiser,
+                 optimiser._matching_from_tree_format,
+                 optimiser._matching_to_tree_format,
+                 optimiser.make_source_parent_tree_python_dirstate)


As mentioned, I haven't gone over the ProcessEntryX classes yet, but I 
figured
I could at least start the discussion.

Overall, I think the basic refactoring you've done is good. We still 
need to
figure out how to get everything to compile on all platforms, etc.


For details, see: 
http://bundlebuggy.aaronbentley.com/project/bzr/request/%3C1221720286.9592.27.camel%40lifeless-64%3E
Project: Bazaar



More information about the bazaar mailing list