[MERGE] DirState pyrex helpers

Mon Jul 16 10:32:16 BST 2007

On 7/14/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Because the pyrex code just does:
>
> size_t myvar;
>
> The point is that when you have (in pyrex):
>
> cdef int myfunc(object param):
>   size_t size_t_param
>
>   size_t_param = param
>
> When converted to C by the pyrex translater (compiler?)
>
> int myfunc(PyObject *param)
> {
>   size_t size_t_param;
>
>   size_t_param = PyLongFromInt(param);
> }
>
> Pyrex isn't a compiler, it just translates code that looks mostly like
> Python into C code, which is then compiled by GCC, etc.

So the

+    ctypedef int size_t

isn't saying that it's an int, but rather that it should be treated
like an int when converting to and from Python objects?  Maybe it'd be
good to add an comment about this, just to save people worrying.
Maybe an unsigned long would be more generally right?

+cdef void* _my_memrchr(void *s, int c, size_t n):

I wonder if this is really faster than just naively walking back
through the buffer looking for the character...

I think all the strings we're dealing with here won't contain \0?  And
Python strings are null terminated, so maybe you could use strrchr,
which unlike memrchr is standard.

+def _read_dirblocks_c(state):
+    """Read in the dirblocks for the given DirState object.
+
+    This is tightly bound to the DirState internal representation. It should be
+    thought of as a member function, which is only separated out so that we can
+    re-write it in pyrex.
+
+    :param state: A DirState object.
+    :return: None
+    """

The docstring could be a bit more clear that the result is that the
dirblocks are in memory in state.

+cdef class Reader:
+    """Maintain the current location, and return fields as you parse them."""

  Maintain the location within the dirstate file....

+    def get_all_fields(self):
+        """Get a list of all fields"""

   Get a list of all fields from one record of the dirstate?

+    cdef object _get_entry(self, int num_trees, void **p_current_dirname,
+                           int *new_block):
   "Get one dirstate entry: (key, (tree_info, ...))"

Because this is a bit harder to read than regular python I'm thinking
it's good to document it a bit more...

+
+    def init(self):
+        """Get the pointer ready"""
+        cdef char *first
+        cdef int size
+        # The first field should be an empty string left over from the Header
+        first = self.get_next(&size)
+        if first[0] != c'\0' and size == 0:
+            raise AssertionError('First character should be null not: %s'
+                                 % (first,))

Here I found myself wondering why this doesn't read the header - might
be good to clarify that it's done by the python code (?) before this
is invoked?

+
+    if not PyList_CheckExact(dirblocks):
+        raise TypeError('you must pass a python list for dirblocks')
+    _lo = lo
+    if not PyString_CheckExact(dirname):
+        raise TypeError('you must pass a string for dirname')

It can be a lot easier to debug these things if they occur for a user
if you print the repr of the object that was actually passed.

+    void *PyTuple_GetItem_void_void "PyTuple_GET_ITEM" (void* tpl, int index)

Why is this not _void_int?

I do wonder if it would be simpler to just split the paths into lists
as we read them in.  I suppose that would require considerably more
allocation, and anyhow would be out of scope for this change.

+
+    # Use 32-bit comparisons for the matching portion of the string.
+    # Almost all CPU's are faster at loading and comparing 32-bit integers,
+    # than they are at 8-bit integers.
+    # TODO: jam 2007-05-07 Do we need to change this so we always start at an
+    #       integer offset in memory? I seem to remember that being done in
+    #       some C libraries for strcmp()

As Jan said, yes, unaligned access can cause a bus error on some
architectures.  I would be inclined to not worry about that for now.
The code to get this right is a bit complex and anyhow for short
components it may be no slower.  If you consider that many paths might
have components of only 6 characters or so then comparing by ints may
not be useful.

Uh, also you can't rely on ints being 4 characters.  (Unless I'm
missing something and they reliably are in pyrex?)

More importantly: for python are we sure we're comparing the utf-8
values, not the interpreted unicode.  I'm pretty sure we are.  But in
utf-8 we will have some characters with the high bit set, and those
will compare the wrong way if you compare them as signed ints and
chars.  I think you definitely need some tests of this on utf-8 paths.

Again, since we shouldn't have nuls in the strings and Python strings
are guaranteed to be terminated I don't think you need to explicitly
pass the length, which should make a lot of the code smaller.

+cdef int _cmp_path_by_dirblock(char *path1, int path1_len,
+                               char *path2, int path2_len):

Probably needs a docstring.

+cdef object _pystr(char *s, int length):

This doesn't seem to be called?

Adding Pyrex makes it more important that we get some automatic cross
platform testing.

I haven't read the benchmarks.  The bisection code looks ok.  But I
think the size, alignment, and signedness issues need to be addressed.

-- 
Martin