[MERGE] DirState pyrex helpers

Wed Jul 18 21:34:05 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 7/14/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> Because the pyrex code just does:
>>
>> size_t myvar;
>>
>> The point is that when you have (in pyrex):
>>
>> cdef int myfunc(object param):
>>   size_t size_t_param
>>
>>   size_t_param = param
>>
>> When converted to C by the pyrex translater (compiler?)
>>
>> int myfunc(PyObject *param)
>> {
>>   size_t size_t_param;
>>
>>   size_t_param = PyLongFromInt(param);
>> }
>>
>> Pyrex isn't a compiler, it just translates code that looks mostly like
>> Python into C code, which is then compiled by GCC, etc.
> 
> So the
> 
> +    ctypedef int size_t
> 
> isn't saying that it's an int, but rather that it should be treated
> like an int when converting to and from Python objects?  Maybe it'd be
> good to add an comment about this, just to save people worrying.
> Maybe an unsigned long would be more generally right?

I changed it to unsigned. I don't have a specific concern between unsigned an
unsigned long if you prefer.

I've tried to track down an "official" definition, but GNU seems to have it as
part of the compiler. I see stuff like:

#ifndef _SIZE_T
#define _SIZE_T
typdef __darwin_size_t size_t
#endif _SIZE_T

And on Linux I see:
#define __need_size_t
#define __need_NULL
#include <stddef.h>

But the only "stddef.h" I can find is linux/stddef.h.
Which defines only NULL and has no 'size_t' reference.

So while I'm pretty sure it is unsigned, and probably unsigned long, I haven't
been able to actually find that definition anywhere.

I did also find:
./glob.h:31:typedef __SIZE_TYPE__ size_t;
./c++/4.1.2/ext/stdio_filebuf.h:61:      typedef std::size_t  size_t;

neither of which help.

dcmtk defines it as:
./dcmtk/config/cfunix.h:337:typedef unsigned size_t;

But I don't think that is really a 'canonical' source for size definitions.

> 
> +cdef void* _my_memrchr(void *s, int c, size_t n):
> 
> I wonder if this is really faster than just naively walking back
> through the buffer looking for the character...

I went ahead and switched to that.

> 
> I think all the strings we're dealing with here won't contain \0?  And
> Python strings are null terminated, so maybe you could use strrchr,
> which unlike memrchr is standard.

In this particular case, I probably could. Though strrchr doesn't have a length
parameter, so it has to search the entire string for the last occurrence. While
the naive version can read only part of it (in the common case that there is a
slash).

I was also concerned about parsing through a string looking in a substring. I
don't seem to be doing that here, but a lot of the advantages for parsing come
from being able to work on a subset of the total dirblock, without having to
create new PyString objects.

> 
> +def _read_dirblocks_c(state):
> +    """Read in the dirblocks for the given DirState object.
> +
> +    This is tightly bound to the DirState internal representation. It
> should be
> +    thought of as a member function, which is only separated out so
> that we can
> +    re-write it in pyrex.
> +
> +    :param state: A DirState object.
> +    :return: None
> +    """
> 
> The docstring could be a bit more clear that the result is that the
> dirblocks are in memory in state.
>

Done.

> 
> +cdef class Reader:
> +    """Maintain the current location, and return fields as you parse
> them."""
> 
>  Maintain the location within the dirstate file....
> 
> 
> +    def get_all_fields(self):
> +        """Get a list of all fields"""
> 
>   Get a list of all fields from one record of the dirstate?

This was just for debugging, I went ahead and removed it completely.

> 
> +    cdef object _get_entry(self, int num_trees, void **p_current_dirname,
> +                           int *new_block):
>   "Get one dirstate entry: (key, (tree_info, ...))"
> 
> Because this is a bit harder to read than regular python I'm thinking
> it's good to document it a bit more...
> 

I've added some comments. Especially around some less obvious pyrex-isms.

> +
> +    def init(self):
> +        """Get the pointer ready"""
> +        cdef char *first
> +        cdef int size
> +        # The first field should be an empty string left over from the
> Header
> +        first = self.get_next(&size)
> +        if first[0] != c'\0' and size == 0:
> +            raise AssertionError('First character should be null not: %s'
> +                                 % (first,))
> 
> Here I found myself wondering why this doesn't read the header - might
> be good to clarify that it's done by the python code (?) before this
> is invoked?
> 

Comment added. It is because you don't even get to _read_dirblocks until after
you have _read_header.

> +
> +    if not PyList_CheckExact(dirblocks):
> +        raise TypeError('you must pass a python list for dirblocks')
> +    _lo = lo
> +    if not PyString_CheckExact(dirname):
> +        raise TypeError('you must pass a string for dirname')
> 
> It can be a lot easier to debug these things if they occur for a user
> if you print the repr of the object that was actually passed.
> 

Done.

I can say that the python stdlib doesn't do this:

>>> os.lstat(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, int found

But I won't say it is particularly helpful.

> +    void *PyTuple_GetItem_void_void "PyTuple_GET_ITEM" (void* tpl, int
> index)
> 
> Why is this not _void_int?

Because the nomenclature is to indicate that it takes a void* instead of an
<object> and *returns* a void* rather than an <object>.

With Pyrex, if you name a function:

object function(object param)

It will do work to make sure objects are Py_INCREF and Py_DECREF at appropriate
 times. However, I'm explicitly working in void* because of wanting to work
closer to 'bare metal'.

Specifically, I'm pulling a string out of a tuple which is in a list, but all I
really want is the char* and int size.

So I'm doing:

void * cur

cur = PyTuple_GetItem_void_void(
        PyList_GetItem_object_void(list, offset))
cur_cstr = PyString_AS_STRING_void(cur)
cur_size = PyString_GET_SIZE_void(cur)

What I'm really trying to do here is work in terms of the PyObject pointers,
rather than a higher level "<object>" abstraction.

I could change these to use PyObject* instead of void*, but then I feel it is a
bit harder to tell the difference between PyObject *foo, and object foo. And
when I first wrote it, I didn't realize you could do PyObject*.

If you feel like it would be more obvious/better/something a different way, I'm
willing to change it.

> 
> I do wonder if it would be simpler to just split the paths into lists
> as we read them in.  I suppose that would require considerably more
> allocation, and anyhow would be out of scope for this change.

Simpler to split the paths into lists as we read them in...

You mean instead of having a dirblock record be:

(dirname, name, file_id)

Such as:

('foo/bar', 'baz', 'baz-200720707-aoudnatoheu-1')

You would have it be:

(['foo', 'bar'], 'baz', 'baz-200720707')

That certainly would be possible, and probably wouldn't be a lot of overhead
during parsing. Right now _get_entry is already designed to reuse the 'dirname'
object for all entries in the same directory (saves memory, and should save time).

What you would add is a .split() call for all directories (~1 in 10 entries are
directories in large projects).

It would make bisecting more straightforward, but I don't know that it would be
faster. cmp_by_dirs can actually be pretty fast because it is only a slight
variation on memcmp based on how the first unmatched character differs. Versus
comparing 2 lists.

> 
> +
> +    # Use 32-bit comparisons for the matching portion of the string.
> +    # Almost all CPU's are faster at loading and comparing 32-bit
> integers,
> +    # than they are at 8-bit integers.
> +    # TODO: jam 2007-05-07 Do we need to change this so we always start
> at an
> +    #       integer offset in memory? I seem to remember that being
> done in
> +    #       some C libraries for strcmp()
> 
> As Jan said, yes, unaligned access can cause a bus error on some
> architectures.  I would be inclined to not worry about that for now.
> The code to get this right is a bit complex and anyhow for short
> components it may be no slower.  If you consider that many paths might
> have components of only 6 characters or so then comparing by ints may
> not be useful.

I went ahead and did an aligned check, and just skip this loop if either string
is not aligned.

> 
> Uh, also you can't rely on ints being 4 characters.  (Unless I'm
> missing something and they reliably are in pyrex?)

I don't care if they are 4-chars. As long as they are a faster way to iterate
through memory. I realized I have a typo, in that I used '%4' instead of '%
sizeof(int)'. But I went ahead and fixed that.

> 
> More importantly: for python are we sure we're comparing the utf-8
> values, not the interpreted unicode.  I'm pretty sure we are.  But in
> utf-8 we will have some characters with the high bit set, and those
> will compare the wrong way if you compare them as signed ints and
> chars.  I think you definitely need some tests of this on utf-8 paths.

Good points. I'll add the tests and switch them over to 'unsigned char' to be
explicit. (good catch, on my Mac at least, char* is a signed char, which
evaluates incorrectly)
I'll also add a PyString_CheckExact() call for the cmp_by_dirs() function,
which ensures that you have a real PyString, and not a PyUnicode.

As far as worrying about the 'signed ints' portion, it doesn't matter, because
all I evaluate for there is equality. I don't compare whether one is > or < the
other. I only evaluate that on the individual characters level. (If I find int
!= int, I have to switch to per-character comparison anyway).

> 
> Again, since we shouldn't have nuls in the strings and Python strings
> are guaranteed to be terminated I don't think you need to explicitly
> pass the length, which should make a lot of the code smaller.

Then you can't use the integer looping, unless you add code which checks for a
NULL inside a 32-bit integer.

I'm sure I've seen some really cool hack to do so, as part of one of the libc's
I've seen. But that just adds a bit more hackery.

Also, as Andrew has pointed out, being nul safe is just good practise.

> 
> +cdef int _cmp_path_by_dirblock(char *path1, int path1_len,
> +                               char *path2, int path2_len):
> 
> Probably needs a docstring.
> 
> +cdef object _pystr(char *s, int length):
> 
> This doesn't seem to be called?

Removed. It was used at some point, but isn't anymore.

> 
> Adding Pyrex makes it more important that we get some automatic cross
> platform testing.

Certainly. I can say that I've been running all of the dirstate tests on Mac,
but that isn't the same as having automated tests for all of it.
It also makes things more interesting on non x86 platforms. Like AMD64 and such.

> 
> I haven't read the benchmarks.  The bisection code looks ok.  But I
> think the size, alignment, and signedness issues need to be addressed.
> 

(Using a merge-directive diff, because the bundle has gotten pretty huge)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGnnk8JdeBCYSNAAMRAnBoAJ9FuQyYfcme9xqipUuh+6CNogwlGgCfdj25
KPh4qROfbzNSusAiCpQjXaE=
=vgdO
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dirstate_pyrex.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20070718/47b6e37b/attachment-0001.diff