[MERGE] Read .kndx files in pyrex

Fri Jun 29 03:47:11 BST 2007

Overall, this looks pretty good to me.  I have some comments...

> +    cdef object process_options(self, char *option_str, char *end):
> +        """Process the options string into a list."""
> +        cdef char *next
> +
> +        # options = PyString_FromStringAndSize(option_str,
> +        #                                      end-option_str)
> +        # return options.split(',')

Don't leave commented out code lying around (unless there's a reason, in which
case there should be a comment saying what it is).

> +        final_options = []
> +
> +        while option_str < end:
> +            # Using strchr here actually hurts performance dramatically.
> +            # Because you aren't guaranteed to have a ',' any time soon,
> +            # so it may have to search for a long time.
> +            # The closest function is memchr, but that seems to be a
> +            # GNU extension.
> +            next = self._end_of_option(option_str, end)

I don't understand how _end_of_option is going to be faster than strchr?
Doesn't it do essentially the same thing?

(_end_of_option is safer though, because it won't keep searching for a ','
indefinitely even if there's no NULL, so I'm fine with having it, I just don't
understand this comment.  Or is that what you're alluding to?)

> +    cdef object process_parents(self, char *parent_str, char *end):
> +        cdef char *next
> +        cdef int int_parent
> +        cdef char *parent_end
> +
> +        # parents = PyString_FromStringAndSize(parent_str,
> +        #                                      end - parent_str)
> +        # real_parents = []
> +        # for parent in parents.split():
> +        #     if parent[0].startswith('.'):
> +        #         real_parents.append(parent[1:])
> +        #     else:
> +        #         real_parents.append(self.history[int(parent)])
> +        # return real_parents

Again, mysterious commented out code.

> +
> +        parents = []
> +        while parent_str <= end and parent_str != NULL:

How could parent_str be NULL here?  The only way I can see that happening is if
it is passed as NULL to process_parents, but in that case there's no need to
check it every time around the loop.

(Although I'm more interesting in clarity than the possible micro-optimisation.)

> +            # strchr is safe here, because our lines always end
> +            # with ' :'

Unless the file is corrupt?  What if someone maliciously creates a branch with
this sort of kndx?

*Never* trust input.

> +                # This in an integer mapping to original
> +                # TODO: ensure that we are actually parsing the string
> +                int_parent = strtol(parent_str, &parent_end, 10)

You declared int_parent as an "int", but strtol returns a "long int".

> +    cdef int process_one_record(self, char *start, char *end) except -1:
> +        """Take a simple string and split it into an index record."""
> +        cdef char *version_id_str
> +        cdef int version_id_size
> +        cdef char *option_str
> +        cdef char *option_end
> +        cdef char *pos_str
> +        cdef int pos
> +        cdef char *size_str
> +        cdef int size
> +        cdef char *parent_str
> +        cdef int parent_size
> +        cdef void *cache_entry
> +
> +        version_id_str = start
> +        option_str = strchr(version_id_str, c' ')

Again, you're using strchr.  That makes me nervous.

> +        # TODO: Check that we are actually reading integers
> +        pos = strtol(pos_str, NULL, 10)
> +        size = strtol(size_str, NULL, 10)

If there isn't already a function somewhere for converting a string to an
integer and making sure the whole string is consumed, you ought to make a
function out of it so you can avoid scattering this TODO everywhere.  You've
already written that logic once.

[...]
> +            index = <object>PyTuple_GetItem_void_void(cache_entry, 5)

My pyrex is rusty, but what happens with refcounting when you cast like this?
It's not immediately obvious to me that this will do the right thing.

If it is correct, it's probably worth adding a brief comment explaining why.

> +        # Find the next newline
> +        last = strchr(start, c'\n')

As usual, strchr makes me worried.  Is it safe to assume these strings really
are NULL terminated?

Conversely, is it safe to assume that there are no NULLs in .kndx files that
might confuse this logic in dangerous ways?

I've only skimmed the non-.pyx files.

C code is *much* easier to screw up in dangerous ways than Python code.  Perhaps
it would be good to consider adding some fuzz-style tests to try make sure we
don't have any nasty bugs lurking.

It's also easier to screw up memory management and have a leak.  It may be good
to have test or script that just exercises this code over and over so we can
check the memory usage doesn't climb forever.

-Andrew.