[MERGE] Read .kndx files in pyrex

Fri Jun 29 04:55:32 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Overall, this looks pretty good to me.  I have some comments...
> 
>> +    cdef object process_options(self, char *option_str, char *end):
>> +        """Process the options string into a list."""
>> +        cdef char *next
>> +
>> +        # options = PyString_FromStringAndSize(option_str,
>> +        #                                      end-option_str)
>> +        # return options.split(',')
> 
> Don't leave commented out code lying around (unless there's a reason, in which
> case there should be a comment saying what it is).

The commented code is actually reference code. Which is correct, but slow.

It gives me a way to make sure that I was doing what I expected. I suppose I
could put the statement: "The following code is equivalent to this shorter code
that relies on creating python objects".

> 
>> +        final_options = []
>> +
>> +        while option_str < end:
>> +            # Using strchr here actually hurts performance dramatically.
>> +            # Because you aren't guaranteed to have a ',' any time soon,
>> +            # so it may have to search for a long time.
>> +            # The closest function is memchr, but that seems to be a
>> +            # GNU extension.
>> +            next = self._end_of_option(option_str, end)
> 
> I don't understand how _end_of_option is going to be faster than strchr?
> Doesn't it do essentially the same thing?
> 
> (_end_of_option is safer though, because it won't keep searching for a ','
> indefinitely even if there's no NULL, so I'm fine with having it, I just don't
> understand this comment.  Or is that what you're alluding to?)
> 

Exactly. I couldn't find a libc functional equivalent of strnchr(). I guess
there is memchr().

The specific reason is because I have a single buffer in memory, and I only
want to search within a given line for ','. Without stopping early, the code
spends far too long searching into the *next* line. Actually, I only really
want to search the next few bytes (until the next whitespace character).

I was trying to document that, but if you feel it would be better in a
different way (or if memchr() is a better function to use, and we know it is
portable), then I'll switch.

>> +    cdef object process_parents(self, char *parent_str, char *end):
>> +        cdef char *next
>> +        cdef int int_parent
>> +        cdef char *parent_end
>> +
>> +        # parents = PyString_FromStringAndSize(parent_str,
>> +        #                                      end - parent_str)
>> +        # real_parents = []
>> +        # for parent in parents.split():
>> +        #     if parent[0].startswith('.'):
>> +        #         real_parents.append(parent[1:])
>> +        #     else:
>> +        #         real_parents.append(self.history[int(parent)])
>> +        # return real_parents
> 
> Again, mysterious commented out code.

Again, reference code which uses Python objects rather than working within a
single string buffer.

> 
>> +
>> +        parents = []
>> +        while parent_str <= end and parent_str != NULL:
> 
> How could parent_str be NULL here?  The only way I can see that happening is if
> it is passed as NULL to process_parents, but in that case there's no need to
> check it every time around the loop.

I think I was originally doing something like:

next = strchr(parent_str, c' ')
...
parent_str = next

sort of thing. But I explicitly check for next == NULL, so you are right that
it isn't needed.

> 
> (Although I'm more interesting in clarity than the possible micro-optimisation.)
> 
>> +            # strchr is safe here, because our lines always end
>> +            # with ' :'
> 
> Unless the file is corrupt?  What if someone maliciously creates a branch with
> this sort of kndx?
> 
> *Never* trust input.
> 

You will still have a NULL at the end of the file, right?

>> +                # This in an integer mapping to original
>> +                # TODO: ensure that we are actually parsing the string
>> +                int_parent = strtol(parent_str, &parent_end, 10)
> 
> You declared int_parent as an "int", but strtol returns a "long int".
> 

As I understand it, there isn't a "strtoi". There is only atoi (which doesn't
let you check for not processing all characters) and strtol. if you don't like
it, I can do:

int_parent = <int> strtol(...)

I honestly don't care to try and make our code support > 2^32 entries in a
single knit index.

>> +    cdef int process_one_record(self, char *start, char *end) except -1:
>> +        """Take a simple string and split it into an index record."""
>> +        cdef char *version_id_str
>> +        cdef int version_id_size
>> +        cdef char *option_str
>> +        cdef char *option_end
>> +        cdef char *pos_str
>> +        cdef int pos
>> +        cdef char *size_str
>> +        cdef int size
>> +        cdef char *parent_str
>> +        cdef int parent_size
>> +        cdef void *cache_entry
>> +
>> +        version_id_str = start
>> +        option_str = strchr(version_id_str, c' ')
> 
> Again, you're using strchr.  That makes me nervous.
> 
> 
>> +        # TODO: Check that we are actually reading integers
>> +        pos = strtol(pos_str, NULL, 10)
>> +        size = strtol(size_str, NULL, 10)
> 
> If there isn't already a function somewhere for converting a string to an
> integer and making sure the whole string is consumed, you ought to make a
> function out of it so you can avoid scattering this TODO everywhere.  You've
> already written that logic once.
> 
> [...]
>> +            index = <object>PyTuple_GetItem_void_void(cache_entry, 5)
> 
> My pyrex is rusty, but what happens with refcounting when you cast like this?
> It's not immediately obvious to me that this will do the right thing.
> 
> If it is correct, it's probably worth adding a brief comment explaining why.
>

When you cast to an <object> it increments the ref counter. I would be happy to
document that.

>> +        # Find the next newline
>> +        last = strchr(start, c'\n')
> 
> As usual, strchr makes me worried.  Is it safe to assume these strings really
> are NULL terminated?
> 
> Conversely, is it safe to assume that there are no NULLs in .kndx files that
> might confuse this logic in dangerous ways?

If you hit a \0 early, I think you'll just mark it as corrupt. We can certainly
add tests for that sort of thing.

Note that with the python code, '\0' sitting around will just do weird things.
Like having a revision id with a null in the middle.

> 
> I've only skimmed the non-.pyx files.
> 
> C code is *much* easier to screw up in dangerous ways than Python code.  Perhaps
> it would be good to consider adding some fuzz-style tests to try make sure we
> don't have any nasty bugs lurking.

We can do it. But writing tests that make sure you don't leak memory is a
little odd. Especially since they would tend to be "do this a whole lot of
times just in case something is broken".

At least, it doesn't fit my idea of what unit testing is all about.

If we want to be extra safe, we could count the number of references to some
specific class. I forget exactly how to do it, but I thought you could go
through gc or some other such thing, and get a count of how many references an
object had. Which in turn would let you run the function, count the references
to something, and then run it again and see if they went up/down.

You'd have to watch out that your monitoring code didn't cause measuring
complications, though.

> 
> It's also easier to screw up memory management and have a leak.  It may be good
> to have test or script that just exercises this code over and over so we can
> check the memory usage doesn't climb forever.
> 
> -Andrew.

I can see that being an external thing, but I'd be fine with having it.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhIK0JdeBCYSNAAMRAq8AAKC96xB2b6lPC/GolN+NCqoHe15KzwCg0YWJ
NPjVfmXHXW1cIP5uF4Iwjcg=
=SOVx
-----END PGP SIGNATURE-----