[MERGE] Read .kndx files in pyrex (updated)

Fri Jun 29 18:17:43 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Attached is a revised version which should address all of your comments.

Matthew D. Fuller wrote:
> On Thu, Jun 28, 2007 at 10:55:32PM -0500 I heard the voice of
> John Arbash Meinel, and lo! it spake thus:
>> I was trying to document that, but if you feel it would be better in
>> a different way (or if memchr() is a better function to use, and we
>> know it is portable), then I'll switch.
>
> My memchr(3) manpage says:
>
> STANDARDS
>      The memchr() function conforms to ISO/IEC 9899:1990 (``ISO C90'').
>
> so it's probably a little late in the game for a platform to not have
> it...
>

I've switched all calls to strchr() over to memchr(). It is slightly
uglier because you have to cast back to a char * from void*, but not a
big deal.

>
>> As I understand it, there isn't a "strtoi".
>
> There isn't.  int foo = (int)strtol(...) is the standard phraseology
> (or you may want an unsigned int and strtoul() in this case...).  Of
> course, you could also just make int_parent a long, but what effects
> that might have up-stack I don't know.

Technically those locations should only be positive integers (since they
are offset into a file, parent number, size, etc). But the current code
uses 'int()' and doesn't check that the return value is negative. So I
kept with it.

The main effect "up the stack" is that anything larger than an "int" is
converted to a PyLong instead of a PyInt. Which is about slightly slower.

% TIMEIT '100L + 200L'
10000000 loops, best of 3: 0.101 usec per loop

% TIMEIT '100 + 200'
10000000 loops, best of 3: 0.0809 usec per loop

It is a small thing, but trying to use PyInt where we can is probably an
overall win.

Robert Collins wrote:
> PQM is using dapper; I don't have an immediate answer for this, but I've
> seen it before and fixed it in the fast stat fingerprint code I wrote in
> pyrex. HTH.
> 
> -Rob
> -- 

I decided to write the tests like:

try:
    self.assertRaises(errors.KnitCorrupt,
                      self.get_knit_index, transport, 'filename', 'r')
except TypeError, e:
    if (str(e) == ('exceptions must be strings, classes, or instances,'
                   ' not exceptions.IndexError')
        and sys.version_info[0:2] >= (2,5)):
        self.knownFailure('Pyrex <0.9.5 fails with TypeError when'
                          ' raising new style exceptions with python'
                          ' >=2.5')
    else:
        raise

This way I'm ensuring that exactly the expected TypeError is being
raised, and that it is happening under the correct circumstances. (As I
mentioned in another email, I could also check the Pyrex version, but
only if they have pyrex actually installed).

Andrew Bennetts wrote:
> Overall, this looks pretty good to me.  I have some comments...
> 
>> +    cdef object process_options(self, char *option_str, char *end):
>> +        """Process the options string into a list."""
>> +        cdef char *next
>> +
>> +        # options = PyString_FromStringAndSize(option_str,
>> +        #                                      end-option_str)
>> +        # return options.split(',')
> 
> Don't leave commented out code lying around (unless there's a reason, in which
> case there should be a comment saying what it is).

I changed these to comment:

# This is alternative code which creates a python string and splits it.
# It is "correct" and more obvious, but slower than the following code.
# It can be uncommented to switch in case the other code is seen as
# suspect.

If we prefer to just remove it altogether, that is ok.

> 
>> +        final_options = []
>> +
>> +        while option_str < end:
>> +            # Using strchr here actually hurts performance dramatically.
>> +            # Because you aren't guaranteed to have a ',' any time soon,
>> +            # so it may have to search for a long time.
>> +            # The closest function is memchr, but that seems to be a
>> +            # GNU extension.
>> +            next = self._end_of_option(option_str, end)
> 
> I don't understand how _end_of_option is going to be faster than strchr?
> Doesn't it do essentially the same thing?
> 
> (_end_of_option is safer though, because it won't keep searching for a ','
> indefinitely even if there's no NULL, so I'm fine with having it, I just don't
> understand this comment.  Or is that what you're alluding to?)

As I mentioned in the other email, it is because strchr() would search
past the local end (' '). But I've gone ahead and just changed it over
to memchr().

Overall, I've actually seen a 10% speed improvement with these changes
(342ms => 333ms). Whether that is because memchr() is faster than my
custom version, or if memchr() is universally faster than strchr (I
suppose it doesn't have to check every character for NULL).

...

>> +
>> +        parents = []
>> +        while parent_str <= end and parent_str != NULL:
> 
> How could parent_str be NULL here?  The only way I can see that happening is if
> it is passed as NULL to process_parents, but in that case there's no need to
> check it every time around the loop.
> 
> (Although I'm more interesting in clarity than the possible micro-optimisation.)

Actually, the calling code can't pass it in as NULL, because there is
already a check for that case, and it indicates that the record is
corrupt (not enough sections).

> 
>> +            # strchr is safe here, because our lines always end
>> +            # with ' :'
> 
> Unless the file is corrupt?  What if someone maliciously creates a branch with
> this sort of kndx?
> 
> *Never* trust input.
> 

Then you would run off the end and hit the final NULL. However, I did
switch to memchr().

>> +                # This in an integer mapping to original
>> +                # TODO: ensure that we are actually parsing the string
>> +                int_parent = strtol(parent_str, &parent_end, 10)
> 
> You declared int_parent as an "int", but strtol returns a "long int".

I refactored this into using a helper function which uses strtol() and
raises a ValueError if it fails.

> Again, you're using strchr.  That makes me nervous.
> 

switched to memchr()

> 
>> +        # TODO: Check that we are actually reading integers
>> +        pos = strtol(pos_str, NULL, 10)
>> +        size = strtol(size_str, NULL, 10)
> 
> If there isn't already a function somewhere for converting a string to an
> integer and making sure the whole string is consumed, you ought to make a
> function out of it so you can avoid scattering this TODO everywhere.  You've
> already written that logic once.

Actually, I didn't write that logic until very recently. Because we
didn't have any tests that checked that the entries in the .kndx were
valid integers. But since I added them, I had to check :)

I also added 2 new tests that ensure the 'position' and 'size' variables
are valid integers or it raises KnitCorrupt.

> 
> [...]
>> +            index = <object>PyTuple_GetItem_void_void(cache_entry, 5)
> 
> My pyrex is rusty, but what happens with refcounting when you cast like this?
> It's not immediately obvious to me that this will do the right thing.
> 
> If it is correct, it's probably worth adding a brief comment explaining why.
> 

commented

>> +        # Find the next newline
>> +        last = strchr(start, c'\n')
> 
> As usual, strchr makes me worried.  Is it safe to assume these strings really
> are NULL terminated?

The only function I use to get the data from the global knit is
PyString_AsString, which:
http://www.python.org/doc/current/api/stringObjects.html

says returns a NUL terminated string. So yes, I'm positive that at the
end of the buffer will be a NUL.

> 
> Conversely, is it safe to assume that there are no NULLs in .kndx files that
> might confuse this logic in dangerous ways?

I went ahead and did a bit of an audit of what C functions I'm
importing, and I removed all of the ones I wasn't. It amounts to:

strtol:
	This is only used in string_to_int_safe which ensures that we
        consume exactly as many characters as we expect to consume.

PyString_AsString:
	Used to get the char* buffer out of a string.
	Also, I only use it in conjunction with:

PyString_Size:
	To get the total length of the string buffer, and set the
	'end_str' pointer. I've switched over to memchr() and all
	accesses should be bound either my self.end_str or something
	smaller (depending on what string I am currently processing.)

PyString_FromStringAndSize:
	Whenever I create a PyString object, I explicitly set the size
	of the memory buffer.

PyList_CheckExact:
	Used to make sure the kndx.history we are using is exactly a
	List object. Which is what makes it safe to use:

PyList_GET_ITEM:
	Used to directly access a List member. No error checking, but we
	check before we enter.

PyList_Append:
	I'm not sure if this does error checking, but I know it can
	raise an exception.

Py_INCREF:
	PyList_GET_ITEM doesn't increment the reference counter, so the
	one place that I use it, I make sure to Py_INCREF

PyDict_CheckExact:
	Used to make sure kndx.cache is exactly a dictionary so that I
	can use:

PyDict_GetItem_void:
	We use this form because it can return NULL if the key doesn't
	exist. This allows us to do the equivalent of:
	if x in dict:
	   y = dict[x]
	else:
	   something else

	without having to do 2 dictionary lookups. (The alternative is a
	try/except KeyError but that also requires handling the stack
	trace, etc.)

PyDict_SetItem
	Used to add entries to the cache

PyTuple_GetItem_void_void
	Used because we use PyDict_GetItem_void but we only care about
	one of the members. I think we could use
	PyTuple_GetItem_from_void, and then use Py_INCREF manually like
	I did for PyList_GetItem

> 
> I've only skimmed the non-.pyx files.
> 
> C code is *much* easier to screw up in dangerous ways than Python code.  Perhaps
> it would be good to consider adding some fuzz-style tests to try make sure we
> don't have any nasty bugs lurking.

I think we are pretty safe, just by auditing what C functions we are
using. Getting away from strchr is a good idea, and I think we are left
with relatively safe actions.

> 
> It's also easier to screw up memory management and have a leak.  It may be good
> to have test or script that just exercises this code over and over so we can
> check the memory usage doesn't climb forever.
> 
> -Andrew.
> 

I haven't done this yet. I would be willing to if we feel it is
important. There is always a tradeoff between how much effort you put
into it versus the risk of accepting it. (Arguably memory leaks are
approximately as common in Python code because you can have reference
cycles, etc, and there isn't an easy way to measure memory consumption
as it isn't something Python exposes).

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhT63JdeBCYSNAAMRAppRAJ9EpNYuflKoVBTWMdJ7BDCEy8BQbQCdEkmp
tm7cqSej+8RzRQ7DW/jZrxs=
=MQWe
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: knit_index_pyrex.patch
Type: text/x-patch
Size: 224366 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070629/24d8cdc9/attachment-0001.bin