[MERGE] Pyrex RIO implementation

Jelmer Vernooij jelmer at samba.org
Thu May 14 21:04:54 BST 2009

On Thu, May 14, 2009 at 12:35:56PM -0500, John Arbash Meinel wrote:
> Jelmer Vernooij wrote:
> > This patch adds a Pyrex implementation of RIO that is faster than the
> > current implementation of RIO in "plain" Python. This is my first
> > Pyrex extension for Bazaar so I hope I've updated all the right places
> > :-)
> > 
> > The RIO tests pass with this patch, and with my RIOSerializer patch
> > and this patch merged I now get faster deserialization using RIO than
> > using XML:
> > 
> I haven't done a full review, but at least this gives you stuff to work on.

> If you are going to do this, then we need a
> bzrlib/tests/test__rio.py
> file which has permutation tests for both implementations. We have stuff
> like that in test__dirstate_helpers.py or test__chk_map.py,
> test__chunks_to_lines.py, etc.
> We need to be extra careful with C extensions, and make sure to test all
> the edge cases, so we don't get core dumps, etc.
Done, I think I've covered most cases.

> So I would definitely want to see you get rid of 'strcmp' in favor of
> 'memcmp', or possibly 'strncmp'. But Python strings are allowed to have
> '\0' in them.
Thanks, I've now eliminated strcmp altogether.

> I think using 'line in line_iter' is going to be okay, though probably
> not the fastest possible. For compatibility, we can live with it.
Yeah, this seems reasonable.

> However, I would certainly avoid doing DecodeUTF8 on every line.
> Instead, I would build up a list of character buffers, concatenate them,
> and then call PyUnicode_DecodeUTF8 on that.
> +            colon = <char *>strstr(c_line, ": ")
> ^- strstr is also not strictly safe. I think it would be better to do
> something with 'memchr' since that gives bounds on the search length.
> Since that only looks for a single byte, you may want to factor out a
> simple helper that searches for ':' and then checks that the next
> character is ' '.
Thanks, fixed now as well. I just added a function that does a split
based on ": " and returns a plain string for the left side and a
unicode string (decoded from UTF-8) for the right side.

> I'm not positive whether ':' would be allowed in a tag name, but I don't
> think it should.
No, it's not allowed now either.

> +def _valid_tag(tag):
> +    cdef char *c_tag
> +    cdef Py_ssize_t c_len
> +    cdef int i
> +    if not PyString_CheckExact(tag):
> +        raise TypeError(tag)
> +    c_tag = PyString_AS_STRING(tag)
> +    c_len = PyString_GET_SIZE(tag)
> +    for i from 0 <= i < c_len:
> +        if (not isalnum(c_tag[i]) and not c_tag[i] == c'_' and
> ^- I would probably be surprised if this is actually faster than the
> regex check.
Yeah, that was actually the reason I put it in. It matters quite a bit
(a few percent of the total).

> I'd also be somewhat concerned if 'isalnum' is designed as being 'locale
> aware' because we *don't* want that here.
> Though I doubt this code path is actually critical. I'm just not sure if
> I trust 'isalnum()'.
Thanks, fixed now (explicit check, as you suggested).

> There is other stuff like using "PyList_Append(accum_value, X)" instead
> of accum_value.append(X).
Thanks, fixed too.

> If I really wanted it fast, I would probably allocate a single buffer at
> the top of a fixed size (8kB, for example). And then memcpy each lines
> information into it. If I ran out of space, then realloc(size*3/2) or
> something like that. And then when done, you can just do:
> value = PyUnicode_DecodeUTF8(buffer, bytes_used-1, "strict")
> and you don't have to worry about any of the list append calls, or
> decoding bits as you go, etc.
> You also get to *re-use* this buffer for every key:value pair, since it
> is only used until you finish.
I've implemented this, too.

> You unfortunately don't get to share this buffer between Stanzas. You
> *could*, though I'd want to make sure to memory cap it to something
> reasonable, so that decoding a giant Revision text won't leave us with
> lots of wasted memory just sitting around.
I think the current improvements make things fast enough. Doing a
malloc (there'll usually just be one or two per revision) doesn't seem
like a huge problem to me.

> I think we'll want to optimize the 'read_stanza_unicode' specifically,
> because it will affect any nested structures, like the revision
> properties, which also happens to be where per-file commit info is
> stored for MySQL, which is going to be one of the big performance bits.
> (And is what showed the problem with the old 'accum_value += value')
I've done this as well, optimizing it similarly to the way it is
already done for utf8 strings.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: rio-c2.diff
Type: text/x-diff
Size: 36015 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20090514/741fb6f0/attachment-0001.bin 
