[MERGE] Pyrex RIO implementation

John Arbash Meinel john at arbash-meinel.com
Thu May 14 18:35:56 BST 2009


Jelmer Vernooij wrote:
> This patch adds a Pyrex implementation of RIO that is faster than the
> current implementation of RIO in "plain" Python. This is my first
> Pyrex extension for Bazaar so I hope I've updated all the right places
> :-)
> 
> The RIO tests pass with this patch, and with my RIOSerializer patch
> and this patch merged I now get faster deserialization using RIO than
> using XML:
> 

I haven't done a full review, but at least this gives you stuff to work on.

BB:resubmit

If you are going to do this, then we need a

bzrlib/tests/test__rio.py

file which has permutation tests for both implementations. We have
examples of that in test__dirstate_helpers.py, test__chk_map.py,
test__chunks_to_lines.py, etc.

We need to be extra careful with C extensions, and make sure to test all
the edge cases, so we don't get core dumps, etc.
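
Roughly the shape I have in mind (untested sketch; exactly how the
permutations get wired up should follow test__dirstate_helpers.py, and the
compiled-module handling below is just a placeholder):

# bzrlib/tests/test__rio.py -- sketch only
from bzrlib import rio
from bzrlib.tests import TestCase


class TestReadStanza(TestCase):
    # 'module' gets replaced per permutation (pure-Python rio vs. the
    # compiled extension, when it is available).
    module = rio

    def test_empty_input_is_end_of_file(self):
        self.assertEqual(None, self.module.read_stanza(iter([])))

    def test_blank_line_ends_stanza(self):
        stanza = self.module.read_stanza(iter(['tag: value\n', '\n']))
        self.assertEqual('value', stanza.get('tag'))

    def test_embedded_nul_is_not_a_terminator(self):
        # this is where strcmp()-style bugs would show up
        stanza = self.module.read_stanza(iter(['tag: a\x00b\n', '\n']))
        self.assertEqual(u'a\x00b', stanza.get('tag'))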

...

So I would definitely want to see you get rid of 'strcmp' in favor of
'memcmp', or possibly 'strncmp': Python strings are allowed to have
'\0' in them, so a NUL-terminated comparison can treat '\0' as the end
of the line when it isn't.

But if you look at your code, instead of:

+        if not PyString_CheckExact(line):
+            raise TypeError("%r is not a line" % line)
+        c_line = PyString_AS_STRING(line)
+        c_len = PyString_GET_SIZE(line)
+        if strcmp(c_line, "") == 0:
+            break       # end of file
+        if strcmp(c_line, "\n") == 0:
+            break       # end of stanza

you can do:

+        if not PyString_CheckExact(line):
+            raise TypeError("%r is not a line" % line)
+        c_line = PyString_AS_STRING(line)
+        c_len = PyString_GET_SIZE(line)
+        if c_len < 1:
+            break       # end of file
+        if c_len == 1 and c_line[0] == c'\n':
+            break       # end of stanza
+        if c_line[0] == c'\t': # continues previous value
+            if tag is None:
+                raise ValueError('invalid continuation line %r' % line)
+            new_value = PyUnicode_DecodeUTF8(c_line+1, c_len-1, "strict")


I think using 'for line in line_iter' is going to be okay, though
probably not the fastest possible. For compatibility, we can live with it.

However, I would certainly avoid doing DecodeUTF8 on every line.
Instead, I would build up a list of character buffers, concatenate them,
and then call PyUnicode_DecodeUTF8 on that.
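
Something along these lines (untested sketch; '_finish_value' is a made-up
name, and the externs are only repeated here for clarity):

cdef extern from "Python.h":
    object PyString_FromStringAndSize(char *v, Py_ssize_t size)
    char *PyString_AS_STRING(object s)
    Py_ssize_t PyString_GET_SIZE(object s)
    object PyUnicode_DecodeUTF8(char *s, Py_ssize_t size, char *errors)


cdef object _finish_value(chunks):
    # Join the accumulated byte chunks and decode the whole value once.
    joined = ''.join(chunks)
    return PyUnicode_DecodeUTF8(PyString_AS_STRING(joined),
                                PyString_GET_SIZE(joined), "strict")

# In the parsing loop, rather than decoding each continuation line:
#     chunks.append(PyString_FromStringAndSize(c_line + 1, c_len - 1))
# and only when the tag's value is complete:
#     value = _finish_value(chunks)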

+            colon = <char *>strstr(c_line, ": ")
^- strstr is also not strictly safe. I think it would be better to do
something with 'memchr' since that gives bounds on the search length.
Since that only looks for a single byte, you may want to factor out a
simple helper that searches for ':' and then checks that the next
character is ' '.
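
Just to sketch it (untested, and '_find_colon_space' is a made-up name):

cdef extern from *:
    ctypedef unsigned long size_t

cdef extern from "string.h":
    void *memchr(char *s, int c, size_t n)


cdef char *_find_colon_space(char *line, Py_ssize_t size):
    # Return a pointer to the ': ' separator within line[:size], else NULL.
    # memchr is bounded by 'size', so an embedded '\0' can't send the
    # search past the end of the buffer the way strstr() could.
    cdef char *colon
    colon = <char *>memchr(line, c':', size)
    if colon == NULL:
        return NULL
    if colon + 1 >= line + size or colon[1] != c' ':
        return NULL
    return colon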

I'm not positive whether ':' would be allowed in a tag name, but I don't
think it should.

+def _valid_tag(tag):
+    cdef char *c_tag
+    cdef Py_ssize_t c_len
+    cdef int i
+    if not PyString_CheckExact(tag):
+        raise TypeError(tag)
+    c_tag = PyString_AS_STRING(tag)
+    c_len = PyString_GET_SIZE(tag)
+    for i from 0 <= i < c_len:
+        if (not isalnum(c_tag[i]) and not c_tag[i] == c'_' and

^- I would probably be surprised if this is actually faster than the
regex check.

I'd also be somewhat concerned that 'isalnum' is locale-aware, because
we *don't* want that here.

Another possible check would be:

cdef char c  # 'int' might actually be faster
...
c = c_tag[i]
# sorted in approximate order of likelihood
if not ((c >= c'a' and c <= c'z')
        or c == c'_' or c == c'-'
        or (c >= c'A' and c <= c'Z')
        or (c >= c'0' and c <= c'9')):
    return False


Though I doubt this code path is actually critical. I'm just not sure if
I trust 'isalnum()'.

There is other stuff like using "PyList_Append(accum_value, X)" instead
of accum_value.append(X).
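
In Pyrex that is just a matter of declaring it with 'except -1' so errors
still propagate (sketch):

cdef extern from "Python.h":
    int PyList_Append(object lst, object item) except -1

# then in the parsing loop:
#     PyList_Append(accum_value, new_value)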

If I really wanted it fast, I would probably allocate a single buffer of
a fixed size (8kB, for example) at the top, and then memcpy each line's
information into it. If it ran out of space, then realloc(size*3/2) or
something like that. And then when done, you can just do:

value = PyUnicode_DecodeUTF8(buffer, bytes_used-1, "strict")

and you don't have to worry about any of the list append calls, or
decoding bits as you go, etc.

You also get to *re-use* this buffer for every key:value pair, since it
is only used until you finish.
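
Very roughly (untested sketch, names made up):

cdef extern from *:
    ctypedef unsigned long size_t

cdef extern from "stdlib.h":
    void *malloc(size_t size)
    void *realloc(void *ptr, size_t size)
    void free(void *ptr)

cdef extern from "string.h":
    void *memcpy(char *dest, char *src, size_t n)

cdef extern from "Python.h":
    object PyUnicode_DecodeUTF8(char *s, Py_ssize_t size, char *errors)


cdef class _ValueBuffer:
    # A grow-on-demand byte buffer, reused for each key:value pair.

    cdef char *buf
    cdef Py_ssize_t size
    cdef Py_ssize_t used

    def __init__(self):
        self.size = 8192                  # fixed starting size
        self.used = 0
        self.buf = <char *>malloc(self.size)
        if self.buf == NULL:
            raise MemoryError()

    def __dealloc__(self):
        if self.buf != NULL:
            free(self.buf)

    cdef append(self, char *data, Py_ssize_t length):
        # memcpy one line's bytes onto the end, growing by ~1.5x as needed.
        cdef char *new_buf
        while self.used + length > self.size:
            self.size = self.size * 3 / 2
            new_buf = <char *>realloc(self.buf, self.size)
            if new_buf == NULL:
                raise MemoryError()
            self.buf = new_buf
        memcpy(self.buf + self.used, data, length)
        self.used = self.used + length

    cdef object finish(self):
        # Decode everything accumulated (minus the trailing '\n', which is
        # assumed to have been copied in) in one call, then reset so the
        # same buffer is reused for the next value.
        value = PyUnicode_DecodeUTF8(self.buf, self.used - 1, "strict")
        self.used = 0
        return value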

You unfortunately don't get to share this buffer between Stanzas. You
*could*, though I'd want to cap its memory at something reasonable, so
that decoding a giant Revision text won't leave us with lots of wasted
memory just sitting around.

I'd be a bit concerned about this structure:
+        if line is None or line == unicode(''):
+            break       # end of file

I think there are Py_UNICODE accessors around, and you can do similar
things, like:

cdef Py_UNICODE *buf
cdef Py_ssize_t buf_len

buf = PyUnicode_AS_UNICODE(line)
buf_len = PyUnicode_GET_SIZE(line)

etc.
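
and then the end-of-file / end-of-stanza checks can stay at the C level,
something like this (untested sketch; the helper name is made up, and you
still want the existing 'line is None' check before calling it):

cdef extern from "Python.h":
    ctypedef int Py_UNICODE    # exact width doesn't matter for this sketch
    int PyUnicode_CheckExact(object o)
    Py_UNICODE *PyUnicode_AS_UNICODE(object o)
    Py_ssize_t PyUnicode_GET_SIZE(object o)


cdef int _is_stanza_break(object line) except -1:
    # Return 1 for an empty line (end of file) or a bare newline
    # (end of stanza), without building any new unicode objects.
    cdef Py_UNICODE *buf
    cdef Py_ssize_t buf_len
    if not PyUnicode_CheckExact(line):
        raise TypeError("%r is not a unicode line" % line)
    buf = PyUnicode_AS_UNICODE(line)
    buf_len = PyUnicode_GET_SIZE(line)
    if buf_len == 0:
        return 1
    if buf_len == 1 and buf[0] == c'\n':
        return 1
    return 0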

I think we'll want to optimize 'read_stanza_unicode' specifically,
because it will affect any nested structures, like the revision
properties. That also happens to be where per-file commit info is
stored for MySQL, which is going to be one of the big performance bits.
(And it is what showed the problem with the old 'accum_value += value'.)

John
=:->