[MERGE] Pyrex RIO implementation

Thu May 14 21:53:21 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> value = PyUnicode_DecodeUTF8(buffer, bytes_used-1, "strict")
>>
>> and you don't have to worry about any of the list append calls, or
>> decoding bits as you go, etc.
>>
>> You also get to *re-use* this buffer for every key:value pair, since it
>> is only used until you finish.
> I've implemented this, too.
> 
>> You unfortunately don't get to share this buffer between Stanzas. You
>> *could*, though I'd want to make sure to memory cap it to something
>> reasonable, so that decoding a giant Revision text won't leave us with
>> lots of wasted memory just sitting around.
> I think the current improvements make things fast enough. Doing a
> malloc (there'll usually just be one or two per revision) doesn't seem
> like a huge problem to me.
> 

As you said in the other message:
> With this patch, parsing all revision texts from bzr.dev now takes
> ~1.35s with the XML serializer and ~0.91s with the RIO one.

0.91s certainly brings it out of the critical path for 'bzr log' time,
though I'm curious where the remaining time is. (It may be in Stanza =>
Revision, for example.)

>> I think we'll want to optimize the 'read_stanza_unicode' specifically,
>> because it will effect any nested structures, like the revision
>> properties, which also happens to be where per-file commit info is
>> stored for MySQL, which is going to be one of the big performance bits.
>> (And is what showed the problem with the old 'accum_value += value')
> I've done this as well, optimizing it similarly to the way is already 
> done for utf8 strings.
> 
> Cheers,
> 
> Jelmer
> 

+            while accum_len + c_len > accum_size:
+                accum_size = (accum_size * 3) / 2
+                accum_value = <char *>realloc(accum_value, accum_size)
+                if accum_value == NULL:
+                    raise MemoryError

^- I'm thinking that you don't really need to call realloc on each pass.
Possibly just add:
if accum_size < accum_len + c_len:
  accum_size = (accum_len + c_len)

Also, should we be freeing the old 'accum_value' in the case that the
realloc fails? I think you handle this in the 'finally' with the problem
that it has already been set to NULL.

So we need:

new_accum_value = <char *>realloc(...)
if new_accum_value == NULL:
  raise
accum_value = new_accum_value

+                accum_len += c_len-1;

^- Unfortunately older versions of Pyrex don't support '+=' '|=' etc. So
we need to rewrite these as
accum_len = accum_len + c_len - 1

Also, you don't need ';' at the end of lines...

+        # TODO: jam 20060922 This code should raise real errors rather than
+        #       using 'assert' to process user input, or raising ValueError
+        #       rather than a more specific error.

^- These TODO statements are no longer relevant.

if Py_UNICODE_ISLINEBREAK(c_line[0]):
^- I would guess this supports the X number of linebreak variants that
Unicode defines, while we probably want to be a bit stricter and only
support '\n'. I'm pretty sure you can just use c_line[0] == c'\n' here.

+            if Py_UNICODE_ISSPACE(c_line[0]): # continues previous value,

^- Again, unicode code points line up with Ascii, so we can still do
c_line[0] == c'\t'

And same changes for realloc, in the Unicode path.

+class TestValidTag(tests.TestCase):
+

We should also test that u"foo" gives a TypeError, and that "\xb5" is an
invalid tag.

...

+    def assertReadStanza(self, result, line_iter):
+        self.assertEquals(result, self.module._read_stanza_utf8(line_iter))
+

^- I think we want to add a check that all value fields are Unicode objects.

Otherwise it looks good to me.

BB:tweak

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoMhMEACgkQJdeBCYSNAAPh/ACgkCwAhj2sC4ttmTduy4Eq2CUb
ov0AoM3dLU4pQVlRQJJE5f6zD9B0u4LX
=QAec
-----END PGP SIGNATURE-----