[MERGE] get_record_stream().get_bytes_as('chunked')
John Arbash Meinel
john at arbash-meinel.com
Thu Dec 11 14:06:40 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
>
...
Sorry, I hit send too early.
> jam> + def fail():
> jam> + raise IndexError
> jam> + try:
> jam> + # This is a bit ugly, but is the fastest way to check if all of the
> jam> + # chunks are individual lines.
> jam> + # You can't use function calls like .count(), .index(), or endswith()
> jam> + # because they incur too much python overhead.
> jam> + # It works because
> jam> + # if chunk is an empty string, it will raise IndexError, which will
> jam> + # be caught.
> jam> + # if chunk doesn't end with '\n' then we hit fail()
> jam> + # if there is more than one '\n' then we hit fail()
> jam> + # timing shows this loop to take 2.58ms rather than 3.18ms for
> jam> + # split_lines(''.join(chunks))
> jam> + # Further, it means we get to preserve the original lines, rather than
> jam> + # expanding memory
> jam> + if not chunks:
> jam> + return chunks
> jam> + [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()
>
> Nit-picking: Since that is the only call to fail(), why not
> inlining it ?
I did try that, but you can't have a "raise Foo" in a list
comprehension. You can only have declarative statements.
File "C:\Users\jameinel\dev\bzr\work\bzrlib\_chunks_to_lines_py.py",
line 49
[(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or raise IndexError()
^
SyntaxError: invalid syntax
Since we have an even faster extension, though, I'm definitely
considering going back to our rule of "make the python one
understandable, make the pyrex/c one fast".
...
> jam> +def chunks_to_lines(chunks):
> jam> + cdef char *c_str
> jam> + cdef char *newline
> jam> + cdef char *c_last
> jam> + cdef Py_ssize_t the_len
> jam> + cdef Py_ssize_t chunks_len
> jam> + cdef Py_ssize_t cur
> jam> +
> jam> + # Check to see if the chunks are already lines
> jam> + chunks_len = len(chunks)
> jam> + if chunks_len == 0:
> jam> + return chunks
> jam> + cur = 0
> jam> + for chunk in chunks:
> jam> + cur += 1
> jam> + PyString_AsStringAndSize(chunk, &c_str, &the_len)
> jam> + if the_len == 0:
> jam> + break
> jam> + c_last = c_str + the_len - 1
> jam> + newline = <char *>memchr(c_str, c'\n', the_len)
> jam> + if newline != c_last and not (newline == NULL and cur == chunks_len):
>
> Why don't you keep that construct in python ?
What construct is that? I basically kept iterating the python one until
I could find the fastest ops I could. Then I did a quick test pyrex
implementation, and realized it was 5x faster than the best python I
could write, so I spent more time on it.
I would be willing to go back and rewrite the python one if you prefer.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAklBHm8ACgkQJdeBCYSNAAMY7ACgxTrmfhb0eDGBk5HtpgWIuZVY
DpUAnjVesSjbrABC0sSs4ajs1jtbfo5f
=W+Xj
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list