[MERGE] get_record_stream().get_bytes_as('chunked')

Thu Dec 11 14:06:40 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
> 

...

Sorry, I hit send too early.

>     jam> +    def fail():
>     jam> +        raise IndexError
>     jam> +    try:
>     jam> +        # This is a bit ugly, but is the fastest way to check if all of the
>     jam> +        # chunks are individual lines.
>     jam> +        # You can't use function calls like .count(), .index(), or endswith()
>     jam> +        # because they incur too much python overhead.
>     jam> +        # It works because
>     jam> +        #   if chunk is an empty string, it will raise IndexError, which will
>     jam> +        #       be caught.
>     jam> +        #   if chunk doesn't end with '\n' then we hit fail()
>     jam> +        #   if there is more than one '\n' then we hit fail()
>     jam> +        # timing shows this loop to take 2.58ms rather than 3.18ms for
>     jam> +        # split_lines(''.join(chunks))
>     jam> +        # Further, it means we get to preserve the original lines, rather than
>     jam> +        # expanding memory
>     jam> +        if not chunks:
>     jam> +            return chunks
>     jam> +        [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()
> 
> Nit-picking: since that is the only call to fail(), why not
> inline it?

I did try that, but you can't have a "raise Foo" in a list
comprehension. You can only have expressions there, and raise is a
statement:
  File "C:\Users\jameinel\dev\bzr\work\bzrlib\_chunks_to_lines_py.py", line 49
     [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or raise IndexError()
                                                            ^
 SyntaxError: invalid syntax
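
To illustrate the point, here is a minimal sketch of why the helper is
needed. The names _fail, all_single_lines and raise_is_not_an_expression
are hypothetical and are not taken from the patch; the comprehension
itself mirrors the one quoted above.

```python
def _fail():
    # Hypothetical helper: wraps the raise statement in a callable so it can
    # be used where only an expression is allowed.
    raise IndexError


def all_single_lines(chunks):
    """Return True if every chunk is exactly one line ending in '\\n'."""
    try:
        # chunk[-1] raises IndexError on its own for an empty chunk; the
        # 'or _fail()' branch raises it for any other malformed chunk.
        [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or _fail()
         for chunk in chunks]
    except IndexError:
        return False
    return True


def raise_is_not_an_expression():
    """Return True if inlining 'raise' in a comprehension fails to compile."""
    try:
        compile("[x or raise IndexError() for x in [0]]", "<sketch>", "eval")
    except SyntaxError:
        return True
    return False
```

Calling a function is an expression, so the interpreter accepts it in
the comprehension, while the bare raise is rejected at compile time.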

Since we have an even faster extension, though, I'm definitely
considering going back to our rule of "make the python one
understandable, make the pyrex/c one fast".
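
As a rough idea of what an "understandable" Python version could look
like, here is a sketch with a plain loop. It is not the code from the
patch. It assumes, based on the Pyrex check quoted below, that the last
chunk may lack a trailing newline, and _split_lines is a hypothetical
stand-in for split_lines that splits on '\n' only.

```python
def _split_lines(text):
    # Hypothetical stand-in for split_lines(): split on '\n' only and keep
    # the terminator on every line that had one.
    parts = text.split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines


def chunks_to_lines(chunks):
    """Return chunks unchanged if each one is already a single line.

    Otherwise join everything and split it into lines again.
    """
    last = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        if not chunk:
            break  # an empty chunk is not a line
        pos = chunk.find('\n')
        if pos == len(chunk) - 1:
            continue  # exactly one newline, at the very end
        if pos == -1 and index == last:
            continue  # assumption: the final line may be unterminated
        break  # embedded newline, or a missing newline mid-list
    else:
        # No break: every chunk is a line, so reuse the original list.
        return chunks
    return _split_lines(''.join(chunks))
```

The for/else keeps the fast path obvious: the original list is returned
untouched unless some chunk fails the check.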

...

>     jam> +def chunks_to_lines(chunks):
>     jam> +    cdef char *c_str
>     jam> +    cdef char *newline
>     jam> +    cdef char *c_last
>     jam> +    cdef Py_ssize_t the_len
>     jam> +    cdef Py_ssize_t chunks_len
>     jam> +    cdef Py_ssize_t cur
>     jam> +
>     jam> +    # Check to see if the chunks are already lines
>     jam> +    chunks_len = len(chunks)
>     jam> +    if chunks_len == 0:
>     jam> +        return chunks
>     jam> +    cur = 0
>     jam> +    for chunk in chunks:
>     jam> +        cur += 1
>     jam> +        PyString_AsStringAndSize(chunk, &c_str, &the_len)
>     jam> +        if the_len == 0:
>     jam> +            break
>     jam> +        c_last = c_str + the_len - 1
>     jam> +        newline = <char *>memchr(c_str, c'\n', the_len)
>     jam> +        if newline != c_last and not (newline == NULL and cur == chunks_len):
> 
> Why don't you keep that construct in Python?

Which construct do you mean? I basically kept iterating on the Python
version until I had found the fastest operations I could. Then I wrote a
quick test Pyrex implementation, realized it was 5x faster than the best
Python I could write, and spent more time on it.
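
For anyone who wants to repeat that kind of comparison, a small timeit
harness along these lines should do. This is only a sketch: the function
names are hypothetical, the two candidates are simplified versions of
the approaches discussed here rather than the patch code, and the
numbers will depend on the machine and the data.

```python
import timeit


def join_and_split(chunks):
    # Baseline: always rebuild the text and split it on '\n' again.
    parts = ''.join(chunks).split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines


def check_then_reuse(chunks):
    # Candidate: hand back the original list when every chunk is one line.
    for chunk in chunks:
        if not chunk or chunk.find('\n') != len(chunk) - 1:
            return join_and_split(chunks)
    return chunks


def time_both(chunks, number=100):
    """Return the seconds each approach takes over `number` calls."""
    return {
        'join_and_split': timeit.timeit(
            lambda: join_and_split(chunks), number=number),
        'check_then_reuse': timeit.timeit(
            lambda: check_then_reuse(chunks), number=number),
    }
```

Something like time_both(['line %d\n' % i for i in range(10000)]) gives
a dict of timings to compare.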

I would be willing to go back and rewrite the python one if you prefer.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAklBHm8ACgkQJdeBCYSNAAMY7ACgxTrmfhb0eDGBk5HtpgWIuZVY
DpUAnjVesSjbrABC0sSs4ajs1jtbfo5f
=W+Xj
-----END PGP SIGNATURE-----