[MERGE] get_record_stream().get_bytes_as('chunked')

Thu Dec 11 10:46:29 GMT 2008

>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:

<snip/>

    jam> At first, I thought I might be able to get away with
    jam> just a python implementation, but the best I could do
    jam> was 2.5ms for NEWS, while a compiled version was 0.45ms.

But on the overall is the python version still a win or a loss ?
Said otherwise: is it a win for C extensions users but a loss for
python ones or a win for both ?

    jam> I decided to work on this right now, because I found
    jam> that "repo.revision_trees(revs[:100])" was consuming
    jam> 300MB of RAM just for the byte strings of the inventory
    jam> texts. With this patch, the peak memory consumption is
    jam> just 59MB.

That's a net win which could justify a time loss.

<snip/>

    jam> +def chunks_to_lines(chunks):
    jam> +    """Ensure that chunks is split cleanly into lines.

Shouldn't that be: 'Ensure that chunks is split cleanly into
lines or split them otherwise'.

    jam> +
    jam> +
    jam> +    Each entry in the result should contain a single newline at the end. Except
    jam> +    for the last entry which may not have a final newline.
    jam> +
    jam> +    :param chunks: An list/tuple of strings. If chunks is already a list of
    jam> +        lines, then we will return it as-is.
    jam> +    :return: A list of strings.
    jam> +    """
    jam> +    # Optimize for a very common case when chunks are already lines

I'm not sure I understand the overall intent of the patch
here. It seems that storage_kind 'chunked' is used for both lines
and chunks and that you then need to check whether or not these
chunks are really lines. Why not make the distinction between
'chunks' and 'lines' as storage_kind and use the appropriate
get_lines or get_chunks then ?

    jam> +    def fail():
    jam> +        raise IndexError
    jam> +    try:
    jam> +        # This is a bit ugly, but is the fastest way to check if all of the
    jam> +        # chunks are individual lines.
    jam> +        # You can't use function calls like .count(), .index(), or endswith()
    jam> +        # because they incur too much python overhead.
    jam> +        # It works because
    jam> +        #   if chunk is an empty string, it will raise IndexError, which will
    jam> +        #       be caught.
    jam> +        #   if chunk doesn't end with '\n' then we hit fail()
    jam> +        #   if there is more than one '\n' then we hit fail()
    jam> +        # timing shows this loop to take 2.58ms rather than 3.18ms for
    jam> +        # split_lines(''.join(chunks))
    jam> +        # Further, it means we get to preserve the original lines, rather than
    jam> +        # expanding memory
    jam> +        if not chunks:
    jam> +            return chunks
    jam> +        [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()

Nit-picking: Since that is the only call to fail(), why not
inlining it ?

    jam> +         for chunk in chunks[:-1]]
    jam> +        last = chunks[-1]
    jam> +        if last and '\n' not in last[:-1]:
    jam> +            return chunks
    jam> +    except IndexError:
    jam> +        pass
    jam> +    from bzrlib.osutils import split_lines
    jam> +    return split_lines(''.join(chunks))

    jam> === added file 'bzrlib/_chunks_to_lines_pyx.pyx'
    jam> --- bzrlib/_chunks_to_lines_pyx.pyx	1970-01-01 00:00:00 +0000
    jam> +++ bzrlib/_chunks_to_lines_pyx.pyx	2008-12-11 03:08:03 +0000
    jam> @@ -0,0 +1,66 @@
    jam> +# Copyright (C) 2008 Canonical Ltd
    jam> +#
    jam> +# This program is free software; you can redistribute it and/or modify
    jam> +# it under the terms of the GNU General Public License as published by
    jam> +# the Free Software Foundation; either version 2 of the License, or
    jam> +# (at your option) any later version.
    jam> +#
    jam> +# This program is distributed in the hope that it will be useful,
    jam> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
    jam> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    jam> +# GNU General Public License for more details.
    jam> +#
    jam> +# You should have received a copy of the GNU General Public License
    jam> +# along with this program; if not, write to the Free Software
    jam> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
    jam> +#
    jam> +
    jam> +"""Pyrex extensions for converting chunks to lines."""
    jam> +
    jam> +#python2.4 support
    jam> +cdef extern from "python-compat.h":
    jam> +    pass
    jam> +
    jam> +cdef extern from "stdlib.h":
    jam> +    ctypedef unsigned size_t
    jam> +
    jam> +cdef extern from "Python.h":
    jam> +    ctypedef int Py_ssize_t # Required for older pyrex versions
    jam> +    ctypedef struct PyObject:
    jam> +        pass
    jam> +    int PyList_Append(object lst, object item) except -1
    jam> +
    jam> +    char *PyString_AsString(object p) except NULL
    jam> +    int PyString_AsStringAndSize(object s, char **buf, Py_ssize_t *len) except -1
    jam> +
    jam> +cdef extern from "string.h":
    jam> +    void *memchr(void *s, int c, size_t n)
    jam> +
    jam> +
    jam> +def chunks_to_lines(chunks):
    jam> +    cdef char *c_str
    jam> +    cdef char *newline
    jam> +    cdef char *c_last
    jam> +    cdef Py_ssize_t the_len
    jam> +    cdef Py_ssize_t chunks_len
    jam> +    cdef Py_ssize_t cur
    jam> +
    jam> +    # Check to see if the chunks are already lines
    jam> +    chunks_len = len(chunks)
    jam> +    if chunks_len == 0:
    jam> +        return chunks
    jam> +    cur = 0
    jam> +    for chunk in chunks:
    jam> +        cur += 1
    jam> +        PyString_AsStringAndSize(chunk, &c_str, &the_len)
    jam> +        if the_len == 0:
    jam> +            break
    jam> +        c_last = c_str + the_len - 1
    jam> +        newline = <char *>memchr(c_str, c'\n', the_len)
    jam> +        if newline != c_last and not (newline == NULL and cur == chunks_len):

Why don't you keep that construct in python ?

    jam> +    
    jam> +            break else: return chunks

Nit-picking: break that line into three, the 'return chunks' part
is the most important line of the function, don't hide it.

BB:tweak

        Vincent