[MERGE] get_record_stream().get_bytes_as('chunked')
Vincent Ladeuil
v.ladeuil+lp at free.fr
Thu Dec 11 10:46:29 GMT 2008
>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
<snip/>
jam> At first, I thought I might be able to get away with
jam> just a python implementation, but the best I could do
jam> was 2.5ms for NEWS, while a compiled version was 0.45ms.
But on the overall is the python version still a win or a loss ?
Said otherwise: is it a win for C extensions users but a loss for
python ones or a win for both ?
jam> I decided to work on this right now, because I found
jam> that "repo.revision_trees(revs[:100])" was consuming
jam> 300MB of RAM just for the byte strings of the inventory
jam> texts. With this patch, the peak memory consumption is
jam> just 59MB.
That's a net win which could justify a time loss.
<snip/>
jam> +def chunks_to_lines(chunks):
jam> + """Ensure that chunks is split cleanly into lines.
Shouldn't that be: 'Ensure that chunks is split cleanly into
lines or split them otherwise'.
jam> +
jam> +
jam> + Each entry in the result should contain a single newline at the end. Except
jam> + for the last entry which may not have a final newline.
jam> +
jam> + :param chunks: An list/tuple of strings. If chunks is already a list of
jam> + lines, then we will return it as-is.
jam> + :return: A list of strings.
jam> + """
jam> + # Optimize for a very common case when chunks are already lines
I'm not sure I understand the overall intent of the patch
here. It seems that storage_kind 'chunked' is used for both lines
and chunks and that you then need to check whether or not these
chunks are really lines. Why not make the distinction between
'chunks' and 'lines' as storage_kind and use the appropriate
get_lines or get_chunks then ?
jam> + def fail():
jam> + raise IndexError
jam> + try:
jam> + # This is a bit ugly, but is the fastest way to check if all of the
jam> + # chunks are individual lines.
jam> + # You can't use function calls like .count(), .index(), or endswith()
jam> + # because they incur too much python overhead.
jam> + # It works because
jam> + # if chunk is an empty string, it will raise IndexError, which will
jam> + # be caught.
jam> + # if chunk doesn't end with '\n' then we hit fail()
jam> + # if there is more than one '\n' then we hit fail()
jam> + # timing shows this loop to take 2.58ms rather than 3.18ms for
jam> + # split_lines(''.join(chunks))
jam> + # Further, it means we get to preserve the original lines, rather than
jam> + # expanding memory
jam> + if not chunks:
jam> + return chunks
jam> + [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()
Nit-picking: Since that is the only call to fail(), why not
inlining it ?
jam> + for chunk in chunks[:-1]]
jam> + last = chunks[-1]
jam> + if last and '\n' not in last[:-1]:
jam> + return chunks
jam> + except IndexError:
jam> + pass
jam> + from bzrlib.osutils import split_lines
jam> + return split_lines(''.join(chunks))
jam> === added file 'bzrlib/_chunks_to_lines_pyx.pyx'
jam> --- bzrlib/_chunks_to_lines_pyx.pyx 1970-01-01 00:00:00 +0000
jam> +++ bzrlib/_chunks_to_lines_pyx.pyx 2008-12-11 03:08:03 +0000
jam> @@ -0,0 +1,66 @@
jam> +# Copyright (C) 2008 Canonical Ltd
jam> +#
jam> +# This program is free software; you can redistribute it and/or modify
jam> +# it under the terms of the GNU General Public License as published by
jam> +# the Free Software Foundation; either version 2 of the License, or
jam> +# (at your option) any later version.
jam> +#
jam> +# This program is distributed in the hope that it will be useful,
jam> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
jam> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
jam> +# GNU General Public License for more details.
jam> +#
jam> +# You should have received a copy of the GNU General Public License
jam> +# along with this program; if not, write to the Free Software
jam> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
jam> +#
jam> +
jam> +"""Pyrex extensions for converting chunks to lines."""
jam> +
jam> +#python2.4 support
jam> +cdef extern from "python-compat.h":
jam> + pass
jam> +
jam> +cdef extern from "stdlib.h":
jam> + ctypedef unsigned size_t
jam> +
jam> +cdef extern from "Python.h":
jam> + ctypedef int Py_ssize_t # Required for older pyrex versions
jam> + ctypedef struct PyObject:
jam> + pass
jam> + int PyList_Append(object lst, object item) except -1
jam> +
jam> + char *PyString_AsString(object p) except NULL
jam> + int PyString_AsStringAndSize(object s, char **buf, Py_ssize_t *len) except -1
jam> +
jam> +cdef extern from "string.h":
jam> + void *memchr(void *s, int c, size_t n)
jam> +
jam> +
jam> +def chunks_to_lines(chunks):
jam> + cdef char *c_str
jam> + cdef char *newline
jam> + cdef char *c_last
jam> + cdef Py_ssize_t the_len
jam> + cdef Py_ssize_t chunks_len
jam> + cdef Py_ssize_t cur
jam> +
jam> + # Check to see if the chunks are already lines
jam> + chunks_len = len(chunks)
jam> + if chunks_len == 0:
jam> + return chunks
jam> + cur = 0
jam> + for chunk in chunks:
jam> + cur += 1
jam> + PyString_AsStringAndSize(chunk, &c_str, &the_len)
jam> + if the_len == 0:
jam> + break
jam> + c_last = c_str + the_len - 1
jam> + newline = <char *>memchr(c_str, c'\n', the_len)
jam> + if newline != c_last and not (newline == NULL and cur == chunks_len):
Why don't you keep that construct in python ?
jam> +
jam> + break else: return chunks
Nit-picking: break that line into three, the 'return chunks' part
is the most important line of the function, don't hide it.
BB:tweak
Vincent
More information about the bazaar
mailing list