[MERGE] get_record_stream().get_bytes_as('chunked')

Thu Dec 11 14:00:04 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
> 
> <snip/>
> 
>     jam> At first, I thought I might be able to get away with
>     jam> just a python implementation, but the best I could do
>     jam> was 2.5ms for NEWS, while a compiled version was 0.45ms.
> 
> But on the overall is the python version still a win or a loss ?
> Said otherwise: is it a win for C extensions users but a loss for
> python ones or a win for both ?
> 
>     jam> I decided to work on this right now, because I found
>     jam> that "repo.revision_trees(revs[:100])" was consuming
>     jam> 300MB of RAM just for the byte strings of the inventory
>     jam> texts. With this patch, the peak memory consumption is
>     jam> just 59MB.
> 
> That's a net win which could justify a time loss.
> 

I would guess it is a net win in all cases. The old code did:
   fulltext = ''.join(lines)
and then later did
  lines = osutils.split_lines(fulltext)

The new code does "lines = osutils.split_lines(''.join(chunks))" if it
doesn't think the chunks are already lines.

> <snip/>
> 
>     jam> +def chunks_to_lines(chunks):
>     jam> +    """Ensure that chunks is split cleanly into lines.
> 
> Shouldn't that be: 'Ensure that chunks is split cleanly into
> lines or split them otherwise'.
> 

I suppose. I was trying to say that we would produce lines, and if they
happened to already be then we would re-use them. I'm not entirely sure
if I should keep my somewhat hackish python implementation, rather than
just always having it return "split_lines(''.join(chunks))". It will be
a bit of a memory hog, but it is honestly far faster than I thought it
would be.

>     jam> +
>     jam> +
>     jam> +    Each entry in the result should contain a single newline at the end. Except
>     jam> +    for the last entry which may not have a final newline.
>     jam> +
>     jam> +    :param chunks: An list/tuple of strings. If chunks is already a list of
>     jam> +        lines, then we will return it as-is.
>     jam> +    :return: A list of strings.
>     jam> +    """
>     jam> +    # Optimize for a very common case when chunks are already lines
> 
> 
> I'm not sure I understand the overall intent of the patch
> here. It seems that storage_kind 'chunked' is used for both lines
> and chunks and that you then need to check whether or not these
> chunks are really lines. Why not make the distinction between
> 'chunks' and 'lines' as storage_kind and use the appropriate
> get_lines or get_chunks then ?

I would like us to get away from our "lines" based apis. I think it has
hung us up in various places, because it doesn't let us express what we
already have. "fulltext" has the same problem. The idea is that "chunks"
can be trivially both "fulltext" and "lines", and something in between.
So you have fewer times when you need to do "''.join(lines)" or
"split_lines(fulltext)".

For example, say you had a compressor that didn't pay attention to line
endings. It could return both less-than-line and more-than-one-line hunks.

This code also lets us write a different chunks_to_lines implementation,
that can re-use as much of the source as possible. So if you had:
   ["foo\n", "bar\nbaz\n", "bing\n"]

It could preserve the first and third strings, and only split the middle
one. I didn't bother implementing that yet, because we don't have data
that can use it. But it is something that I would like us to be able to do.

> 
>     jam> +    def fail():
>     jam> +        raise IndexError
>     jam> +    try:
>     jam> +        # This is a bit ugly, but is the fastest way to check if all of the
>     jam> +        # chunks are individual lines.
>     jam> +        # You can't use function calls like .count(), .index(), or endswith()
>     jam> +        # because they incur too much python overhead.
>     jam> +        # It works because
>     jam> +        #   if chunk is an empty string, it will raise IndexError, which will
>     jam> +        #       be caught.
>     jam> +        #   if chunk doesn't end with '\n' then we hit fail()
>     jam> +        #   if there is more than one '\n' then we hit fail()
>     jam> +        # timing shows this loop to take 2.58ms rather than 3.18ms for
>     jam> +        # split_lines(''.join(chunks))
>     jam> +        # Further, it means we get to preserve the original lines, rather than
>     jam> +        # expanding memory
>     jam> +        if not chunks:
>     jam> +            return chunks
>     jam> +        [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()
> 
> Nit-picking: Since that is the only call to fail(), why not
> inlining it 

> 
>     jam> +         for chunk in chunks[:-1]]
>     jam> +        last = chunks[-1]
>     jam> +        if last and '\n' not in last[:-1]:
>     jam> +            return chunks
>     jam> +    except IndexError:
>     jam> +        pass
>     jam> +    from bzrlib.osutils import split_lines
>     jam> +    return split_lines(''.join(chunks))
> 
>     jam> === added file 'bzrlib/_chunks_to_lines_pyx.pyx'
>     jam> --- bzrlib/_chunks_to_lines_pyx.pyx	1970-01-01 00:00:00 +0000
>     jam> +++ bzrlib/_chunks_to_lines_pyx.pyx	2008-12-11 03:08:03 +0000
>     jam> @@ -0,0 +1,66 @@
>     jam> +# Copyright (C) 2008 Canonical Ltd
>     jam> +#
>     jam> +# This program is free software; you can redistribute it and/or modify
>     jam> +# it under the terms of the GNU General Public License as published by
>     jam> +# the Free Software Foundation; either version 2 of the License, or
>     jam> +# (at your option) any later version.
>     jam> +#
>     jam> +# This program is distributed in the hope that it will be useful,
>     jam> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>     jam> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>     jam> +# GNU General Public License for more details.
>     jam> +#
>     jam> +# You should have received a copy of the GNU General Public License
>     jam> +# along with this program; if not, write to the Free Software
>     jam> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
>     jam> +#
>     jam> +
>     jam> +"""Pyrex extensions for converting chunks to lines."""
>     jam> +
>     jam> +#python2.4 support
>     jam> +cdef extern from "python-compat.h":
>     jam> +    pass
>     jam> +
>     jam> +cdef extern from "stdlib.h":
>     jam> +    ctypedef unsigned size_t
>     jam> +
>     jam> +cdef extern from "Python.h":
>     jam> +    ctypedef int Py_ssize_t # Required for older pyrex versions
>     jam> +    ctypedef struct PyObject:
>     jam> +        pass
>     jam> +    int PyList_Append(object lst, object item) except -1
>     jam> +
>     jam> +    char *PyString_AsString(object p) except NULL
>     jam> +    int PyString_AsStringAndSize(object s, char **buf, Py_ssize_t *len) except -1
>     jam> +
>     jam> +cdef extern from "string.h":
>     jam> +    void *memchr(void *s, int c, size_t n)
>     jam> +
>     jam> +
>     jam> +def chunks_to_lines(chunks):
>     jam> +    cdef char *c_str
>     jam> +    cdef char *newline
>     jam> +    cdef char *c_last
>     jam> +    cdef Py_ssize_t the_len
>     jam> +    cdef Py_ssize_t chunks_len
>     jam> +    cdef Py_ssize_t cur
>     jam> +
>     jam> +    # Check to see if the chunks are already lines
>     jam> +    chunks_len = len(chunks)
>     jam> +    if chunks_len == 0:
>     jam> +        return chunks
>     jam> +    cur = 0
>     jam> +    for chunk in chunks:
>     jam> +        cur += 1
>     jam> +        PyString_AsStringAndSize(chunk, &c_str, &the_len)
>     jam> +        if the_len == 0:
>     jam> +            break
>     jam> +        c_last = c_str + the_len - 1
>     jam> +        newline = <char *>memchr(c_str, c'\n', the_len)
>     jam> +        if newline != c_last and not (newline == NULL and cur == chunks_len):
> 
> Why don't you keep that construct in python ?
> 
>     jam> +    
>     jam> +            break else: return chunks
> 
> Nit-picking: break that line into three, the 'return chunks' part
> is the most important line of the function, don't hide it.
> 
> BB:tweak
> 
>         Vincent
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAklBHOQACgkQJdeBCYSNAAMzDQCeJOR+EcBDbog4w4/i9SIXGwJ0
zn0AoI7HJVp2hP5ZOYND97bLrP0bAqJz
=Mr7Z
-----END PGP SIGNATURE-----