Rev 3814: Merging the get_record_stream_chunked code drops peak memory noticeably. in http://bzr.arbash-meinel.com/branches/bzr/brisbane/xml_cache

Thu Dec 11 03:46:20 GMT 2008

At http://bzr.arbash-meinel.com/branches/bzr/brisbane/xml_cache

------------------------------------------------------------
revno: 3814
revision-id: john at arbash-meinel.com-20081211034550-cc2indrpb6a6rjn6
parent: john at arbash-meinel.com-20081211000604-kzutwqr3jkeez10s
parent: john at arbash-meinel.com-20081211031852-cmjpdf2ufno0okui
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: xml_cache
timestamp: Wed 2008-12-10 21:45:50 -0600
message:
  Merging the get_record_stream_chunked code drops peak memory noticeably.
  For the first 1k revisions, peak memory consumption drops by almost 100MB.
added:
  bzrlib/_chunks_to_lines_py.py  _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
  bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
  bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
modified:
  .bzrignore                     bzrignore-20050311232317-81f7b71efa2db11a
  bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
  bzrlib/merge.py                merge.py-20050513021216-953b65a438527106
  bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
  bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
  bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
  bzrlib/tests/test_osutils.py   test_osutils.py-20051201224856-e48ee24c12182989
  bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
  bzrlib/transform.py            transform.py-20060105172343-dd99e54394d91687
  bzrlib/versionedfile.py        versionedfile.py-20060222045106-5039c71ee3b65490
  bzrlib/weave.py                knit.py-20050627021749-759c29984154256b
  setup.py                       setup.py-20050314065409-02f8a0a6e3f9bc70
    ------------------------------------------------------------
    revno: 3735.139.17
    revision-id: john at arbash-meinel.com-20081211031852-cmjpdf2ufno0okui
    parent: john at arbash-meinel.com-20081211030803-gctunob7zsten3qg
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 21:18:52 -0600
    message:
      Start using osutils.chunks_as_lines rather than osutils.split_lines.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/merge.py                merge.py-20050513021216-953b65a438527106
      bzrlib/transform.py            transform.py-20060105172343-dd99e54394d91687
      bzrlib/versionedfile.py        versionedfile.py-20060222045106-5039c71ee3b65490
      bzrlib/weave.py                knit.py-20050627021749-759c29984154256b
    ------------------------------------------------------------
    revno: 3735.139.16
    revision-id: john at arbash-meinel.com-20081211030803-gctunob7zsten3qg
    parent: john at arbash-meinel.com-20081211021859-3ds8cwdqiq387t83
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 21:08:03 -0600
    message:
      Move everything into properly parameterized tests.
      
      Also add tests that we preserve the object when it is already lines.
      
      The compiled form takes 450us on a 7.6k line file (NEWS).
      So for common cases, we should have virtually no overhead.
    added:
      bzrlib/_chunks_to_lines_py.py  _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
      bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
    modified:
      bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
      bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
      bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
      bzrlib/tests/test_osutils.py   test_osutils.py-20051201224856-e48ee24c12182989
    ------------------------------------------------------------
    revno: 3735.139.15
    revision-id: john at arbash-meinel.com-20081211021859-3ds8cwdqiq387t83
    parent: john at arbash-meinel.com-20081211020207-rrgdcyqc344zo5q1
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 20:18:59 -0600
    message:
      A Pyrex extension is about 5x faster than the fastest python code I could write.
      
      Seems worth having after all.
    added:
      bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
    modified:
      .bzrignore                     bzrignore-20050311232317-81f7b71efa2db11a
      bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
      setup.py                       setup.py-20050314065409-02f8a0a6e3f9bc70
    ------------------------------------------------------------
    revno: 3735.139.14
    revision-id: john at arbash-meinel.com-20081211020207-rrgdcyqc344zo5q1
    parent: john at arbash-meinel.com-20081211011419-vqtdjgpa04woqvm4
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 20:02:07 -0600
    message:
      Change name to 'chunks_to_lines', and find an optimized form.
      
      It is a little bit ugly, but it is faster than join & split, and means
      we get to leave the strings untouched.
    modified:
      bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
      bzrlib/tests/test_osutils.py   test_osutils.py-20051201224856-e48ee24c12182989
    ------------------------------------------------------------
    revno: 3735.139.13
    revision-id: john at arbash-meinel.com-20081211011419-vqtdjgpa04woqvm4
    parent: john at arbash-meinel.com-20081211011038-osioaxd7moquxxmy
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 19:14:19 -0600
    message:
      More tests for edge cases.
    modified:
      bzrlib/tests/test_osutils.py   test_osutils.py-20051201224856-e48ee24c12182989
    ------------------------------------------------------------
    revno: 3735.139.12
    revision-id: john at arbash-meinel.com-20081211011038-osioaxd7moquxxmy
    parent: john at arbash-meinel.com-20081211010104-3tcii2strejk5252
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 19:10:38 -0600
    message:
      Add a new function that can convert 'chunks' format to a 'lines' format.
    modified:
      bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
      bzrlib/tests/test_osutils.py   test_osutils.py-20051201224856-e48ee24c12182989
    ------------------------------------------------------------
    revno: 3735.139.11
    revision-id: john at arbash-meinel.com-20081211010104-3tcii2strejk5252
    parent: john at arbash-meinel.com-20081211005616-szoqqeabcyahy39u
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 19:01:04 -0600
    message:
      Use the 'chunked' interface to keep memory consumption minimal during revision_trees()
    modified:
      bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
    ------------------------------------------------------------
    revno: 3735.139.10
    revision-id: john at arbash-meinel.com-20081211005616-szoqqeabcyahy39u
    parent: john at arbash-meinel.com-20081211005436-a8bn72zw43b1vd9r
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 18:56:16 -0600
    message:
      Change the signature to report the storage kind as 'chunked'
    modified:
      bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
      bzrlib/versionedfile.py        versionedfile.py-20060222045106-5039c71ee3b65490
    ------------------------------------------------------------
    revno: 3735.139.9
    revision-id: john at arbash-meinel.com-20081211005436-a8bn72zw43b1vd9r
    parent: pqm at pqm.ubuntu.com-20081210082822-li6ku9s3k63kjrpr
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: get_record_stream_chunked
    timestamp: Wed 2008-12-10 18:54:36 -0600
    message:
      Start working on a ChunkedContentFactory.
      
      This allows get_bytes_as('chunked') for both FulltextContentFactory,
      and for ChunkedContentFactory, as it is a trivial conversion to
      go between the two styles.
      We will also want to special case when converting 'chunked' into
      'lines'. But that is for future work.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
      bzrlib/versionedfile.py        versionedfile.py-20060222045106-5039c71ee3b65490
      bzrlib/weave.py                knit.py-20050627021749-759c29984154256b
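
The relationship between the 'chunked' and 'fulltext' storage kinds described in the log above can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchChunkedFactory is a hypothetical stand-in, not the bzrlib class, it omits key, parents and sha1, and it uses Python 3 str where bzrlib of this era used Python 2 byte strings.

```python
# Hypothetical stand-in for the ChunkedContentFactory idea; not bzrlib code.
class SketchChunkedFactory:
    storage_kind = 'chunked'

    def __init__(self, chunks):
        self._chunks = chunks

    def get_bytes_as(self, storage_kind):
        if storage_kind == 'chunked':
            # No copy is made: the caller gets the same object back.
            return self._chunks
        if storage_kind == 'fulltext':
            # Joining allocates one new string as large as the whole text.
            return ''.join(self._chunks)
        raise ValueError('unsupported storage kind: %r' % (storage_kind,))


lines = ['foo\n', 'bar\n']
factory = SketchChunkedFactory(lines)
chunked = factory.get_bytes_as('chunked')
fulltext = factory.get_bytes_as('fulltext')
```

A consumer that only needs lines can ask for 'chunked' and avoid building the joined fulltext at all, which is presumably where the memory saving mentioned in the merge message comes from.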
-------------- next part --------------
=== modified file '.bzrignore'

--- a/.bzrignore	2008-09-23 23:28:27 +0000
+++ b/.bzrignore	2008-12-11 02:18:59 +0000
@@ -39,6 +39,7 @@
 doc/**/*.html
 doc/developers/performance.png
 bzrlib/_btree_serializer_c.c
+bzrlib/_chunks_to_lines_pyx.c
 bzrlib/_dirstate_helpers_c.c
 bzrlib/_knit_load_data_c.c
 bzrlib/_readdir_pyx.c

=== added file 'bzrlib/_chunks_to_lines_py.py'
--- a/bzrlib/_chunks_to_lines_py.py	1970-01-01 00:00:00 +0000
+++ b/bzrlib/_chunks_to_lines_py.py	2008-12-11 03:08:03 +0000
@@ -0,0 +1,57 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+"""The python implementation of chunks_to_lines"""
+
+
+def chunks_to_lines(chunks):
+    """Ensure that chunks is split cleanly into lines.
+
+    Each entry in the result should end with a single newline, except for the
+    last entry, which may lack a final newline.
+
+    :param chunks: A list/tuple of strings. If chunks is already a list of
+        lines, then we will return it as-is.
+    :return: A list of strings.
+    """
+    # Optimize for a very common case when chunks are already lines
+    def fail():
+        raise IndexError
+    try:
+        # This is a bit ugly, but is the fastest way to check if all of the
+        # chunks are individual lines.
+        # You can't use function calls like .count(), .index(), or endswith()
+        # because they incur too much python overhead.
+        # It works because
+        #   if chunk is an empty string, it will raise IndexError, which will
+        #       be caught.
+        #   if chunk doesn't end with '\n' then we hit fail()
+        #   if there is more than one '\n' then we hit fail()
+        # timing shows this loop to take 2.58ms rather than 3.18ms for
+        # split_lines(''.join(chunks))
+        # Further, it means we get to preserve the original lines, rather than
+        # expanding memory
+        if not chunks:
+            return chunks
+        [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()
+         for chunk in chunks[:-1]]
+        last = chunks[-1]
+        if last and '\n' not in last[:-1]:
+            return chunks
+    except IndexError:
+        pass
+    from bzrlib.osutils import split_lines
+    return split_lines(''.join(chunks))

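The fast-path check in the Python implementation above is deliberately terse for speed. A more readable restatement of the same rule, following the loop structure of the Pyrex version, might look like the sketch below. It assumes Python 3 str input, and both chunks_to_lines_sketch and the local split_lines are illustrative stand-ins rather than the bzrlib functions; it favours clarity over the timings quoted in the comments.

```python
def split_lines(text):
    """Split on '\\n' only, keeping the newline on each line."""
    parts = text.split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines


def chunks_to_lines_sketch(chunks):
    """Return chunks unchanged if they are already lines, else re-split."""
    if not chunks:
        return chunks
    last_index = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        if not chunk:
            break  # an empty chunk is not a line
        newline_at = chunk.find('\n')
        if newline_at == len(chunk) - 1:
            continue  # exactly one newline, and it is the last character
        if newline_at == -1 and index == last_index:
            continue  # only the final line may lack a newline
        break
    else:
        # Every chunk passed the check: hand back the original object.
        return chunks
    return split_lines(''.join(chunks))


already_lines = ['foo\n', 'bar\r\n', 'ba\rz']
same = chunks_to_lines_sketch(already_lines)
resplit = chunks_to_lines_sketch(['foo\nbar', '\nbaz\n'])
```

Returning the input object itself in the already-lines case is what lets callers avoid allocating a second copy of the text.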
=== added file 'bzrlib/_chunks_to_lines_pyx.pyx'
--- a/bzrlib/_chunks_to_lines_pyx.pyx	1970-01-01 00:00:00 +0000
+++ b/bzrlib/_chunks_to_lines_pyx.pyx	2008-12-11 03:08:03 +0000
@@ -0,0 +1,66 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+
+"""Pyrex extensions for converting chunks to lines."""
+
+#python2.4 support
+cdef extern from "python-compat.h":
+    pass
+
+cdef extern from "stdlib.h":
+    ctypedef unsigned size_t
+
+cdef extern from "Python.h":
+    ctypedef int Py_ssize_t # Required for older pyrex versions
+    ctypedef struct PyObject:
+        pass
+    int PyList_Append(object lst, object item) except -1
+
+    char *PyString_AsString(object p) except NULL
+    int PyString_AsStringAndSize(object s, char **buf, Py_ssize_t *len) except -1
+
+cdef extern from "string.h":
+    void *memchr(void *s, int c, size_t n)
+
+
+def chunks_to_lines(chunks):
+    cdef char *c_str
+    cdef char *newline
+    cdef char *c_last
+    cdef Py_ssize_t the_len
+    cdef Py_ssize_t chunks_len
+    cdef Py_ssize_t cur
+
+    # Check to see if the chunks are already lines
+    chunks_len = len(chunks)
+    if chunks_len == 0:
+        return chunks
+    cur = 0
+    for chunk in chunks:
+        cur += 1
+        PyString_AsStringAndSize(chunk, &c_str, &the_len)
+        if the_len == 0:
+            break
+        c_last = c_str + the_len - 1
+        newline = <char *>memchr(c_str, c'\n', the_len)
+        if newline != c_last and not (newline == NULL and cur == chunks_len):
+            break
+    else:
+        return chunks
+
+    from bzrlib import osutils
+    return osutils.split_lines(''.join(chunks))

=== modified file 'bzrlib/knit.py'
--- a/bzrlib/knit.py	2008-12-07 16:46:56 +0000
+++ b/bzrlib/knit.py	2008-12-11 03:45:50 +0000
@@ -110,7 +110,7 @@
     adapter_registry,
     ConstantMapper,
     ContentFactory,
-    FulltextContentFactory,
+    ChunkedContentFactory,
     VersionedFile,
     VersionedFiles,
     )
@@ -196,7 +196,8 @@
             [compression_parent], 'unordered', True).next()
         if basis_entry.storage_kind == 'absent':
             raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
-        basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
+        basis_chunks = basis_entry.get_bytes_as('chunked')
+        basis_lines = osutils.chunks_to_lines(basis_chunks)
         # Manually apply the delta because we have one annotated content and
         # one plain.
         basis_content = PlainKnitContent(basis_lines, compression_parent)
@@ -229,7 +230,8 @@
             [compression_parent], 'unordered', True).next()
         if basis_entry.storage_kind == 'absent':
             raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
-        basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
+        basis_chunks = basis_entry.get_bytes_as('chunked')
+        basis_lines = osutils.chunks_to_lines(basis_chunks)
         basis_content = PlainKnitContent(basis_lines, compression_parent)
         # Manually apply the delta because we have one annotated content and
         # one plain.
@@ -276,11 +278,13 @@
     def get_bytes_as(self, storage_kind):
         if storage_kind == self.storage_kind:
             return self._raw_record
-        if storage_kind == 'fulltext' and self._knit is not None:
-            return self._knit.get_text(self.key[0])
-        else:
-            raise errors.UnavailableRepresentation(self.key, storage_kind,
-                self.storage_kind)
+        if self._knit is not None:
+            if storage_kind == 'chunked':
+                return self._knit.get_lines(self.key[0])
+            elif storage_kind == 'fulltext':
+                return self._knit.get_text(self.key[0])
+        raise errors.UnavailableRepresentation(self.key, storage_kind,
+            self.storage_kind)
 
 
 class KnitContent(object):
@@ -1025,7 +1029,7 @@
                 if record.storage_kind == 'absent':
                     continue
                 missing_keys.remove(record.key)
-                lines = split_lines(record.get_bytes_as('fulltext'))
+                lines = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
                 text_map[record.key] = lines
                 content_map[record.key] = PlainKnitContent(lines, record.key)
                 if record.key in keys:
@@ -1293,9 +1297,8 @@
                 text_map, _ = self._get_content_maps(keys, non_local)
                 for key in keys:
                     lines = text_map.pop(key)
-                    text = ''.join(lines)
-                    yield FulltextContentFactory(key, global_map[key], None,
-                                                 text)
+                    yield ChunkedContentFactory(key, global_map[key], None,
+                                                lines)
         else:
             for source, keys in source_keys:
                 if source is parent_maps[0]:
@@ -1448,6 +1451,9 @@
                         buffered = True
                 if not buffered:
                     self._index.add_records([index_entry])
+            elif record.storage_kind == 'chunked':
+                self.add_lines(record.key, parents,
+                    osutils.chunks_to_lines(record.get_bytes_as('chunked')))
             elif record.storage_kind == 'fulltext':
                 self.add_lines(record.key, parents,
                     split_lines(record.get_bytes_as('fulltext')))
@@ -2957,7 +2963,7 @@
         reannotate = annotate.reannotate
         for record in self._knit.get_record_stream(keys, 'topological', True):
             key = record.key
-            fulltext = split_lines(record.get_bytes_as('fulltext'))
+            fulltext = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
             parents = parent_map[key]
             if parents is not None:
                 parent_lines = [parent_cache[parent] for parent in parent_map[key]]

=== modified file 'bzrlib/merge.py'
--- a/bzrlib/merge.py	2008-10-10 11:55:03 +0000
+++ b/bzrlib/merge.py	2008-12-11 03:18:52 +0000
@@ -1579,7 +1579,7 @@
 
     def get_lines(self, revisions):
         """Get lines for revisions from the backing VersionedFiles.
-        
+
         :raises RevisionNotPresent: on absent texts.
         """
         keys = [(self._key_prefix + (rev,)) for rev in revisions]
@@ -1587,8 +1587,8 @@
         for record in self.vf.get_record_stream(keys, 'unordered', True):
             if record.storage_kind == 'absent':
                 raise errors.RevisionNotPresent(record.key, self.vf)
-            result[record.key[-1]] = osutils.split_lines(
-                record.get_bytes_as('fulltext'))
+            result[record.key[-1]] = osutils.chunks_to_lines(
+                record.get_bytes_as('chunked'))
         return result
 
     def plan_merge(self):

=== modified file 'bzrlib/osutils.py'
--- a/bzrlib/osutils.py	2008-10-17 03:49:08 +0000
+++ b/bzrlib/osutils.py	2008-12-11 03:08:03 +0000
@@ -812,6 +812,7 @@
             rps.append(f)
     return rps
 
+
 def joinpath(p):
     for f in p:
         if (f == '..') or (f is None) or (f == ''):
@@ -819,6 +820,12 @@
     return pathjoin(*p)
 
 
+try:
+    from bzrlib._chunks_to_lines_pyx import chunks_to_lines
+except ImportError:
+    from bzrlib._chunks_to_lines_py import chunks_to_lines
+
+
 def split_lines(s):
     """Split s into lines, but without removing the newline characters."""
     lines = s.split('\n')

=== modified file 'bzrlib/repository.py'
--- a/bzrlib/repository.py	2008-12-10 23:11:31 +0000
+++ b/bzrlib/repository.py	2008-12-11 03:45:50 +0000
@@ -1725,14 +1725,15 @@
     def _iter_inventory_xmls(self, revision_ids):
         keys = [(revision_id,) for revision_id in revision_ids]
         stream = self.inventories.get_record_stream(keys, 'unordered', True)
-        texts = {}
+        text_chunks = {}
         for record in stream:
             if record.storage_kind != 'absent':
-                texts[record.key] = record.get_bytes_as('fulltext')
+                text_chunks[record.key] = record.get_bytes_as('chunked')
             else:
                 raise errors.NoSuchRevision(self, record.key)
         for key in keys:
-            yield texts.pop(key), key[-1]
+            chunks = text_chunks.pop(key)
+            yield ''.join(chunks), key[-1]
 
     def deserialise_inventory(self, revision_id, xml):
         """Transform the xml into an inventory object. 

=== modified file 'bzrlib/tests/__init__.py'
--- a/bzrlib/tests/__init__.py	2008-12-10 23:11:31 +0000
+++ b/bzrlib/tests/__init__.py	2008-12-11 03:45:50 +0000
@@ -2790,6 +2790,7 @@
                    'bzrlib.tests.test_cache_utf8',
                    'bzrlib.tests.test_chk_map',
                    'bzrlib.tests.test_chunk_writer',
+                   'bzrlib.tests.test__chunks_to_lines',
                    'bzrlib.tests.test_commands',
                    'bzrlib.tests.test_commit',
                    'bzrlib.tests.test_commit_merge',

=== added file 'bzrlib/tests/test__chunks_to_lines.py'
--- a/bzrlib/tests/test__chunks_to_lines.py	1970-01-01 00:00:00 +0000
+++ b/bzrlib/tests/test__chunks_to_lines.py	2008-12-11 03:08:03 +0000
@@ -0,0 +1,112 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+
+"""Tests for chunks_to_lines."""
+
+from bzrlib import tests
+
+
+def load_tests(standard_tests, module, loader):
+    # parameterize all tests in this module
+    suite = loader.suiteClass()
+    applier = tests.TestScenarioApplier()
+    import bzrlib._chunks_to_lines_py as py_module
+    applier.scenarios = [('python', {'module': py_module})]
+    if CompiledChunksToLinesFeature.available():
+        import bzrlib._chunks_to_lines_pyx as c_module
+        applier.scenarios.append(('C', {'module': c_module}))
+    else:
+        # the compiled module isn't available, so we add a failing test
+        class FailWithoutFeature(tests.TestCase):
+            def test_fail(self):
+                self.requireFeature(CompiledChunksToLinesFeature)
+        suite.addTest(loader.loadTestsFromTestCase(FailWithoutFeature))
+    tests.adapt_tests(standard_tests, applier, suite)
+    return suite
+
+
+class _CompiledChunksToLinesFeature(tests.Feature):
+
+    def _probe(self):
+        try:
+            import bzrlib._chunks_to_lines_pyx
+        except ImportError:
+            return False
+        return True
+
+    def feature_name(self):
+        return 'bzrlib._chunks_to_lines_pyx'
+
+CompiledChunksToLinesFeature = _CompiledChunksToLinesFeature()
+
+
+class TestChunksToLines(tests.TestCase):
+
+    module = None # Filled in by test parameterization
+
+    def assertChunksToLines(self, lines, chunks, alreadly_lines=False):
+        result = self.module.chunks_to_lines(chunks)
+        self.assertEqual(lines, result)
+        if alreadly_lines:
+            self.assertIs(chunks, result)
+
+    def test_fulltext_chunk_to_lines(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\nbar\r\nba\rz\n'])
+        self.assertChunksToLines(['foobarbaz\n'], ['foobarbaz\n'],
+                                 alreadly_lines=True)
+
+    def test_lines_to_lines(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 alreadly_lines=True)
+
+    def test_no_final_newline(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\nbar\r\nba\rz'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\n', 'bar\r\n', 'ba\rz'],
+                                 alreadly_lines=True)
+        self.assertChunksToLines(('foo\n', 'bar\r\n', 'ba\rz'),
+                                 ('foo\n', 'bar\r\n', 'ba\rz'),
+                                 alreadly_lines=True)
+        self.assertChunksToLines([], [], alreadly_lines=True)
+        self.assertChunksToLines(['foobarbaz'], ['foobarbaz'],
+                                 alreadly_lines=True)
+        self.assertChunksToLines([], [''])
+
+    def test_mixed(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\n', 'bar\r\nba\r', 'z'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\nb', 'a', 'r\r\nba\r', 'z'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\nbar\r\nba', '\r', 'z'])
+
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\n', '', 'bar\r\nba', '\r', 'z'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\n', 'bar\r\n', 'ba\rz\n', ''])
+
+    def test_not_lines(self):
+        # We should raise a TypeError, not crash
+        self.assertRaises(TypeError, self.module.chunks_to_lines,
+                          object())
+        self.assertRaises(TypeError, self.module.chunks_to_lines,
+                          [object()])
+        self.assertRaises(TypeError, self.module.chunks_to_lines,
+                          ['foo', object()])

=== modified file 'bzrlib/tests/test_osutils.py'
--- a/bzrlib/tests/test_osutils.py	2008-10-01 07:56:03 +0000
+++ b/bzrlib/tests/test_osutils.py	2008-12-11 03:08:03 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2005, 2006, 2007 Canonical Ltd
+# Copyright (C) 2005, 2006, 2007, 2008 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -756,6 +756,23 @@
         self.assertEndsWith(osutils._mac_getcwd(), u'B\xe5gfors')
 
 
+class TestChunksToLines(TestCase):
+
+    def test_smoketest(self):
+        self.assertEqual(['foo\n', 'bar\n', 'baz\n'],
+                         osutils.chunks_to_lines(['foo\nbar', '\nbaz\n']))
+        self.assertEqual(['foo\n', 'bar\n', 'baz\n'],
+                         osutils.chunks_to_lines(['foo\n', 'bar\n', 'baz\n']))
+
+    def test_is_compiled(self):
+        from bzrlib.tests.test__chunks_to_lines import CompiledChunksToLinesFeature
+        if CompiledChunksToLinesFeature.available():
+            from bzrlib._chunks_to_lines_pyx import chunks_to_lines
+        else:
+            from bzrlib._chunks_to_lines_py import chunks_to_lines
+        self.assertIs(chunks_to_lines, osutils.chunks_to_lines)
+
+
 class TestSplitLines(TestCase):
 
     def test_split_unicode(self):

=== modified file 'bzrlib/tests/test_versionedfile.py'
--- a/bzrlib/tests/test_versionedfile.py	2008-12-07 16:46:56 +0000
+++ b/bzrlib/tests/test_versionedfile.py	2008-12-11 03:45:50 +0000
@@ -1622,8 +1622,9 @@
         """Assert that storage_kind is a valid storage_kind."""
         self.assertSubset([storage_kind],
             ['mpdiff', 'knit-annotated-ft', 'knit-annotated-delta',
-             'knit-ft', 'knit-delta', 'fulltext', 'knit-annotated-ft-gz',
-             'knit-annotated-delta-gz', 'knit-ft-gz', 'knit-delta-gz'])
+             'knit-ft', 'knit-delta', 'chunked', 'fulltext',
+             'knit-annotated-ft-gz', 'knit-annotated-delta-gz', 'knit-ft-gz',
+             'knit-delta-gz'])
 
     def capture_stream(self, f, entries, on_seen, parents):
         """Capture a stream for testing."""
@@ -1700,9 +1701,11 @@
                 [None, files.get_sha1s([factory.key])[factory.key]])
             self.assertEqual(parent_map[factory.key], factory.parents)
             # self.assertEqual(files.get_text(factory.key),
-            self.assertIsInstance(factory.get_bytes_as('fulltext'), str)
-            self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
-                str)
+            ft_bytes = factory.get_bytes_as('fulltext')
+            self.assertIsInstance(ft_bytes, str)
+            chunked_bytes = factory.get_bytes_as('chunked')
+            self.assertEqualDiff(ft_bytes, ''.join(chunked_bytes))
+
         self.assertStreamOrder(sort_order, seen, keys)
 
     def assertStreamOrder(self, sort_order, seen, keys):
@@ -2274,8 +2277,9 @@
         self._lines["A"] = ["FOO", "BAR"]
         it = self.texts.get_record_stream([("A",)], "unordered", True)
         record = it.next()
-        self.assertEquals("fulltext", record.storage_kind)
+        self.assertEquals("chunked", record.storage_kind)
         self.assertEquals("FOOBAR", record.get_bytes_as("fulltext"))
+        self.assertEquals(["FOO", "BAR"], record.get_bytes_as("chunked"))
 
     def test_get_record_stream_absent(self):
         it = self.texts.get_record_stream([("A",)], "unordered", True)

=== modified file 'bzrlib/transform.py'
--- a/bzrlib/transform.py	2008-10-28 10:31:32 +0000
+++ b/bzrlib/transform.py	2008-12-11 03:18:52 +0000
@@ -1177,7 +1177,7 @@
             if kind == 'file':
                 cur_file = open(self._limbo_name(trans_id), 'rb')
                 try:
-                    lines = osutils.split_lines(cur_file.read())
+                    lines = osutils.chunks_to_lines(cur_file.readlines())
                 finally:
                     cur_file.close()
                 parents = self._get_parents_lines(trans_id)

=== modified file 'bzrlib/versionedfile.py'
--- a/bzrlib/versionedfile.py	2008-12-07 16:46:56 +0000
+++ b/bzrlib/versionedfile.py	2008-12-11 03:45:50 +0000
@@ -59,6 +59,8 @@
     'bzrlib.knit', 'FTAnnotatedToUnannotated')
 adapter_registry.register_lazy(('knit-annotated-ft-gz', 'fulltext'),
     'bzrlib.knit', 'FTAnnotatedToFullText')
+# adapter_registry.register_lazy(('knit-annotated-ft-gz', 'chunked'),
+#     'bzrlib.knit', 'FTAnnotatedToChunked')
 
 
 class ContentFactory(object):
@@ -84,12 +86,46 @@
         self.parents = None
 
 
+class ChunkedContentFactory(ContentFactory):
+    """Static data content factory.
+
+    This takes a 'chunked' list of strings. The only requirement on 'chunked' is
+    that ''.join(chunks) becomes a valid fulltext. A tuple of a single string
+    satisfies this, as does a list of lines.
+
+    :ivar sha1: None, or the sha1 of the content fulltext.
+    :ivar storage_kind: The native storage kind of this factory. Always
+        'chunked'
+    :ivar key: The key of this content. Each key is a tuple with a single
+        string in it.
+    :ivar parents: A tuple of parent keys for self.key. If the object has
+        no parent information, None (as opposed to () for an empty list of
+        parents).
+    """
+
+    def __init__(self, key, parents, sha1, chunks):
+        """Create a ChunkedContentFactory."""
+        self.sha1 = sha1
+        self.storage_kind = 'chunked'
+        self.key = key
+        self.parents = parents
+        self._chunks = chunks
+
+    def get_bytes_as(self, storage_kind):
+        if storage_kind == 'chunked':
+            return self._chunks
+        elif storage_kind == 'fulltext':
+            return ''.join(self._chunks)
+        raise errors.UnavailableRepresentation(self.key, storage_kind,
+            self.storage_kind)
+
+
 class FulltextContentFactory(ContentFactory):
     """Static data content factory.
 
     This takes a fulltext when created and just returns that during
     get_bytes_as('fulltext').
-    
+
     :ivar sha1: None, or the sha1 of the content fulltext.
     :ivar storage_kind: The native storage kind of this factory. Always
         'fulltext'.
@@ -111,6 +147,8 @@
     def get_bytes_as(self, storage_kind):
         if storage_kind == self.storage_kind:
             return self._text
+        elif storage_kind == 'chunked':
+            return (self._text,)
         raise errors.UnavailableRepresentation(self.key, storage_kind,
             self.storage_kind)
 
@@ -805,12 +843,12 @@
                                   if not mpvf.has_version(p))
         # It seems likely that adding all the present parents as fulltexts can
         # easily exhaust memory.
-        split_lines = osutils.split_lines
+        chunks_to_lines = osutils.chunks_to_lines
         for record in self.get_record_stream(needed_parents, 'unordered',
             True):
             if record.storage_kind == 'absent':
                 continue
-            mpvf.add_version(split_lines(record.get_bytes_as('fulltext')),
+            mpvf.add_version(chunks_to_lines(record.get_bytes_as('chunked')),
                 record.key, [])
         for (key, parent_keys, expected_sha1, mpdiff), lines in\
             zip(records, mpvf.get_line_list(versions)):
@@ -941,9 +979,9 @@
         ghosts = maybe_ghosts - set(self.get_parent_map(maybe_ghosts))
         knit_keys.difference_update(ghosts)
         lines = {}
-        split_lines = osutils.split_lines
+        chunks_to_lines = osutils.chunks_to_lines
         for record in self.get_record_stream(knit_keys, 'topological', True):
-            lines[record.key] = split_lines(record.get_bytes_as('fulltext'))
+            lines[record.key] = chunks_to_lines(record.get_bytes_as('chunked'))
             # line_block_dict = {}
             # for parent, blocks in record.extract_line_blocks():
             #   line_blocks[parent] = blocks
@@ -1252,8 +1290,7 @@
                 lines = self._lines[key]
                 parents = self._parents[key]
                 pending.remove(key)
-                yield FulltextContentFactory(key, parents, None,
-                    ''.join(lines))
+                yield ChunkedContentFactory(key, parents, None, lines)
         for versionedfile in self.fallback_versionedfiles:
             for record in versionedfile.get_record_stream(
                 pending, 'unordered', True):
@@ -1423,9 +1460,9 @@
             if lines is not None:
                 if not isinstance(lines, list):
                     raise AssertionError
-                yield FulltextContentFactory((k,), None, 
+                yield ChunkedContentFactory((k,), None,
                         sha1=osutils.sha_strings(lines),
-                        text=''.join(lines))
+                        chunks=lines)
             else:
                 yield AbsentContentFactory((k,))
 

=== modified file 'bzrlib/weave.py'
--- a/bzrlib/weave.py	2008-10-13 04:54:26 +0000
+++ b/bzrlib/weave.py	2008-12-11 03:45:50 +0000
@@ -79,6 +79,8 @@
 from bzrlib import tsort
 """)
 from bzrlib import (
+    errors,
+    osutils,
     progress,
     )
 from bzrlib.errors import (WeaveError, WeaveFormatError, WeaveParentMismatch,
@@ -88,7 +90,6 @@
         WeaveRevisionAlreadyPresent,
         WeaveRevisionNotPresent,
         )
-import bzrlib.errors as errors
 from bzrlib.osutils import dirname, sha, sha_strings, split_lines
 import bzrlib.patiencediff
 from bzrlib.revision import NULL_REVISION
@@ -122,6 +123,8 @@
     def get_bytes_as(self, storage_kind):
         if storage_kind == 'fulltext':
             return self._weave.get_text(self.key[-1])
+        elif storage_kind == 'chunked':
+            return self._weave.get_lines(self.key[-1])
         else:
             raise UnavailableRepresentation(self.key, storage_kind, 'fulltext')
 
@@ -357,9 +360,10 @@
                 raise RevisionNotPresent([record.key[0]], self)
             # adapt to non-tuple interface
             parents = [parent[0] for parent in record.parents]
-            if record.storage_kind == 'fulltext':
+            if (record.storage_kind == 'fulltext'
+                or record.storage_kind == 'chunked'):
                 self.add_lines(record.key[0], parents,
-                    split_lines(record.get_bytes_as('fulltext')))
+                    osutils.chunks_to_lines(record.get_bytes_as('chunked')))
             else:
                 adapter_key = record.storage_kind, 'fulltext'
                 try:

=== modified file 'setup.py'
--- a/setup.py	2008-10-16 03:58:42 +0000
+++ b/setup.py	2008-12-11 02:18:59 +0000
@@ -258,6 +258,7 @@
 
 
 add_pyrex_extension('bzrlib._btree_serializer_c')
+add_pyrex_extension('bzrlib._chunks_to_lines_pyx')
 add_pyrex_extension('bzrlib._knit_load_data_c')
 if sys.platform == 'win32':
     add_pyrex_extension('bzrlib._dirstate_helpers_c',
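
Note on the 'chunked' storage kind used throughout the patch above: the callers now ask for record.get_bytes_as('chunked') and pass the result through osutils.chunks_to_lines instead of joining to a fulltext and calling split_lines. The sketch below is only an illustration of that contract, not the bzrlib implementation. The names chunks_to_lines_sketch and _split_on_newline are hypothetical, and it assumes lines are delimited by "\n" only. The point is that input which already is a list of lines can be returned as-is, so no large joined string has to be built.

```python
def _split_on_newline(text):
    """Split text into lines, keeping each trailing newline (hypothetical helper)."""
    parts = text.split("\n")
    lines = [part + "\n" for part in parts[:-1]]
    if parts[-1]:
        # The final piece has no trailing newline; keep it as a last line.
        lines.append(parts[-1])
    return lines


def chunks_to_lines_sketch(chunks):
    """Return a list of lines equivalent to ''.join(chunks).

    Illustrative stand-in for osutils.chunks_to_lines (assumption: the real
    function has the same contract; its code is not shown in this section).
    """
    last = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        if not chunk:
            break  # an empty chunk is not a line
        pos = chunk.find("\n")
        if pos == -1:
            if index != last:
                break  # only the final line may lack a newline
        elif pos != len(chunk) - 1:
            break  # a newline in the middle means several lines in one chunk
    else:
        # Fast path: every chunk is already exactly one line, nothing to copy.
        return chunks if isinstance(chunks, list) else list(chunks)
    # Slow path: join once and re-split on newlines.
    return _split_on_newline("".join(chunks))


# Mirrors the test data above: two chunks that join to "FOOBAR".
print(chunks_to_lines_sketch(["FOO", "BAR"]))
```

When the chunks are already lines (as with the lists yielded by the weave and knit code paths), the same list object comes back, which is where a memory saving of this kind would be expected to come from.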