Rev 3814: Merging the get_record_stream_chunked code drops peak memory noticeably. in http://bzr.arbash-meinel.com/branches/bzr/brisbane/xml_cache
John Arbash Meinel
john at arbash-meinel.com
Thu Dec 11 03:46:20 GMT 2008
At http://bzr.arbash-meinel.com/branches/bzr/brisbane/xml_cache
------------------------------------------------------------
revno: 3814
revision-id: john at arbash-meinel.com-20081211034550-cc2indrpb6a6rjn6
parent: john at arbash-meinel.com-20081211000604-kzutwqr3jkeez10s
parent: john at arbash-meinel.com-20081211031852-cmjpdf2ufno0okui
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: xml_cache
timestamp: Wed 2008-12-10 21:45:50 -0600
message:
Merging the get_record_stream_chunked code drops peak memory noticeably.
For the first 1k revisions, peak memory consumption drops by almost 100MB.
added:
bzrlib/_chunks_to_lines_py.py _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
modified:
.bzrignore bzrignore-20050311232317-81f7b71efa2db11a
bzrlib/knit.py knit.py-20051212171256-f056ac8f0fbe1bd9
bzrlib/merge.py merge.py-20050513021216-953b65a438527106
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/repository.py rev_storage.py-20051111201905-119e9401e46257e3
bzrlib/tests/__init__.py selftest.py-20050531073622-8d0e3c8845c97a64
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
bzrlib/transform.py transform.py-20060105172343-dd99e54394d91687
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
bzrlib/weave.py knit.py-20050627021749-759c29984154256b
setup.py setup.py-20050314065409-02f8a0a6e3f9bc70
------------------------------------------------------------
revno: 3735.139.17
revision-id: john at arbash-meinel.com-20081211031852-cmjpdf2ufno0okui
parent: john at arbash-meinel.com-20081211030803-gctunob7zsten3qg
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 21:18:52 -0600
message:
Start using osutils.chunks_as_lines rather than osutils.split_lines.
modified:
bzrlib/knit.py knit.py-20051212171256-f056ac8f0fbe1bd9
bzrlib/merge.py merge.py-20050513021216-953b65a438527106
bzrlib/transform.py transform.py-20060105172343-dd99e54394d91687
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
bzrlib/weave.py knit.py-20050627021749-759c29984154256b
------------------------------------------------------------
revno: 3735.139.16
revision-id: john at arbash-meinel.com-20081211030803-gctunob7zsten3qg
parent: john at arbash-meinel.com-20081211021859-3ds8cwdqiq387t83
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 21:08:03 -0600
message:
Move everything into properly parameterized tests.
Also add tests that we preserve the object when it is already lines.
The compiled form takes 450us on a 7.6k line file (NEWS).
So for common cases, we should have virtually no overhead.
added:
bzrlib/_chunks_to_lines_py.py _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/tests/__init__.py selftest.py-20050531073622-8d0e3c8845c97a64
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
------------------------------------------------------------
revno: 3735.139.15
revision-id: john at arbash-meinel.com-20081211021859-3ds8cwdqiq387t83
parent: john at arbash-meinel.com-20081211020207-rrgdcyqc344zo5q1
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 20:18:59 -0600
message:
A Pyrex extension is about 5x faster than the fastest python code I could write.
Seems worth having after all.
added:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
modified:
.bzrignore bzrignore-20050311232317-81f7b71efa2db11a
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
setup.py setup.py-20050314065409-02f8a0a6e3f9bc70
------------------------------------------------------------
revno: 3735.139.14
revision-id: john at arbash-meinel.com-20081211020207-rrgdcyqc344zo5q1
parent: john at arbash-meinel.com-20081211011419-vqtdjgpa04woqvm4
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 20:02:07 -0600
message:
Change name to 'chunks_to_lines', and find an optimized form.
It is a little bit ugly, but it is faster than join & split, and means
we get to leave the strings untouched.
modified:
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
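
The fast path described in this message can be sketched in a more readable (and slower) form as below. This is an illustration written for Python 3, not the list-comprehension trick from the patch; the helper name is_already_lines is an assumption made for this example.

```python
def is_already_lines(chunks):
    """Return True if every chunk is exactly one line.

    Every chunk except the last must end with its only newline; the last
    chunk must be non-empty and may omit the final newline.
    """
    if not chunks:
        return True
    for chunk in chunks[:-1]:
        if not chunk or chunk[-1] != '\n' or '\n' in chunk[:-1]:
            return False
    last = chunks[-1]
    return bool(last) and '\n' not in last[:-1]


def chunks_to_lines(chunks):
    """Return chunks unchanged when they are already lines, else re-split."""
    if is_already_lines(chunks):
        # Same object back: no copy is made and the strings stay untouched.
        return chunks
    parts = ''.join(chunks).split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines
```

Returning the input object itself is what avoids the extra memory: the joined fulltext and the re-split copies are only built when the chunks are not already lines.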
------------------------------------------------------------
revno: 3735.139.13
revision-id: john at arbash-meinel.com-20081211011419-vqtdjgpa04woqvm4
parent: john at arbash-meinel.com-20081211011038-osioaxd7moquxxmy
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 19:14:19 -0600
message:
More tests for edge cases.
modified:
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
------------------------------------------------------------
revno: 3735.139.12
revision-id: john at arbash-meinel.com-20081211011038-osioaxd7moquxxmy
parent: john at arbash-meinel.com-20081211010104-3tcii2strejk5252
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 19:10:38 -0600
message:
Add a new function that can convert 'chunks' format to a 'lines' format.
modified:
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
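
For reference, the simplest possible conversion from 'chunks' to 'lines' is a join followed by a split on '\n'. The sketch below assumes Python 3 text strings; the name chunks_to_lines_simple is hypothetical and only used for this illustration.

```python
def split_lines(text):
    """Split text into lines on '\\n' only, keeping the newline characters."""
    parts = text.split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        # The final piece has no trailing newline; keep it as-is.
        lines.append(parts[-1])
    return lines


def chunks_to_lines_simple(chunks):
    """Convert arbitrary chunks into lines by building the fulltext first."""
    return split_lines(''.join(chunks))
```

Note that a bare '\r' does not end a line here, so 'ba\rz\n' stays a single line.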
------------------------------------------------------------
revno: 3735.139.11
revision-id: john at arbash-meinel.com-20081211010104-3tcii2strejk5252
parent: john at arbash-meinel.com-20081211005616-szoqqeabcyahy39u
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 19:01:04 -0600
message:
Use the 'chunked' interface to keep memory consumption minimal during revision_trees()
modified:
bzrlib/repository.py rev_storage.py-20051111201905-119e9401e46257e3
------------------------------------------------------------
revno: 3735.139.10
revision-id: john at arbash-meinel.com-20081211005616-szoqqeabcyahy39u
parent: john at arbash-meinel.com-20081211005436-a8bn72zw43b1vd9r
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 18:56:16 -0600
message:
Change the signature to report the storage kind as 'chunked'
modified:
bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
------------------------------------------------------------
revno: 3735.139.9
revision-id: john at arbash-meinel.com-20081211005436-a8bn72zw43b1vd9r
parent: pqm at pqm.ubuntu.com-20081210082822-li6ku9s3k63kjrpr
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 18:54:36 -0600
message:
Start working on a ChunkedContentFactory.
This allows get_bytes_as('chunked') for both FulltextContentFactory,
and for ChunkedContentFactory, as it is a trivial conversion to
go between the two styles.
We will also want to special case when converting 'chunked' into
'lines'. But that is for future work.
modified:
bzrlib/knit.py knit.py-20051212171256-f056ac8f0fbe1bd9
bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
bzrlib/weave.py knit.py-20050627021749-759c29984154256b
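
The "trivial conversion" between the two styles can be sketched as follows. This is a stripped-down illustration for Python 3, not the bzrlib classes: the class names ChunkedContent and FulltextContent and the local exception are stand-ins invented for this example.

```python
class UnavailableRepresentation(Exception):
    """Stand-in for bzrlib.errors.UnavailableRepresentation."""


class ChunkedContent:
    """Holds content as a sequence of strings whose join is the fulltext."""

    storage_kind = 'chunked'

    def __init__(self, key, chunks):
        self.key = key
        self._chunks = chunks

    def get_bytes_as(self, storage_kind):
        if storage_kind == 'chunked':
            return self._chunks
        if storage_kind == 'fulltext':
            # The fulltext is only materialised when a caller asks for it.
            return ''.join(self._chunks)
        raise UnavailableRepresentation(self.key, storage_kind)


class FulltextContent:
    """Holds content as a single string."""

    storage_kind = 'fulltext'

    def __init__(self, key, text):
        self.key = key
        self._text = text

    def get_bytes_as(self, storage_kind):
        if storage_kind == 'fulltext':
            return self._text
        if storage_kind == 'chunked':
            # A one-element tuple is a valid 'chunked' form of a fulltext.
            return (self._text,)
        raise UnavailableRepresentation(self.key, storage_kind)
```

Either object can serve a consumer that asks for 'chunked', so callers can request that form unconditionally and only pay for a join when they need the fulltext.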
-------------- next part --------------
=== modified file '.bzrignore'
--- a/.bzrignore 2008-09-23 23:28:27 +0000
+++ b/.bzrignore 2008-12-11 02:18:59 +0000
@@ -39,6 +39,7 @@
doc/**/*.html
doc/developers/performance.png
bzrlib/_btree_serializer_c.c
+bzrlib/_chunks_to_lines_pyx.c
bzrlib/_dirstate_helpers_c.c
bzrlib/_knit_load_data_c.c
bzrlib/_readdir_pyx.c
=== added file 'bzrlib/_chunks_to_lines_py.py'
--- a/bzrlib/_chunks_to_lines_py.py 1970-01-01 00:00:00 +0000
+++ b/bzrlib/_chunks_to_lines_py.py 2008-12-11 03:08:03 +0000
@@ -0,0 +1,57 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+"""The python implementation of chunks_to_lines"""
+
+
+def chunks_to_lines(chunks):
+ """Ensure that chunks is split cleanly into lines.
+
+ Each entry in the result ends with a single newline, except for the
+ last entry, which may lack a final newline.
+
+ :param chunks: A list/tuple of strings. If chunks is already a list of
+ lines, then we will return it as-is.
+ :return: A list of strings.
+ """
+ # Optimize for a very common case when chunks are already lines
+ def fail():
+ raise IndexError
+ try:
+ # This is a bit ugly, but is the fastest way to check if all of the
+ # chunks are individual lines.
+ # You can't use function calls like .count(), .index(), or endswith()
+ # because they incur too much python overhead.
+ # It works because
+ # if chunk is an empty string, it will raise IndexError, which will
+ # be caught.
+ # if chunk doesn't end with '\n' then we hit fail()
+ # if there is more than one '\n' then we hit fail()
+ # timing shows this loop to take 2.58ms rather than 3.18ms for
+ # split_lines(''.join(chunks))
+ # Further, it means we get to preserve the original lines, rather than
+ # expanding memory
+ if not chunks:
+ return chunks
+ [(chunk[-1] == '\n' and '\n' not in chunk[:-1]) or fail()
+ for chunk in chunks[:-1]]
+ last = chunks[-1]
+ if last and '\n' not in last[:-1]:
+ return chunks
+ except IndexError:
+ pass
+ from bzrlib.osutils import split_lines
+ return split_lines(''.join(chunks))
=== added file 'bzrlib/_chunks_to_lines_pyx.pyx'
--- a/bzrlib/_chunks_to_lines_pyx.pyx 1970-01-01 00:00:00 +0000
+++ b/bzrlib/_chunks_to_lines_pyx.pyx 2008-12-11 03:08:03 +0000
@@ -0,0 +1,66 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+#
+
+"""Pyrex extensions for converting chunks to lines."""
+
+#python2.4 support
+cdef extern from "python-compat.h":
+ pass
+
+cdef extern from "stdlib.h":
+ ctypedef unsigned size_t
+
+cdef extern from "Python.h":
+ ctypedef int Py_ssize_t # Required for older pyrex versions
+ ctypedef struct PyObject:
+ pass
+ int PyList_Append(object lst, object item) except -1
+
+ char *PyString_AsString(object p) except NULL
+ int PyString_AsStringAndSize(object s, char **buf, Py_ssize_t *len) except -1
+
+cdef extern from "string.h":
+ void *memchr(void *s, int c, size_t n)
+
+
+def chunks_to_lines(chunks):
+ cdef char *c_str
+ cdef char *newline
+ cdef char *c_last
+ cdef Py_ssize_t the_len
+ cdef Py_ssize_t chunks_len
+ cdef Py_ssize_t cur
+
+ # Check to see if the chunks are already lines
+ chunks_len = len(chunks)
+ if chunks_len == 0:
+ return chunks
+ cur = 0
+ for chunk in chunks:
+ cur += 1
+ PyString_AsStringAndSize(chunk, &c_str, &the_len)
+ if the_len == 0:
+ break
+ c_last = c_str + the_len - 1
+ newline = <char *>memchr(c_str, c'\n', the_len)
+ if newline != c_last and not (newline == NULL and cur == chunks_len):
+ break
+ else:
+ return chunks
+
+ from bzrlib import osutils
+ return osutils.split_lines(''.join(chunks))
=== modified file 'bzrlib/knit.py'
--- a/bzrlib/knit.py 2008-12-07 16:46:56 +0000
+++ b/bzrlib/knit.py 2008-12-11 03:45:50 +0000
@@ -110,7 +110,7 @@
adapter_registry,
ConstantMapper,
ContentFactory,
- FulltextContentFactory,
+ ChunkedContentFactory,
VersionedFile,
VersionedFiles,
)
@@ -196,7 +196,8 @@
[compression_parent], 'unordered', True).next()
if basis_entry.storage_kind == 'absent':
raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
- basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
+ basis_chunks = basis_entry.get_bytes_as('chunked')
+ basis_lines = osutils.chunks_to_lines(basis_chunks)
# Manually apply the delta because we have one annotated content and
# one plain.
basis_content = PlainKnitContent(basis_lines, compression_parent)
@@ -229,7 +230,8 @@
[compression_parent], 'unordered', True).next()
if basis_entry.storage_kind == 'absent':
raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
- basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
+ basis_chunks = basis_entry.get_bytes_as('chunked')
+ basis_lines = osutils.chunks_to_lines(basis_chunks)
basis_content = PlainKnitContent(basis_lines, compression_parent)
# Manually apply the delta because we have one annotated content and
# one plain.
@@ -276,11 +278,13 @@
def get_bytes_as(self, storage_kind):
if storage_kind == self.storage_kind:
return self._raw_record
- if storage_kind == 'fulltext' and self._knit is not None:
- return self._knit.get_text(self.key[0])
- else:
- raise errors.UnavailableRepresentation(self.key, storage_kind,
- self.storage_kind)
+ if self._knit is not None:
+ if storage_kind == 'chunked':
+ return self._knit.get_lines(self.key[0])
+ elif storage_kind == 'fulltext':
+ return self._knit.get_text(self.key[0])
+ raise errors.UnavailableRepresentation(self.key, storage_kind,
+ self.storage_kind)
class KnitContent(object):
@@ -1025,7 +1029,7 @@
if record.storage_kind == 'absent':
continue
missing_keys.remove(record.key)
- lines = split_lines(record.get_bytes_as('fulltext'))
+ lines = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
text_map[record.key] = lines
content_map[record.key] = PlainKnitContent(lines, record.key)
if record.key in keys:
@@ -1293,9 +1297,8 @@
text_map, _ = self._get_content_maps(keys, non_local)
for key in keys:
lines = text_map.pop(key)
- text = ''.join(lines)
- yield FulltextContentFactory(key, global_map[key], None,
- text)
+ yield ChunkedContentFactory(key, global_map[key], None,
+ lines)
else:
for source, keys in source_keys:
if source is parent_maps[0]:
@@ -1448,6 +1451,9 @@
buffered = True
if not buffered:
self._index.add_records([index_entry])
+ elif record.storage_kind == 'chunked':
+ self.add_lines(record.key, parents,
+ osutils.chunks_to_lines(record.get_bytes_as('chunked')))
elif record.storage_kind == 'fulltext':
self.add_lines(record.key, parents,
split_lines(record.get_bytes_as('fulltext')))
@@ -2957,7 +2963,7 @@
reannotate = annotate.reannotate
for record in self._knit.get_record_stream(keys, 'topological', True):
key = record.key
- fulltext = split_lines(record.get_bytes_as('fulltext'))
+ fulltext = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
parents = parent_map[key]
if parents is not None:
parent_lines = [parent_cache[parent] for parent in parent_map[key]]
=== modified file 'bzrlib/merge.py'
--- a/bzrlib/merge.py 2008-10-10 11:55:03 +0000
+++ b/bzrlib/merge.py 2008-12-11 03:18:52 +0000
@@ -1579,7 +1579,7 @@
def get_lines(self, revisions):
"""Get lines for revisions from the backing VersionedFiles.
-
+
:raises RevisionNotPresent: on absent texts.
"""
keys = [(self._key_prefix + (rev,)) for rev in revisions]
@@ -1587,8 +1587,8 @@
for record in self.vf.get_record_stream(keys, 'unordered', True):
if record.storage_kind == 'absent':
raise errors.RevisionNotPresent(record.key, self.vf)
- result[record.key[-1]] = osutils.split_lines(
- record.get_bytes_as('fulltext'))
+ result[record.key[-1]] = osutils.chunks_to_lines(
+ record.get_bytes_as('chunked'))
return result
def plan_merge(self):
=== modified file 'bzrlib/osutils.py'
--- a/bzrlib/osutils.py 2008-10-17 03:49:08 +0000
+++ b/bzrlib/osutils.py 2008-12-11 03:08:03 +0000
@@ -812,6 +812,7 @@
rps.append(f)
return rps
+
def joinpath(p):
for f in p:
if (f == '..') or (f is None) or (f == ''):
@@ -819,6 +820,12 @@
return pathjoin(*p)
+try:
+ from bzrlib._chunks_to_lines_pyx import chunks_to_lines
+except ImportError:
+ from bzrlib._chunks_to_lines_py import chunks_to_lines
+
+
def split_lines(s):
"""Split s into lines, but without removing the newline characters."""
lines = s.split('\n')
=== modified file 'bzrlib/repository.py'
--- a/bzrlib/repository.py 2008-12-10 23:11:31 +0000
+++ b/bzrlib/repository.py 2008-12-11 03:45:50 +0000
@@ -1725,14 +1725,15 @@
def _iter_inventory_xmls(self, revision_ids):
keys = [(revision_id,) for revision_id in revision_ids]
stream = self.inventories.get_record_stream(keys, 'unordered', True)
- texts = {}
+ text_chunks = {}
for record in stream:
if record.storage_kind != 'absent':
- texts[record.key] = record.get_bytes_as('fulltext')
+ text_chunks[record.key] = record.get_bytes_as('chunked')
else:
raise errors.NoSuchRevision(self, record.key)
for key in keys:
- yield texts.pop(key), key[-1]
+ chunks = text_chunks.pop(key)
+ yield ''.join(chunks), key[-1]
def deserialise_inventory(self, revision_id, xml):
"""Transform the xml into an inventory object.
=== modified file 'bzrlib/tests/__init__.py'
--- a/bzrlib/tests/__init__.py 2008-12-10 23:11:31 +0000
+++ b/bzrlib/tests/__init__.py 2008-12-11 03:45:50 +0000
@@ -2790,6 +2790,7 @@
'bzrlib.tests.test_cache_utf8',
'bzrlib.tests.test_chk_map',
'bzrlib.tests.test_chunk_writer',
+ 'bzrlib.tests.test__chunks_to_lines',
'bzrlib.tests.test_commands',
'bzrlib.tests.test_commit',
'bzrlib.tests.test_commit_merge',
=== added file 'bzrlib/tests/test__chunks_to_lines.py'
--- a/bzrlib/tests/test__chunks_to_lines.py 1970-01-01 00:00:00 +0000
+++ b/bzrlib/tests/test__chunks_to_lines.py 2008-12-11 03:08:03 +0000
@@ -0,0 +1,112 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+#
+
+"""Tests for chunks_to_lines."""
+
+from bzrlib import tests
+
+
+def load_tests(standard_tests, module, loader):
+ # parameterize all tests in this module
+ suite = loader.suiteClass()
+ applier = tests.TestScenarioApplier()
+ import bzrlib._chunks_to_lines_py as py_module
+ applier.scenarios = [('python', {'module': py_module})]
+ if CompiledChunksToLinesFeature.available():
+ import bzrlib._chunks_to_lines_pyx as c_module
+ applier.scenarios.append(('C', {'module': c_module}))
+ else:
+ # the compiled module isn't available, so we add a failing test
+ class FailWithoutFeature(tests.TestCase):
+ def test_fail(self):
+ self.requireFeature(CompiledChunksToLinesFeature)
+ suite.addTest(loader.loadTestsFromTestCase(FailWithoutFeature))
+ tests.adapt_tests(standard_tests, applier, suite)
+ return suite
+
+
+class _CompiledChunksToLinesFeature(tests.Feature):
+
+ def _probe(self):
+ try:
+ import bzrlib._chunks_to_lines_pyx
+ except ImportError:
+ return False
+ return True
+
+ def feature_name(self):
+ return 'bzrlib._chunks_to_lines_pyx'
+
+CompiledChunksToLinesFeature = _CompiledChunksToLinesFeature()
+
+
+class TestChunksToLines(tests.TestCase):
+
+ module = None # Filled in by test parameterization
+
+ def assertChunksToLines(self, lines, chunks, alreadly_lines=False):
+ result = self.module.chunks_to_lines(chunks)
+ self.assertEqual(lines, result)
+ if alreadly_lines:
+ self.assertIs(chunks, result)
+
+ def test_fulltext_chunk_to_lines(self):
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+ ['foo\nbar\r\nba\rz\n'])
+ self.assertChunksToLines(['foobarbaz\n'], ['foobarbaz\n'],
+ alreadly_lines=True)
+
+ def test_lines_to_lines(self):
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+ ['foo\n', 'bar\r\n', 'ba\rz\n'],
+ alreadly_lines=True)
+
+ def test_no_final_newline(self):
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+ ['foo\nbar\r\nba\rz'])
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+ ['foo\n', 'bar\r\n', 'ba\rz'],
+ alreadly_lines=True)
+ self.assertChunksToLines(('foo\n', 'bar\r\n', 'ba\rz'),
+ ('foo\n', 'bar\r\n', 'ba\rz'),
+ alreadly_lines=True)
+ self.assertChunksToLines([], [], alreadly_lines=True)
+ self.assertChunksToLines(['foobarbaz'], ['foobarbaz'],
+ alreadly_lines=True)
+ self.assertChunksToLines([], [''])
+
+ def test_mixed(self):
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+ ['foo\n', 'bar\r\nba\r', 'z'])
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+ ['foo\nb', 'a', 'r\r\nba\r', 'z'])
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+ ['foo\nbar\r\nba', '\r', 'z'])
+
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+ ['foo\n', '', 'bar\r\nba', '\r', 'z'])
+ self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+ ['foo\n', 'bar\r\n', 'ba\rz\n', ''])
+
+ def test_not_lines(self):
+ # We should raise a TypeError, not crash
+ self.assertRaises(TypeError, self.module.chunks_to_lines,
+ object())
+ self.assertRaises(TypeError, self.module.chunks_to_lines,
+ [object()])
+ self.assertRaises(TypeError, self.module.chunks_to_lines,
+ ['foo', object()])
=== modified file 'bzrlib/tests/test_osutils.py'
--- a/bzrlib/tests/test_osutils.py 2008-10-01 07:56:03 +0000
+++ b/bzrlib/tests/test_osutils.py 2008-12-11 03:08:03 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2005, 2006, 2007 Canonical Ltd
+# Copyright (C) 2005, 2006, 2007, 2008 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -756,6 +756,23 @@
self.assertEndsWith(osutils._mac_getcwd(), u'B\xe5gfors')
+class TestChunksToLines(TestCase):
+
+ def test_smoketest(self):
+ self.assertEqual(['foo\n', 'bar\n', 'baz\n'],
+ osutils.chunks_to_lines(['foo\nbar', '\nbaz\n']))
+ self.assertEqual(['foo\n', 'bar\n', 'baz\n'],
+ osutils.chunks_to_lines(['foo\n', 'bar\n', 'baz\n']))
+
+ def test_is_compiled(self):
+ from bzrlib.tests.test__chunks_to_lines import CompiledChunksToLinesFeature
+ if CompiledChunksToLinesFeature.available():
+ from bzrlib._chunks_to_lines_pyx import chunks_to_lines
+ else:
+ from bzrlib._chunks_to_lines_py import chunks_to_lines
+ self.assertIs(chunks_to_lines, osutils.chunks_to_lines)
+
+
class TestSplitLines(TestCase):
def test_split_unicode(self):
=== modified file 'bzrlib/tests/test_versionedfile.py'
--- a/bzrlib/tests/test_versionedfile.py 2008-12-07 16:46:56 +0000
+++ b/bzrlib/tests/test_versionedfile.py 2008-12-11 03:45:50 +0000
@@ -1622,8 +1622,9 @@
"""Assert that storage_kind is a valid storage_kind."""
self.assertSubset([storage_kind],
['mpdiff', 'knit-annotated-ft', 'knit-annotated-delta',
- 'knit-ft', 'knit-delta', 'fulltext', 'knit-annotated-ft-gz',
- 'knit-annotated-delta-gz', 'knit-ft-gz', 'knit-delta-gz'])
+ 'knit-ft', 'knit-delta', 'chunked', 'fulltext',
+ 'knit-annotated-ft-gz', 'knit-annotated-delta-gz', 'knit-ft-gz',
+ 'knit-delta-gz'])
def capture_stream(self, f, entries, on_seen, parents):
"""Capture a stream for testing."""
@@ -1700,9 +1701,11 @@
[None, files.get_sha1s([factory.key])[factory.key]])
self.assertEqual(parent_map[factory.key], factory.parents)
# self.assertEqual(files.get_text(factory.key),
- self.assertIsInstance(factory.get_bytes_as('fulltext'), str)
- self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
- str)
+ ft_bytes = factory.get_bytes_as('fulltext')
+ self.assertIsInstance(ft_bytes, str)
+ chunked_bytes = factory.get_bytes_as('chunked')
+ self.assertEqualDiff(ft_bytes, ''.join(chunked_bytes))
+
self.assertStreamOrder(sort_order, seen, keys)
def assertStreamOrder(self, sort_order, seen, keys):
@@ -2274,8 +2277,9 @@
self._lines["A"] = ["FOO", "BAR"]
it = self.texts.get_record_stream([("A",)], "unordered", True)
record = it.next()
- self.assertEquals("fulltext", record.storage_kind)
+ self.assertEquals("chunked", record.storage_kind)
self.assertEquals("FOOBAR", record.get_bytes_as("fulltext"))
+ self.assertEquals(["FOO", "BAR"], record.get_bytes_as("chunked"))
def test_get_record_stream_absent(self):
it = self.texts.get_record_stream([("A",)], "unordered", True)
=== modified file 'bzrlib/transform.py'
--- a/bzrlib/transform.py 2008-10-28 10:31:32 +0000
+++ b/bzrlib/transform.py 2008-12-11 03:18:52 +0000
@@ -1177,7 +1177,7 @@
if kind == 'file':
cur_file = open(self._limbo_name(trans_id), 'rb')
try:
- lines = osutils.split_lines(cur_file.read())
+ lines = osutils.chunks_to_lines(cur_file.readlines())
finally:
cur_file.close()
parents = self._get_parents_lines(trans_id)
=== modified file 'bzrlib/versionedfile.py'
--- a/bzrlib/versionedfile.py 2008-12-07 16:46:56 +0000
+++ b/bzrlib/versionedfile.py 2008-12-11 03:45:50 +0000
@@ -59,6 +59,8 @@
'bzrlib.knit', 'FTAnnotatedToUnannotated')
adapter_registry.register_lazy(('knit-annotated-ft-gz', 'fulltext'),
'bzrlib.knit', 'FTAnnotatedToFullText')
+# adapter_registry.register_lazy(('knit-annotated-ft-gz', 'chunked'),
+# 'bzrlib.knit', 'FTAnnotatedToChunked')
class ContentFactory(object):
@@ -84,12 +86,46 @@
self.parents = None
+class ChunkedContentFactory(ContentFactory):
+ """Static data content factory.
+
+ This takes a 'chunked' list of strings. The only requirement on 'chunked' is
+ that ''.join(chunks) becomes a valid fulltext. A tuple of a single string
+ satisfies this, as does a list of lines.
+
+ :ivar sha1: None, or the sha1 of the content fulltext.
+ :ivar storage_kind: The native storage kind of this factory. Always
+ 'chunked'.
+ :ivar key: The key of this content. Each key is a tuple with a single
+ string in it.
+ :ivar parents: A tuple of parent keys for self.key. If the object has
+ no parent information, None (as opposed to () for an empty list of
+ parents).
+ """
+
+ def __init__(self, key, parents, sha1, chunks):
+ """Create a ContentFactory."""
+ self.sha1 = sha1
+ self.storage_kind = 'chunked'
+ self.key = key
+ self.parents = parents
+ self._chunks = chunks
+
+ def get_bytes_as(self, storage_kind):
+ if storage_kind == 'chunked':
+ return self._chunks
+ elif storage_kind == 'fulltext':
+ return ''.join(self._chunks)
+ raise errors.UnavailableRepresentation(self.key, storage_kind,
+ self.storage_kind)
+
+
class FulltextContentFactory(ContentFactory):
"""Static data content factory.
This takes a fulltext when created and just returns that during
get_bytes_as('fulltext').
-
+
:ivar sha1: None, or the sha1 of the content fulltext.
:ivar storage_kind: The native storage kind of this factory. Always
'fulltext'.
@@ -111,6 +147,8 @@
def get_bytes_as(self, storage_kind):
if storage_kind == self.storage_kind:
return self._text
+ elif storage_kind == 'chunked':
+ return (self._text,)
raise errors.UnavailableRepresentation(self.key, storage_kind,
self.storage_kind)
@@ -805,12 +843,12 @@
if not mpvf.has_version(p))
# It seems likely that adding all the present parents as fulltexts can
# easily exhaust memory.
- split_lines = osutils.split_lines
+ chunks_to_lines = osutils.chunks_to_lines
for record in self.get_record_stream(needed_parents, 'unordered',
True):
if record.storage_kind == 'absent':
continue
- mpvf.add_version(split_lines(record.get_bytes_as('fulltext')),
+ mpvf.add_version(chunks_to_lines(record.get_bytes_as('chunked')),
record.key, [])
for (key, parent_keys, expected_sha1, mpdiff), lines in\
zip(records, mpvf.get_line_list(versions)):
@@ -941,9 +979,9 @@
ghosts = maybe_ghosts - set(self.get_parent_map(maybe_ghosts))
knit_keys.difference_update(ghosts)
lines = {}
- split_lines = osutils.split_lines
+ chunks_to_lines = osutils.chunks_to_lines
for record in self.get_record_stream(knit_keys, 'topological', True):
- lines[record.key] = split_lines(record.get_bytes_as('fulltext'))
+ lines[record.key] = chunks_to_lines(record.get_bytes_as('chunked'))
# line_block_dict = {}
# for parent, blocks in record.extract_line_blocks():
# line_blocks[parent] = blocks
@@ -1252,8 +1290,7 @@
lines = self._lines[key]
parents = self._parents[key]
pending.remove(key)
- yield FulltextContentFactory(key, parents, None,
- ''.join(lines))
+ yield ChunkedContentFactory(key, parents, None, lines)
for versionedfile in self.fallback_versionedfiles:
for record in versionedfile.get_record_stream(
pending, 'unordered', True):
@@ -1423,9 +1460,9 @@
if lines is not None:
if not isinstance(lines, list):
raise AssertionError
- yield FulltextContentFactory((k,), None,
+ yield ChunkedContentFactory((k,), None,
sha1=osutils.sha_strings(lines),
- text=''.join(lines))
+ chunks=lines)
else:
yield AbsentContentFactory((k,))
=== modified file 'bzrlib/weave.py'
--- a/bzrlib/weave.py 2008-10-13 04:54:26 +0000
+++ b/bzrlib/weave.py 2008-12-11 03:45:50 +0000
@@ -79,6 +79,8 @@
from bzrlib import tsort
""")
from bzrlib import (
+ errors,
+ osutils,
progress,
)
from bzrlib.errors import (WeaveError, WeaveFormatError, WeaveParentMismatch,
@@ -88,7 +90,6 @@
WeaveRevisionAlreadyPresent,
WeaveRevisionNotPresent,
)
-import bzrlib.errors as errors
from bzrlib.osutils import dirname, sha, sha_strings, split_lines
import bzrlib.patiencediff
from bzrlib.revision import NULL_REVISION
@@ -122,6 +123,8 @@
def get_bytes_as(self, storage_kind):
if storage_kind == 'fulltext':
return self._weave.get_text(self.key[-1])
+ elif storage_kind == 'chunked':
+ return self._weave.get_lines(self.key[-1])
else:
raise UnavailableRepresentation(self.key, storage_kind, 'fulltext')
@@ -357,9 +360,10 @@
raise RevisionNotPresent([record.key[0]], self)
# adapt to non-tuple interface
parents = [parent[0] for parent in record.parents]
- if record.storage_kind == 'fulltext':
+ if (record.storage_kind == 'fulltext'
+ or record.storage_kind == 'chunked'):
self.add_lines(record.key[0], parents,
- split_lines(record.get_bytes_as('fulltext')))
+ osutils.chunks_to_lines(record.get_bytes_as('chunked')))
else:
adapter_key = record.storage_kind, 'fulltext'
try:
=== modified file 'setup.py'
--- a/setup.py 2008-10-16 03:58:42 +0000
+++ b/setup.py 2008-12-11 02:18:59 +0000
@@ -258,6 +258,7 @@
add_pyrex_extension('bzrlib._btree_serializer_c')
+add_pyrex_extension('bzrlib._chunks_to_lines_pyx')
add_pyrex_extension('bzrlib._knit_load_data_c')
if sys.platform == 'win32':
add_pyrex_extension('bzrlib._dirstate_helpers_c',