Rev 3895: (jam) Add ContentFactory.get_bytes_as('chunked') and osutils.chunks_to_lines() in file:///home/pqm/archives/thelove/bzr/%2Btrunk/
Canonical.com Patch Queue Manager
pqm at pqm.ubuntu.com
Thu Dec 11 20:23:06 GMT 2008
At file:///home/pqm/archives/thelove/bzr/%2Btrunk/
------------------------------------------------------------
revno: 3895
revision-id: pqm at pqm.ubuntu.com-20081211202300-6dz1vo3phfsc23pj
parent: pqm at pqm.ubuntu.com-20081211174647-l45s6xsw669ovgsa
parent: john at arbash-meinel.com-20081211193706-7qz4e5f9a8c5w4b1
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Thu 2008-12-11 20:23:00 +0000
message:
(jam) Add ContentFactory.get_bytes_as('chunked') and
osutils.chunks_to_lines()
added:
bzrlib/_chunks_to_lines_py.py _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
modified:
.bzrignore bzrignore-20050311232317-81f7b71efa2db11a
NEWS NEWS-20050323055033-4e00b5db738777ff
bzrlib/knit.py knit.py-20051212171256-f056ac8f0fbe1bd9
bzrlib/merge.py merge.py-20050513021216-953b65a438527106
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/repository.py rev_storage.py-20051111201905-119e9401e46257e3
bzrlib/tests/__init__.py selftest.py-20050531073622-8d0e3c8845c97a64
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
bzrlib/transform.py transform.py-20060105172343-dd99e54394d91687
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
bzrlib/weave.py knit.py-20050627021749-759c29984154256b
setup.py setup.py-20050314065409-02f8a0a6e3f9bc70
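The merge summary above introduces the 'chunked' storage kind: a list/tuple of strings whose concatenation is the fulltext. A minimal, bzrlib-free sketch of the two conversions the branch relies on (the sample chunks here are hypothetical, standing in for what get_bytes_as('chunked') would return):

```python
# Hypothetical chunk data, standing in for what a ContentFactory with
# storage_kind 'chunked' would hand back from get_bytes_as('chunked').
chunks = ['Section 1\nline one\n', 'line two\nSection 2\n']

# Conversion 1: the fulltext is just the concatenation of the chunks.
fulltext = ''.join(chunks)

# Conversion 2: guaranteed lines.  Split on '\n' only, keeping the newline
# on each line; a trailing fragment without a newline stays as the last
# entry.  This mirrors the contract of osutils.chunks_to_lines.
def to_lines(text):
    parts = text.split('\n')
    lines = [p + '\n' for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines

print(to_lines(fulltext))
```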
------------------------------------------------------------
revno: 3890.2.18
revision-id: john at arbash-meinel.com-20081211193706-7qz4e5f9a8c5w4b1
parent: john at arbash-meinel.com-20081211193101-q0utq7jeh79vpmgr
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 13:37:06 -0600
message:
Implement osutils.split_lines() in terms of chunks_to_lines if possible.
chunks_to_lines([fulltext]) is about 2x faster than the original split_lines implementation.
modified:
bzrlib/_chunks_to_lines_py.py _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
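The rewiring described above makes split_lines a thin wrapper: a fulltext is just a one-element chunk list. A simplified sketch of that relationship (the chunks_to_lines below always re-splits; the real one first checks the already-lines fast path, which is where the 2x speedup comes from):

```python
def chunks_to_lines(chunks):
    # Simplified: always join and re-split on '\n', keeping the newlines.
    parts = ''.join(chunks).split('\n')
    lines = [p + '\n' for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines

def split_lines(s):
    # A fulltext is a valid 'chunked' value of length one.
    return chunks_to_lines([s])

print(split_lines('foo\nbar\nbaz'))  # ['foo\n', 'bar\n', 'baz']
```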
------------------------------------------------------------
revno: 3890.2.17
revision-id: john at arbash-meinel.com-20081211193101-q0utq7jeh79vpmgr
parent: john at arbash-meinel.com-20081211182616-l9m9rjnea3bebaor
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 13:31:01 -0600
message:
Add a few more corner cases, some suggested by Robert.
modified:
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
------------------------------------------------------------
revno: 3890.2.16
revision-id: john at arbash-meinel.com-20081211182616-l9m9rjnea3bebaor
parent: john at arbash-meinel.com-20081211182023-sr6hi6owbbzozhkn
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 12:26:16 -0600
message:
If we split the work into 2 loops, we get 440us when the chunks are
already lines, and the same time when they are not.
The only downside is that it requires looping over the same data twice.
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
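The two-loop structure described in this message is: one pass that only validates, and a second pass (reached only when validation fails) that builds the result. A pure-Python sketch of that shape, not the Pyrex code itself:

```python
def chunks_to_lines(chunks):
    # Loop 1: check whether every chunk is already exactly one line, i.e.
    # a single '\n' at the very end; only the last chunk may omit it, and
    # no chunk may be empty.
    last_no_newline = False
    for chunk in chunks:
        if last_no_newline or not chunk or '\n' in chunk[:-1]:
            break
        if chunk[-1] != '\n':
            last_no_newline = True
    else:
        # Already simple lines: return the input untouched.
        return chunks
    # Loop 2: walk the data again, this time building the result.  Join and
    # split on '\n' only (str.splitlines would also split on '\r', which
    # bzr deliberately avoids).
    parts = ''.join(chunks).split('\n')
    lines = [p + '\n' for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines
```

The cost noted in the message, touching the same data twice, only bites when loop 1 bails out late in the input.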
------------------------------------------------------------
revno: 3890.2.15
revision-id: john at arbash-meinel.com-20081211182023-sr6hi6owbbzozhkn
parent: john at arbash-meinel.com-20081211175903-gtuvyewwr1eehauq
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 12:20:23 -0600
message:
Update to do a single iteration over the chunks.
This costs 600us versus 430us for the case where the object is
already a list of lines. However it is only 1.2ms rather than 3ms
when everything is in a single buffer.
The biggest advantage is that 'chunks' *could* be an iterator,
rather than requiring it to already have all the results.
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
------------------------------------------------------------
revno: 3890.2.14
revision-id: john at arbash-meinel.com-20081211175903-gtuvyewwr1eehauq
parent: john at arbash-meinel.com-20081211175431-s89ujzp4w4l51x34
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 11:59:03 -0600
message:
Restore correctness.
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
------------------------------------------------------------
revno: 3890.2.13
revision-id: john at arbash-meinel.com-20081211175431-s89ujzp4w4l51x34
parent: john at arbash-meinel.com-20081211174407-6sz5ooqz40m30xc2
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 11:54:31 -0600
message:
Add a NEWS entry.
modified:
NEWS NEWS-20050323055033-4e00b5db738777ff
------------------------------------------------------------
revno: 3890.2.12
revision-id: john at arbash-meinel.com-20081211174407-6sz5ooqz40m30xc2
parent: john at arbash-meinel.com-20081211174330-31to8tzq6k4ewii4
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 11:44:07 -0600
message:
Remove the extra comment, it probably isn't useful to most people.
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
------------------------------------------------------------
revno: 3890.2.11
revision-id: john at arbash-meinel.com-20081211174330-31to8tzq6k4ewii4
parent: john at arbash-meinel.com-20081211170336-70oi6rnsgkyh3z2o
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 11:43:30 -0600
message:
A bit more tweaking of the pyrex version. Shave off another 10% by
using PyString_CheckExact.
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
------------------------------------------------------------
revno: 3890.2.10
revision-id: john at arbash-meinel.com-20081211170336-70oi6rnsgkyh3z2o
parent: john at arbash-meinel.com-20081211031852-cmjpdf2ufno0okui
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Thu 2008-12-11 11:03:36 -0600
message:
Change the python implementation to a friendlier one.
It is only a little slower, because we still avoid function calls.
Redo the Pyrex version for clarity as well; it may need revisiting,
as it might be a little slower.
modified:
bzrlib/_chunks_to_lines_py.py _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
------------------------------------------------------------
revno: 3890.2.9
revision-id: john at arbash-meinel.com-20081211031852-cmjpdf2ufno0okui
parent: john at arbash-meinel.com-20081211030803-gctunob7zsten3qg
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 21:18:52 -0600
message:
Start using osutils.chunks_to_lines rather than osutils.split_lines.
modified:
bzrlib/knit.py knit.py-20051212171256-f056ac8f0fbe1bd9
bzrlib/merge.py merge.py-20050513021216-953b65a438527106
bzrlib/transform.py transform.py-20060105172343-dd99e54394d91687
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
bzrlib/weave.py knit.py-20050627021749-759c29984154256b
------------------------------------------------------------
revno: 3890.2.8
revision-id: john at arbash-meinel.com-20081211030803-gctunob7zsten3qg
parent: john at arbash-meinel.com-20081211021859-3ds8cwdqiq387t83
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 21:08:03 -0600
message:
Move everything into properly parameterized tests.
Also add tests that we preserve the object when it is already lines.
The compiled form takes 450us on a 7.6k line file (NEWS).
So for common cases, we should have virtually no overhead.
added:
bzrlib/_chunks_to_lines_py.py _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
modified:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/tests/__init__.py selftest.py-20050531073622-8d0e3c8845c97a64
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
------------------------------------------------------------
revno: 3890.2.7
revision-id: john at arbash-meinel.com-20081211021859-3ds8cwdqiq387t83
parent: john at arbash-meinel.com-20081211020207-rrgdcyqc344zo5q1
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 20:18:59 -0600
message:
A Pyrex extension is about 5x faster than the fastest python code I could write.
Seems worth having after all.
added:
bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
modified:
.bzrignore bzrignore-20050311232317-81f7b71efa2db11a
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
setup.py setup.py-20050314065409-02f8a0a6e3f9bc70
------------------------------------------------------------
revno: 3890.2.6
revision-id: john at arbash-meinel.com-20081211020207-rrgdcyqc344zo5q1
parent: john at arbash-meinel.com-20081211011419-vqtdjgpa04woqvm4
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 20:02:07 -0600
message:
Change name to 'chunks_to_lines', and find an optimized form.
It is a little bit ugly, but it is faster than join & split, and means
we get to leave the strings untouched.
modified:
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
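"Leaving the strings untouched" is the key property of the optimized form: when the input already is a list of lines, the very same objects come back, so nothing is copied. A sketch of that assumed behavior (it mirrors the assertIs checks added later in the test suite; this is not the bzrlib implementation):

```python
def chunks_to_lines(chunks):
    for i, chunk in enumerate(chunks):
        last = (i == len(chunks) - 1)
        # A valid line is non-empty, holds no interior '\n', and ends with
        # '\n' (the final chunk alone may omit the trailing newline).
        if not chunk or '\n' in chunk[:-1] or (chunk[-1] != '\n' and not last):
            break
    else:
        return chunks  # fast path: no new string objects at all
    parts = ''.join(chunks).split('\n')  # slow path: join & split
    lines = [p + '\n' for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines

already = ['foo\n', 'bar\n', 'tail']
assert chunks_to_lines(already) is already        # same list object back
assert chunks_to_lines(['a\nb']) == ['a\n', 'b']  # re-split allocates
```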
------------------------------------------------------------
revno: 3890.2.5
revision-id: john at arbash-meinel.com-20081211011419-vqtdjgpa04woqvm4
parent: john at arbash-meinel.com-20081211011038-osioaxd7moquxxmy
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 19:14:19 -0600
message:
More tests for edge cases.
modified:
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
------------------------------------------------------------
revno: 3890.2.4
revision-id: john at arbash-meinel.com-20081211011038-osioaxd7moquxxmy
parent: john at arbash-meinel.com-20081211010104-3tcii2strejk5252
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 19:10:38 -0600
message:
Add a new function that can convert 'chunks' format to a 'lines' format.
modified:
bzrlib/osutils.py osutils.py-20050309040759-eeaff12fbf77ac86
bzrlib/tests/test_osutils.py test_osutils.py-20051201224856-e48ee24c12182989
------------------------------------------------------------
revno: 3890.2.3
revision-id: john at arbash-meinel.com-20081211010104-3tcii2strejk5252
parent: john at arbash-meinel.com-20081211005616-szoqqeabcyahy39u
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 19:01:04 -0600
message:
Use the 'chunked' interface to keep memory consumption minimal during revision_trees()
modified:
bzrlib/repository.py rev_storage.py-20051111201905-119e9401e46257e3
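The memory savings mentioned here come from sharing: texts that share ancestry can reference the same chunk objects rather than each materializing a private fulltext. A toy illustration with hypothetical data, not bzrlib internals:

```python
# One shared chunk for the common prefix of two versions of a file.
common = 'header\nbody line 1\nbody line 2\n'
v1_chunks = [common, 'tail for v1\n']
v2_chunks = [common, 'tail for v2\n']

# Both versions reference one string object for the shared part...
assert v1_chunks[0] is v2_chunks[0]

# ...and a fulltext is only materialized on demand, per version.
full_v1 = ''.join(v1_chunks)
full_v2 = ''.join(v2_chunks)
```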
------------------------------------------------------------
revno: 3890.2.2
revision-id: john at arbash-meinel.com-20081211005616-szoqqeabcyahy39u
parent: john at arbash-meinel.com-20081211005436-a8bn72zw43b1vd9r
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 18:56:16 -0600
message:
Change the signature to report the storage kind as 'chunked'
modified:
bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
------------------------------------------------------------
revno: 3890.2.1
revision-id: john at arbash-meinel.com-20081211005436-a8bn72zw43b1vd9r
parent: pqm at pqm.ubuntu.com-20081210082822-li6ku9s3k63kjrpr
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: get_record_stream_chunked
timestamp: Wed 2008-12-10 18:54:36 -0600
message:
Start working on a ChunkedContentFactory.
This allows get_bytes_as('chunked') for both FulltextContentFactory
and ChunkedContentFactory, as converting between the two styles
is trivial.
We will also want to special-case converting 'chunked' into
'lines', but that is future work.
modified:
bzrlib/knit.py knit.py-20051212171256-f056ac8f0fbe1bd9
bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
bzrlib/versionedfile.py versionedfile.py-20060222045106-5039c71ee3b65490
bzrlib/weave.py knit.py-20050627021749-759c29984154256b
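The duality described above rests on a trivial conversion in each direction: a fulltext is already a valid one-element chunk list, and joining the chunks reconstructs the fulltext. A minimal sketch with simplified stand-ins (the real bzrlib factories also carry keys, parents and sha1s, omitted here):

```python
class FulltextFactory:
    storage_kind = 'fulltext'

    def __init__(self, text):
        self._text = text

    def get_bytes_as(self, kind):
        if kind == 'fulltext':
            return self._text
        if kind == 'chunked':
            # A fulltext is already a valid one-element chunk list.
            return [self._text]
        raise ValueError('unavailable representation: %r' % kind)


class ChunkedFactory:
    storage_kind = 'chunked'

    def __init__(self, chunks):
        self._chunks = chunks

    def get_bytes_as(self, kind):
        if kind == 'chunked':
            return self._chunks
        if kind == 'fulltext':
            # Joining the chunks reconstructs the fulltext.
            return ''.join(self._chunks)
        raise ValueError('unavailable representation: %r' % kind)


f = FulltextFactory('a\nb\n')
assert f.get_bytes_as('chunked') == ['a\nb\n']

c = ChunkedFactory(['a\n', 'b\n'])
assert c.get_bytes_as('fulltext') == 'a\nb\n'
```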
=== modified file '.bzrignore'
--- a/.bzrignore 2008-09-23 23:28:27 +0000
+++ b/.bzrignore 2008-12-11 02:18:59 +0000
@@ -39,6 +39,7 @@
 doc/**/*.html
 doc/developers/performance.png
 bzrlib/_btree_serializer_c.c
+bzrlib/_chunks_to_lines_pyx.c
 bzrlib/_dirstate_helpers_c.c
 bzrlib/_knit_load_data_c.c
 bzrlib/_readdir_pyx.c
=== modified file 'NEWS'
--- a/NEWS 2008-12-11 03:07:27 +0000
+++ b/NEWS 2008-12-11 20:23:00 +0000
@@ -84,6 +84,15 @@
advantage of pycurl is that it checks ssl certificates.)
(John Arbash Meinel)
+    * ``VersionedFiles.get_record_stream()`` can now return objects with a
+      storage_kind of ``chunked``. This is a collection (list/tuple) of
+      strings. You can use ``osutils.chunks_to_lines()`` to turn them into
+      guaranteed 'lines', or ``''.join(chunks)`` to turn them into a
+      fulltext. This allows for some very good memory savings when
+      asking for many texts that share ancestry, as the individual chunks
+      can be shared between versions of the file. (John Arbash Meinel)
+
+
bzr 1.10 2008-12-05
-------------------
=== added file 'bzrlib/_chunks_to_lines_py.py'
--- a/bzrlib/_chunks_to_lines_py.py 1970-01-01 00:00:00 +0000
+++ b/bzrlib/_chunks_to_lines_py.py 2008-12-11 19:37:06 +0000
@@ -0,0 +1,57 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+"""The python implementation of chunks_to_lines"""
+
+
+def chunks_to_lines(chunks):
+    """Re-split chunks into simple lines.
+
+    Each entry in the result should contain a single newline at the end,
+    except for the last entry, which may not have a final newline. If
+    chunks is already a simple list of lines, we return it directly.
+
+    :param chunks: A list/tuple of strings. If chunks is already a list of
+        lines, then we will return it as-is.
+    :return: A list of strings.
+    """
+    # Optimize for a very common case when chunks are already lines
+    last_no_newline = False
+    for chunk in chunks:
+        if last_no_newline:
+            # Only the last chunk is allowed to not have a trailing newline
+            # Getting here means the last chunk didn't have a newline, and we
+            # have a chunk following it
+            break
+        if not chunk:
+            # Empty strings are never valid lines
+            break
+        elif '\n' in chunk[:-1]:
+            # This chunk has an extra '\n', so we will have to split it
+            break
+        elif chunk[-1] != '\n':
+            # This chunk does not have a trailing newline
+            last_no_newline = True
+    else:
+        # All of the lines (but possibly the last) have a single newline at
+        # the end of the string.
+        # For the last one, we allow it to not have a trailing newline, but
+        # it is not allowed to be an empty string.
+        return chunks
+
+    # These aren't simple lines, just join and split again.
+    from bzrlib import osutils
+    return osutils._split_lines(''.join(chunks))
=== added file 'bzrlib/_chunks_to_lines_pyx.pyx'
--- a/bzrlib/_chunks_to_lines_pyx.pyx 1970-01-01 00:00:00 +0000
+++ b/bzrlib/_chunks_to_lines_pyx.pyx 2008-12-11 18:26:16 +0000
@@ -0,0 +1,130 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+#
+
+"""Pyrex extensions for converting chunks to lines."""
+
+#python2.4 support
+cdef extern from "python-compat.h":
+    pass
+
+cdef extern from "stdlib.h":
+    ctypedef unsigned size_t
+
+cdef extern from "Python.h":
+    ctypedef int Py_ssize_t # Required for older pyrex versions
+    ctypedef struct PyObject:
+        pass
+    int PyList_Append(object lst, object item) except -1
+
+    int PyString_CheckExact(object p)
+    char *PyString_AS_STRING(object p)
+    Py_ssize_t PyString_GET_SIZE(object p)
+    object PyString_FromStringAndSize(char *c_str, Py_ssize_t len)
+
+cdef extern from "string.h":
+    void *memchr(void *s, int c, size_t n)
+
+
+def chunks_to_lines(chunks):
+    """Re-split chunks into simple lines.
+
+    Each entry in the result should contain a single newline at the end,
+    except for the last entry, which may not have a final newline. If
+    chunks is already a simple list of lines, we return it directly.
+
+    :param chunks: A list/tuple of strings. If chunks is already a list of
+        lines, then we will return it as-is.
+    :return: A list of strings.
+    """
+    cdef char *c_str
+    cdef char *newline
+    cdef char *c_last
+    cdef Py_ssize_t the_len
+    cdef int last_no_newline
+
+    # Check to see if the chunks are already lines
+    last_no_newline = 0
+    for chunk in chunks:
+        if last_no_newline:
+            # We have a chunk which followed a chunk without a newline, so
+            # this is not a simple list of lines.
+            break
+        # Switching from PyString_AsStringAndSize to PyString_CheckExact and
+        # then the macros GET_SIZE and AS_STRING saved us 40us / 470us.
+        # It seems PyString_AsStringAndSize can actually trigger a conversion,
+        # which we don't want anyway.
+        if not PyString_CheckExact(chunk):
+            raise TypeError('chunk is not a string')
+        the_len = PyString_GET_SIZE(chunk)
+        if the_len == 0:
+            # An empty string is never a valid line
+            break
+        c_str = PyString_AS_STRING(chunk)
+        c_last = c_str + the_len - 1
+        newline = <char *>memchr(c_str, c'\n', the_len)
+        if newline != c_last:
+            if newline == NULL:
+                # Missing a newline. Only valid as the last line
+                last_no_newline = 1
+            else:
+                # There is a newline in the middle, we must resplit
+                break
+    else:
+        # Everything was already a list of lines
+        return chunks
+
+    # We know we need to create a new list of lines
+    lines = []
+    tail = None # Any remainder from the previous chunk
+    for chunk in chunks:
+        if tail is not None:
+            chunk = tail + chunk
+            tail = None
+        if not PyString_CheckExact(chunk):
+            raise TypeError('chunk is not a string')
+        the_len = PyString_GET_SIZE(chunk)
+        if the_len == 0:
+            # An empty string is never a valid line, and we don't need to
+            # append anything
+            continue
+        c_str = PyString_AS_STRING(chunk)
+        c_last = c_str + the_len - 1
+        newline = <char *>memchr(c_str, c'\n', the_len)
+        if newline == c_last:
+            # A simple line
+            PyList_Append(lines, chunk)
+        elif newline == NULL:
+            # A chunk without a newline, if this is the last entry, then we
+            # allow it
+            tail = chunk
+        else:
+            # We have a newline in the middle, loop until we've consumed all
+            # lines
+            while newline != NULL:
+                line = PyString_FromStringAndSize(c_str, newline - c_str + 1)
+                PyList_Append(lines, line)
+                c_str = newline + 1
+                if c_str > c_last: # We are done
+                    break
+                the_len = c_last - c_str + 1
+                newline = <char *>memchr(c_str, c'\n', the_len)
+                if newline == NULL:
+                    tail = PyString_FromStringAndSize(c_str, the_len)
+                    break
+    if tail is not None:
+        PyList_Append(lines, tail)
+    return lines
=== modified file 'bzrlib/knit.py'
--- a/bzrlib/knit.py 2008-12-05 15:34:02 +0000
+++ b/bzrlib/knit.py 2008-12-11 03:18:52 +0000
@@ -110,7 +110,7 @@
     adapter_registry,
     ConstantMapper,
     ContentFactory,
-    FulltextContentFactory,
+    ChunkedContentFactory,
     VersionedFile,
     VersionedFiles,
     )
@@ -196,7 +196,8 @@
                 [compression_parent], 'unordered', True).next()
             if basis_entry.storage_kind == 'absent':
                 raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
-            basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
+            basis_chunks = basis_entry.get_bytes_as('chunked')
+            basis_lines = osutils.chunks_to_lines(basis_chunks)
             # Manually apply the delta because we have one annotated content and
             # one plain.
             basis_content = PlainKnitContent(basis_lines, compression_parent)
@@ -229,7 +230,8 @@
                 [compression_parent], 'unordered', True).next()
             if basis_entry.storage_kind == 'absent':
                 raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
-            basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
+            basis_chunks = basis_entry.get_bytes_as('chunked')
+            basis_lines = osutils.chunks_to_lines(basis_chunks)
             basis_content = PlainKnitContent(basis_lines, compression_parent)
             # Manually apply the delta because we have one annotated content and
             # one plain.
@@ -276,11 +278,13 @@
     def get_bytes_as(self, storage_kind):
         if storage_kind == self.storage_kind:
             return self._raw_record
-        if storage_kind == 'fulltext' and self._knit is not None:
-            return self._knit.get_text(self.key[0])
-        else:
-            raise errors.UnavailableRepresentation(self.key, storage_kind,
-                self.storage_kind)
+        if self._knit is not None:
+            if storage_kind == 'chunked':
+                return self._knit.get_lines(self.key[0])
+            elif storage_kind == 'fulltext':
+                return self._knit.get_text(self.key[0])
+        raise errors.UnavailableRepresentation(self.key, storage_kind,
+            self.storage_kind)


 class KnitContent(object):
@@ -1020,7 +1024,7 @@
             if record.storage_kind == 'absent':
                 continue
             missing_keys.remove(record.key)
-            lines = split_lines(record.get_bytes_as('fulltext'))
+            lines = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
             text_map[record.key] = lines
             content_map[record.key] = PlainKnitContent(lines, record.key)
             if record.key in keys:
@@ -1288,9 +1292,8 @@
                 text_map, _ = self._get_content_maps(keys, non_local)
                 for key in keys:
                     lines = text_map.pop(key)
-                    text = ''.join(lines)
-                    yield FulltextContentFactory(key, global_map[key], None,
-                        text)
+                    yield ChunkedContentFactory(key, global_map[key], None,
+                        lines)
             else:
                 for source, keys in source_keys:
                     if source is parent_maps[0]:
@@ -1443,6 +1446,9 @@
                     buffered = True
                 if not buffered:
                     self._index.add_records([index_entry])
+            elif record.storage_kind == 'chunked':
+                self.add_lines(record.key, parents,
+                    osutils.chunks_to_lines(record.get_bytes_as('chunked')))
             elif record.storage_kind == 'fulltext':
                 self.add_lines(record.key, parents,
                     split_lines(record.get_bytes_as('fulltext')))
@@ -2952,7 +2958,7 @@
         reannotate = annotate.reannotate
         for record in self._knit.get_record_stream(keys, 'topological', True):
             key = record.key
-            fulltext = split_lines(record.get_bytes_as('fulltext'))
+            fulltext = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
             parents = parent_map[key]
             if parents is not None:
                 parent_lines = [parent_cache[parent] for parent in parent_map[key]]
=== modified file 'bzrlib/merge.py'
--- a/bzrlib/merge.py 2008-10-10 11:55:03 +0000
+++ b/bzrlib/merge.py 2008-12-11 03:18:52 +0000
@@ -1579,7 +1579,7 @@
     def get_lines(self, revisions):
         """Get lines for revisions from the backing VersionedFiles.
-        
+
         :raises RevisionNotPresent: on absent texts.
         """
         keys = [(self._key_prefix + (rev,)) for rev in revisions]
@@ -1587,8 +1587,8 @@
         for record in self.vf.get_record_stream(keys, 'unordered', True):
             if record.storage_kind == 'absent':
                 raise errors.RevisionNotPresent(record.key, self.vf)
-            result[record.key[-1]] = osutils.split_lines(
-                record.get_bytes_as('fulltext'))
+            result[record.key[-1]] = osutils.chunks_to_lines(
+                record.get_bytes_as('chunked'))
         return result

     def plan_merge(self):
=== modified file 'bzrlib/osutils.py'
--- a/bzrlib/osutils.py 2008-10-17 03:49:08 +0000
+++ b/bzrlib/osutils.py 2008-12-11 19:37:06 +0000
@@ -812,6 +812,7 @@
             rps.append(f)
     return rps

+
 def joinpath(p):
     for f in p:
         if (f == '..') or (f is None) or (f == ''):
@@ -819,8 +820,28 @@
     return pathjoin(*p)


+try:
+    from bzrlib._chunks_to_lines_pyx import chunks_to_lines
+except ImportError:
+    from bzrlib._chunks_to_lines_py import chunks_to_lines
+
+
 def split_lines(s):
     """Split s into lines, but without removing the newline characters."""
+    # Trivially convert a fulltext into a 'chunked' representation, and let
+    # chunks_to_lines do the heavy lifting.
+    if isinstance(s, str):
+        # chunks_to_lines only supports 8-bit strings
+        return chunks_to_lines([s])
+    else:
+        return _split_lines(s)
+
+
+def _split_lines(s):
+    """Split s into lines, but without removing the newline characters.
+
+    This supports Unicode or plain string objects.
+    """
     lines = s.split('\n')
     result = [line + '\n' for line in lines[:-1]]
     if lines[-1]:
=== modified file 'bzrlib/repository.py'
--- a/bzrlib/repository.py 2008-12-10 04:34:21 +0000
+++ b/bzrlib/repository.py 2008-12-11 01:01:04 +0000
@@ -1680,14 +1680,15 @@
     def _iter_inventory_xmls(self, revision_ids):
         keys = [(revision_id,) for revision_id in revision_ids]
         stream = self.inventories.get_record_stream(keys, 'unordered', True)
-        texts = {}
+        text_chunks = {}
        for record in stream:
             if record.storage_kind != 'absent':
-                texts[record.key] = record.get_bytes_as('fulltext')
+                text_chunks[record.key] = record.get_bytes_as('chunked')
             else:
                 raise errors.NoSuchRevision(self, record.key)
         for key in keys:
-            yield texts[key], key[-1]
+            chunks = text_chunks.pop(key)
+            yield ''.join(chunks), key[-1]
def deserialise_inventory(self, revision_id, xml):
"""Transform the xml into an inventory object.
=== modified file 'bzrlib/tests/__init__.py'
--- a/bzrlib/tests/__init__.py 2008-12-09 21:35:49 +0000
+++ b/bzrlib/tests/__init__.py 2008-12-11 03:08:03 +0000
@@ -2788,6 +2788,7 @@
         'bzrlib.tests.test_bzrdir',
         'bzrlib.tests.test_cache_utf8',
         'bzrlib.tests.test_chunk_writer',
+        'bzrlib.tests.test__chunks_to_lines',
         'bzrlib.tests.test_commands',
         'bzrlib.tests.test_commit',
         'bzrlib.tests.test_commit_merge',
=== added file 'bzrlib/tests/test__chunks_to_lines.py'
--- a/bzrlib/tests/test__chunks_to_lines.py 1970-01-01 00:00:00 +0000
+++ b/bzrlib/tests/test__chunks_to_lines.py 2008-12-11 19:31:01 +0000
@@ -0,0 +1,128 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+#
+
+"""Tests for chunks_to_lines."""
+
+from bzrlib import tests
+
+
+def load_tests(standard_tests, module, loader):
+ # parameterize all tests in this module
+ suite = loader.suiteClass()
+ applier = tests.TestScenarioApplier()
+ import bzrlib._chunks_to_lines_py as py_module
+ applier.scenarios = [('python', {'module': py_module})]
+ if CompiledChunksToLinesFeature.available():
+ import bzrlib._chunks_to_lines_pyx as c_module
+ applier.scenarios.append(('C', {'module': c_module}))
+ else:
+ # the compiled module isn't available, so we add a failing test
+ class FailWithoutFeature(tests.TestCase):
+ def test_fail(self):
+ self.requireFeature(CompiledChunksToLinesFeature)
+ suite.addTest(loader.loadTestsFromTestCase(FailWithoutFeature))
+ tests.adapt_tests(standard_tests, applier, suite)
+ return suite
+
+
+class _CompiledChunksToLinesFeature(tests.Feature):
+
+ def _probe(self):
+ try:
+ import bzrlib._chunks_to_lines_pyx
+ except ImportError:
+ return False
+ return True
+
+ def feature_name(self):
+ return 'bzrlib._chunks_to_lines_pyx'
+
+CompiledChunksToLinesFeature = _CompiledChunksToLinesFeature()
+
+
+class TestChunksToLines(tests.TestCase):
+
+ module = None # Filled in by test parameterization
+
+    def assertChunksToLines(self, lines, chunks, already_lines=False):
+        result = self.module.chunks_to_lines(chunks)
+        self.assertEqual(lines, result)
+        if already_lines:
+            self.assertIs(chunks, result)
+
+    def test_fulltext_chunk_to_lines(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\nbar\r\nba\rz\n'])
+        self.assertChunksToLines(['foobarbaz\n'], ['foobarbaz\n'],
+                                 already_lines=True)
+        self.assertChunksToLines(['foo\n', 'bar\n', '\n', 'baz\n', '\n', '\n'],
+                                 ['foo\nbar\n\nbaz\n\n\n'])
+        self.assertChunksToLines(['foobarbaz'], ['foobarbaz'],
+                                 already_lines=True)
+        self.assertChunksToLines(['foobarbaz'], ['foo', 'bar', 'baz'])
+
+    def test_newlines(self):
+        self.assertChunksToLines(['\n'], ['\n'], already_lines=True)
+        self.assertChunksToLines(['\n'], ['', '\n', ''])
+        self.assertChunksToLines(['\n'], ['\n', ''])
+        self.assertChunksToLines(['\n'], ['', '\n'])
+        self.assertChunksToLines(['\n', '\n', '\n'], ['\n\n\n'])
+        self.assertChunksToLines(['\n', '\n', '\n'], ['\n', '\n', '\n'],
+                                 already_lines=True)
+
+    def test_lines_to_lines(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 already_lines=True)
+
+    def test_no_final_newline(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\nbar\r\nba\rz'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\n', 'bar\r\n', 'ba\rz'],
+                                 already_lines=True)
+        self.assertChunksToLines(('foo\n', 'bar\r\n', 'ba\rz'),
+                                 ('foo\n', 'bar\r\n', 'ba\rz'),
+                                 already_lines=True)
+        self.assertChunksToLines([], [], already_lines=True)
+        self.assertChunksToLines(['foobarbaz'], ['foobarbaz'],
+                                 already_lines=True)
+        self.assertChunksToLines([], [''])
+
+    def test_mixed(self):
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\n', 'bar\r\nba\r', 'z'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\nb', 'a', 'r\r\nba\r', 'z'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\nbar\r\nba', '\r', 'z'])
+
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz'],
+                                 ['foo\n', '', 'bar\r\nba', '\r', 'z'])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\n', 'bar\r\n', 'ba\rz\n', ''])
+        self.assertChunksToLines(['foo\n', 'bar\r\n', 'ba\rz\n'],
+                                 ['foo\n', 'bar', '\r\n', 'ba\rz\n'])
+
+    def test_not_lines(self):
+        # We should raise a TypeError, not crash
+        self.assertRaises(TypeError, self.module.chunks_to_lines,
+                          object())
+        self.assertRaises(TypeError, self.module.chunks_to_lines,
+                          [object()])
+        self.assertRaises(TypeError, self.module.chunks_to_lines,
+                          ['foo', object()])
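[Editor's note] The tests above pin down the contract: chunks may split anywhere, only '\n' terminates a line (a bare '\r' does not), an unterminated final line is kept, and input that is already a list of lines is returned unchanged, identity included (the assertIs checks). A pure-Python sketch of that contract (not the bzrlib implementation) could look like:

```python
def chunks_to_lines(chunks):
    """Re-split arbitrary string chunks into lines.

    Only '\\n' ends a line (a bare '\\r' does not). If 'chunks' is
    already a list of lines, it is returned unchanged so callers can
    rely on identity, as the tests above require.
    """
    last = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        if not isinstance(chunk, str):
            raise TypeError('chunk is not a string: %r' % (chunk,))
        nl = chunk.find('\n')
        if chunk and nl == len(chunk) - 1:
            continue  # exactly one '\n', at the end: already a line
        if nl == -1 and chunk and i == last:
            continue  # an unterminated final line is allowed
        break         # empty chunk, embedded '\n', or mid-line fragment
    else:
        return chunks  # fast path: input was already a list of lines
    # Slow path: join and re-split on '\n' only (str.splitlines() would
    # also split on '\r', which this contract forbids).
    parts = ''.join(chunks).split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])
    return lines
```

The fast path is what makes the 'chunked' storage kind cheap: producers that already hold a list of lines pay nothing beyond one scan.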
=== modified file 'bzrlib/tests/test_osutils.py'
--- a/bzrlib/tests/test_osutils.py 2008-10-01 07:56:03 +0000
+++ b/bzrlib/tests/test_osutils.py 2008-12-11 03:08:03 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2005, 2006, 2007 Canonical Ltd
+# Copyright (C) 2005, 2006, 2007, 2008 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -756,6 +756,23 @@
         self.assertEndsWith(osutils._mac_getcwd(), u'B\xe5gfors')


+class TestChunksToLines(TestCase):
+
+    def test_smoketest(self):
+        self.assertEqual(['foo\n', 'bar\n', 'baz\n'],
+                         osutils.chunks_to_lines(['foo\nbar', '\nbaz\n']))
+        self.assertEqual(['foo\n', 'bar\n', 'baz\n'],
+                         osutils.chunks_to_lines(['foo\n', 'bar\n', 'baz\n']))
+
+    def test_is_compiled(self):
+        from bzrlib.tests.test__chunks_to_lines import CompiledChunksToLinesFeature
+        if CompiledChunksToLinesFeature.available():
+            from bzrlib._chunks_to_lines_pyx import chunks_to_lines
+        else:
+            from bzrlib._chunks_to_lines_py import chunks_to_lines
+        self.assertIs(chunks_to_lines, osutils.chunks_to_lines)
+
+
 class TestSplitLines(TestCase):

     def test_split_unicode(self):
=== modified file 'bzrlib/tests/test_versionedfile.py'
--- a/bzrlib/tests/test_versionedfile.py 2008-12-03 21:05:01 +0000
+++ b/bzrlib/tests/test_versionedfile.py 2008-12-11 00:56:16 +0000
@@ -1558,8 +1558,9 @@
"""Assert that storage_kind is a valid storage_kind."""
self.assertSubset([storage_kind],
['mpdiff', 'knit-annotated-ft', 'knit-annotated-delta',
- 'knit-ft', 'knit-delta', 'fulltext', 'knit-annotated-ft-gz',
- 'knit-annotated-delta-gz', 'knit-ft-gz', 'knit-delta-gz'])
+ 'knit-ft', 'knit-delta', 'chunked', 'fulltext',
+ 'knit-annotated-ft-gz', 'knit-annotated-delta-gz', 'knit-ft-gz',
+ 'knit-delta-gz'])
def capture_stream(self, f, entries, on_seen, parents):
"""Capture a stream for testing."""
@@ -1636,9 +1637,11 @@
[None, files.get_sha1s([factory.key])[factory.key]])
self.assertEqual(parent_map[factory.key], factory.parents)
# self.assertEqual(files.get_text(factory.key),
- self.assertIsInstance(factory.get_bytes_as('fulltext'), str)
- self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
- str)
+ ft_bytes = factory.get_bytes_as('fulltext')
+ self.assertIsInstance(ft_bytes, str)
+ chunked_bytes = factory.get_bytes_as('chunked')
+ self.assertEqualDiff(ft_bytes, ''.join(chunked_bytes))
+
self.assertStreamOrder(sort_order, seen, keys)
def assertStreamOrder(self, sort_order, seen, keys):
@@ -2210,8 +2213,9 @@
self._lines["A"] = ["FOO", "BAR"]
it = self.texts.get_record_stream([("A",)], "unordered", True)
record = it.next()
- self.assertEquals("fulltext", record.storage_kind)
+ self.assertEquals("chunked", record.storage_kind)
self.assertEquals("FOOBAR", record.get_bytes_as("fulltext"))
+ self.assertEquals(["FOO", "BAR"], record.get_bytes_as("chunked"))
def test_get_record_stream_absent(self):
it = self.texts.get_record_stream([("A",)], "unordered", True)
=== modified file 'bzrlib/transform.py'
--- a/bzrlib/transform.py 2008-10-28 10:31:32 +0000
+++ b/bzrlib/transform.py 2008-12-11 03:18:52 +0000
@@ -1177,7 +1177,7 @@
             if kind == 'file':
                 cur_file = open(self._limbo_name(trans_id), 'rb')
                 try:
-                    lines = osutils.split_lines(cur_file.read())
+                    lines = osutils.chunks_to_lines(cur_file.readlines())
                 finally:
                     cur_file.close()
                 parents = self._get_parents_lines(trans_id)
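[Editor's note] The transform.py hunk swaps split_lines(cur_file.read()) for chunks_to_lines(cur_file.readlines()). Since the file is opened in binary mode, readlines() splits on '\n' only, so each element it returns is already one line and chunks_to_lines() can take its fast path rather than joining and re-splitting. A small illustration of why readlines() output already satisfies the line invariant:

```python
import io

# The file's content, with a lone '\r' that must not split a line.
f = io.BytesIO(b'foo\nbar\r\nba\rz')
chunks = f.readlines()
# readlines() on a binary stream splits on b'\n' only, so each element
# is already a line; osutils.chunks_to_lines() can return the list as-is.
```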
=== modified file 'bzrlib/versionedfile.py'
--- a/bzrlib/versionedfile.py 2008-12-03 21:05:01 +0000
+++ b/bzrlib/versionedfile.py 2008-12-11 03:18:52 +0000
@@ -59,6 +59,8 @@
     'bzrlib.knit', 'FTAnnotatedToUnannotated')
 adapter_registry.register_lazy(('knit-annotated-ft-gz', 'fulltext'),
     'bzrlib.knit', 'FTAnnotatedToFullText')
+# adapter_registry.register_lazy(('knit-annotated-ft-gz', 'chunked'),
+#     'bzrlib.knit', 'FTAnnotatedToChunked')


 class ContentFactory(object):
@@ -84,12 +86,46 @@
         self.parents = None


+class ChunkedContentFactory(ContentFactory):
+    """Static data content factory.
+
+    This takes a 'chunked' list of strings. The only requirement on 'chunked'
+    is that ''.join(chunks) becomes a valid fulltext. A tuple of a single
+    string satisfies this, as does a list of lines.
+
+    :ivar sha1: None, or the sha1 of the content fulltext.
+    :ivar storage_kind: The native storage kind of this factory. Always
+        'chunked'.
+    :ivar key: The key of this content. Each key is a tuple with a single
+        string in it.
+    :ivar parents: A tuple of parent keys for self.key. If the object has
+        no parent information, None (as opposed to () for an empty list of
+        parents).
+    """
+
+    def __init__(self, key, parents, sha1, chunks):
+        """Create a ContentFactory."""
+        self.sha1 = sha1
+        self.storage_kind = 'chunked'
+        self.key = key
+        self.parents = parents
+        self._chunks = chunks
+
+    def get_bytes_as(self, storage_kind):
+        if storage_kind == 'chunked':
+            return self._chunks
+        elif storage_kind == 'fulltext':
+            return ''.join(self._chunks)
+        raise errors.UnavailableRepresentation(self.key, storage_kind,
+            self.storage_kind)
+
+
 class FulltextContentFactory(ContentFactory):
     """Static data content factory.

     This takes a fulltext when created and just returns that during
     get_bytes_as('fulltext').
-
+
     :ivar sha1: None, or the sha1 of the content fulltext.
     :ivar storage_kind: The native storage kind of this factory. Always
         'fulltext'.
@@ -111,6 +147,8 @@
     def get_bytes_as(self, storage_kind):
         if storage_kind == self.storage_kind:
             return self._text
+        elif storage_kind == 'chunked':
+            return (self._text,)
         raise errors.UnavailableRepresentation(self.key, storage_kind,
             self.storage_kind)

@@ -804,12 +842,12 @@
                 if not mpvf.has_version(p))
         # It seems likely that adding all the present parents as fulltexts can
         # easily exhaust memory.
-        split_lines = osutils.split_lines
+        chunks_to_lines = osutils.chunks_to_lines
         for record in self.get_record_stream(needed_parents, 'unordered',
             True):
             if record.storage_kind == 'absent':
                 continue
-            mpvf.add_version(split_lines(record.get_bytes_as('fulltext')),
+            mpvf.add_version(chunks_to_lines(record.get_bytes_as('chunked')),
                 record.key, [])
         for (key, parent_keys, expected_sha1, mpdiff), lines in\
             zip(records, mpvf.get_line_list(versions)):
@@ -940,9 +978,9 @@
         ghosts = maybe_ghosts - set(self.get_parent_map(maybe_ghosts))
         knit_keys.difference_update(ghosts)
         lines = {}
-        split_lines = osutils.split_lines
+        chunks_to_lines = osutils.chunks_to_lines
         for record in self.get_record_stream(knit_keys, 'topological', True):
-            lines[record.key] = split_lines(record.get_bytes_as('fulltext'))
+            lines[record.key] = chunks_to_lines(record.get_bytes_as('chunked'))
         # line_block_dict = {}
         # for parent, blocks in record.extract_line_blocks():
         #   line_blocks[parent] = blocks
@@ -1251,8 +1289,7 @@
             lines = self._lines[key]
             parents = self._parents[key]
             pending.remove(key)
-            yield FulltextContentFactory(key, parents, None,
-                ''.join(lines))
+            yield ChunkedContentFactory(key, parents, None, lines)
         for versionedfile in self.fallback_versionedfiles:
             for record in versionedfile.get_record_stream(
                 pending, 'unordered', True):
@@ -1422,9 +1459,9 @@
             if lines is not None:
                 if not isinstance(lines, list):
                     raise AssertionError
-                yield FulltextContentFactory((k,), None,
+                yield ChunkedContentFactory((k,), None,
                     sha1=osutils.sha_strings(lines),
-                    text=''.join(lines))
+                    chunks=lines)
             else:
                 yield AbsentContentFactory((k,))
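[Editor's note] The point of the new factory is that producers already holding a list of lines (like the memory-backed store above) no longer pay for ''.join() unless a consumer actually asks for a fulltext. A standalone sketch mirroring the patch (UnavailableRepresentation here is a stand-in for bzrlib.errors.UnavailableRepresentation):

```python
class UnavailableRepresentation(Exception):
    """Stand-in for bzrlib.errors.UnavailableRepresentation."""


class ChunkedContentFactory(object):
    """Minimal mirror of the new factory: hold chunks, join on demand."""

    def __init__(self, key, parents, sha1, chunks):
        self.sha1 = sha1
        self.storage_kind = 'chunked'
        self.key = key
        self.parents = parents
        self._chunks = chunks

    def get_bytes_as(self, storage_kind):
        if storage_kind == 'chunked':
            return self._chunks            # zero-copy: the stored list itself
        elif storage_kind == 'fulltext':
            return ''.join(self._chunks)   # join only when a fulltext is wanted
        raise UnavailableRepresentation(
            '%r not available as %r' % (self.key, storage_kind))


factory = ChunkedContentFactory(('A',), None, None, ['FOO', 'BAR'])
```

This matches the test above: record.get_bytes_as('chunked') hands back ['FOO', 'BAR'] while record.get_bytes_as('fulltext') yields 'FOOBAR'.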
=== modified file 'bzrlib/weave.py'
--- a/bzrlib/weave.py 2008-10-01 05:40:45 +0000
+++ b/bzrlib/weave.py 2008-12-11 03:18:52 +0000
@@ -79,6 +79,8 @@
 from bzrlib import tsort
 """)
 from bzrlib import (
+    errors,
+    osutils,
     progress,
     )
 from bzrlib.errors import (WeaveError, WeaveFormatError, WeaveParentMismatch,
@@ -88,7 +90,6 @@
         WeaveRevisionAlreadyPresent,
         WeaveRevisionNotPresent,
         )
-import bzrlib.errors as errors
 from bzrlib.osutils import dirname, sha, sha_strings, split_lines
 import bzrlib.patiencediff
 from bzrlib.revision import NULL_REVISION
@@ -122,6 +123,8 @@
     def get_bytes_as(self, storage_kind):
         if storage_kind == 'fulltext':
             return self._weave.get_text(self.key[-1])
+        elif storage_kind == 'chunked':
+            return self._weave.get_lines(self.key[-1])
         else:
             raise UnavailableRepresentation(self.key, storage_kind, 'fulltext')
@@ -357,9 +360,10 @@
                 raise RevisionNotPresent([record.key[0]], self)
             # adapt to non-tuple interface
             parents = [parent[0] for parent in record.parents]
-            if record.storage_kind == 'fulltext':
+            if (record.storage_kind == 'fulltext'
+                or record.storage_kind == 'chunked'):
                 self.add_lines(record.key[0], parents,
-                    split_lines(record.get_bytes_as('fulltext')))
+                    osutils.chunks_to_lines(record.get_bytes_as('chunked')))
             else:
                 adapter_key = record.storage_kind, 'fulltext'
                 try:
=== modified file 'setup.py'
--- a/setup.py 2008-10-16 03:58:42 +0000
+++ b/setup.py 2008-12-11 02:18:59 +0000
@@ -258,6 +258,7 @@
 add_pyrex_extension('bzrlib._btree_serializer_c')
+add_pyrex_extension('bzrlib._chunks_to_lines_pyx')
 add_pyrex_extension('bzrlib._knit_load_data_c')
 if sys.platform == 'win32':
     add_pyrex_extension('bzrlib._dirstate_helpers_c',
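[Editor's note] Registering the Pyrex extension in setup.py pairs with the usual bzrlib pattern of preferring the compiled module and falling back to pure Python when the build is unavailable (this is what test_is_compiled checks through osutils). A generic sketch of that pattern, with hypothetical module names and a simplified fallback body:

```python
# Prefer the compiled extension; fall back to pure Python if it isn't built.
try:
    from _chunks_to_lines_pyx import chunks_to_lines  # hypothetical compiled module
except ImportError:
    def chunks_to_lines(chunks):
        # Simplified pure-Python fallback: join and re-split on '\n' only.
        parts = ''.join(chunks).split('\n')
        lines = [part + '\n' for part in parts[:-1]]
        if parts[-1]:
            lines.append(parts[-1])
        return lines
```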