[MERGE][bug #173010] initial branch over http: can show no ui activity for a very long time
Andrew Bennetts
andrew at canonical.com
Thu Dec 6 07:12:55 GMT 2007
Vincent Ladeuil wrote:
> "Gimme my blinkenlights back !" said the user...
>
> Here you are.
>
> The attached fix is a complete rewrite of the http response
> handling.
[...]
This is pretty cool stuff. I'm going to be a wuss for the moment and make this
vote:
bb:comment
But I intend to change that to something more helpful shortly :) This is just
so you can start dealing with my comments while I finish thinking about this
code.
I'm a little bit scared about this for 1.0, to be honest. While I think the
code is pretty good quality, I fear that stressing servers with more
ambitious range requests is going to turn up new problems in various HTTP
servers and proxies. But on the other hand it'd be very nice to have this...
Anyway, here's my initial review. I've tried not to repeat comments John's
already made.
-Andrew.
> # Begin patch
> === modified file 'NEWS'
> --- NEWS 2007-12-02 14:35:01 +0000
> +++ NEWS 2007-12-03 20:33:05 +0000
> @@ -14,6 +14,11 @@
>
> INTERNALS:
>
> + * readv urllib http implementation is now a real iterator above the
> + underlying socket and deliver data as soon as it arrives. 'get' still
> + wraps its ouput in a StringIO.
> + (Vincent Ladeuil)
Typo: “ouput” -> “output”
> === modified file 'bzrlib/errors.py'
> --- bzrlib/errors.py 2007-11-30 07:45:56 +0000
> +++ bzrlib/errors.py 2007-12-03 08:33:06 +0000
> @@ -1455,11 +1455,10 @@
>
> class InvalidRange(TransportError):
>
> - _fmt = "Invalid range access in %(path)s at %(offset)s."
> -
> - def __init__(self, path, offset):
> - TransportError.__init__(self, ("Invalid range access in %s at %d"
> - % (path, offset)))
> + _fmt = "Invalid range access in %(path)s at %(offset)s: %(msg)s"
> +
> + def __init__(self, path, offset, msg=None):
> + TransportError.__init__(self, msg)
> self.path = path
> self.offset = offset
Hmm, no corresponding change to test_errors.py? Not a blocker, but it'd be nice
to have a test for the formatting of this exception with both two and three
arguments.
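Something like this, say (untested sketch; I've inlined simplified stand-ins
for the real bzrlib classes just so the shape is visible, the real test would
of course use bzrlib.errors directly):

```python
# Untested sketch of the sort of test_errors.py addition I mean.  These
# classes are simplified stand-ins for bzrlib's TransportError and
# InvalidRange, reduced to what's needed to show the two- and
# three-argument formatting cases.
class TransportError(Exception):
    def __init__(self, msg=None):
        Exception.__init__(self, msg)
        self.msg = msg

class InvalidRange(TransportError):
    _fmt = "Invalid range access in %(path)s at %(offset)s: %(msg)s"

    def __init__(self, path, offset, msg=None):
        TransportError.__init__(self, msg)
        self.path = path
        self.offset = offset

    def __str__(self):
        return self._fmt % self.__dict__

def test_invalid_range_two_args():
    # With no msg we currently get a trailing 'None' -- worth asserting
    # (or fixing) explicitly.
    e = InvalidRange('a/file', 12)
    assert str(e) == 'Invalid range access in a/file at 12: None'

def test_invalid_range_three_args():
    e = InvalidRange('a/file', 12, 'readv past end of range')
    assert str(e) == ('Invalid range access in a/file at 12: '
                      'readv past end of range')

test_invalid_range_two_args()
test_invalid_range_three_args()
```

Note that the two-argument case surfaces a literal "None" in the message,
which is itself something the test would usefully pin down.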
> === modified file 'bzrlib/tests/test_http_response.py'
> --- bzrlib/tests/test_http_response.py 2006-10-16 01:50:48 +0000
> +++ bzrlib/tests/test_http_response.py 2007-12-03 17:53:35 +0000
> @@ -1,4 +1,4 @@
> -# Copyright (C) 2005, 2006 Canonical Ltd
> +# Copyright (C) 2005, 2006, 2007 Canonical Ltd
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License as published by
> @@ -17,7 +17,7 @@
> """Tests from HTTP response parsing."""
>
> from cStringIO import StringIO
> -import mimetools
> +import httplib
>
> from bzrlib import errors
> from bzrlib.transport import http
> @@ -25,282 +25,176 @@
> from bzrlib.tests import TestCase
>
>
> -class TestResponseRange(TestCase):
> - """Test the ResponseRange class."""
> -
> - def test_cmp(self):
> - RR = response.ResponseRange
> - r1 = RR(0, 10, 0)
> - r2 = RR(15, 20, 10)
> - self.assertTrue(r1 < r2)
> - self.assertFalse(r1 > r2)
> - self.assertTrue(r1 < 5)
> - self.assertFalse(r2 < 5)
> -
> - self.assertEqual(RR(0, 10, 5), RR(0, 10, 5))
> - self.assertNotEqual(RR(0, 10, 5), RR(0, 8, 5))
> - self.assertNotEqual(RR(0, 10, 5), RR(0, 10, 6))
> -
> - def test_sort_list(self):
> - """Ensure longer ranges are sorted after shorter ones"""
> - RR = response.ResponseRange
> - lst = [RR(3, 8, 0), 5, RR(3, 7, 0), 6]
> - lst.sort()
> - self.assertEqual([RR(3,7,0), RR(3,8,0), 5, 6], lst)
> -
> -
> -class TestRangeFile(TestCase):
> +class TestRangeFileAccess(TestCase):
> """Test RangeFile."""
The docstring is very terse, and no longer obviously corresponds to the class
name. Could you please expand the docstring a little to explain which aspect of
RangeFile it is that you're testing here? (I assume this TestCase isn't just
for testing RangeFile in general, because otherwise you wouldn't have changed
the name.)
> def setUp(self):
> - content = "abcdefghijklmnopqrstuvwxyz"
> - self.fp = response.RangeFile('foo', StringIO(content))
> - self.fp._add_range(0, 9, 0)
> - self.fp._add_range(20, 29, 10)
> - self.fp._add_range(30, 39, 15)
> + self.alpha = 'abcdefghijklmnopqrstuvwxyz'
> + # Each file is defined as a tuple (builder, start), 'builder' is a
> + # callable returning a RangeFile and 'start' the start of the first (or
> + # unique) range.
> + self.files = [(self._file_size_unknown, 0),
> + (self._file_size_known, 0),
> + (self._file_single_range, 10),
> + (self._file_multi_ranges, 10),]
I'm not convinced by the self.files variable. You almost always slice it.
You're basically using it as an ad hoc way to parameterise some tests. It might
be simpler to just explicitly write those simple test methods multiple times.
It might be helpful to have a list of all canned RangeFiles (or a helper method
to return such a list), but I don't think you should ever slice it, because that
makes it hard to add new fixtures to that list.
Btw, the term the xUnit patterns book uses for what you've called “builders” is
“creation method” (more or less). See
<http://xunitpatterns.com/Creation%20Method.html>.
> + def _file_size_unknown(self):
> + return response.RangeFile('Whole_file_size_unknown',
> + StringIO(self.alpha))
> +
> + def _file_size_known(self):
> + alpha = self.alpha
> + f = response.RangeFile('Whole_file_size_known', StringIO(alpha))
> + f.set_range(0, len(alpha))
> + return f
> +
> + def _file_single_range(self):
> + alpha = self.alpha
> + f = response.RangeFile('Single_range_file', StringIO(alpha))
> + f.set_range(10, len(alpha))
> + return f
> +
> + def _file_multi_ranges(self):
> + alpha = self.alpha
> +
> + boundary = 'separation'
> + bline = '--' + boundary + '\r\n'
> + content = []
> + content += bline
> + file_size = 200
> + for (start, part) in [(10, alpha), (100, alpha), (126, alpha.upper())]:
> + plen = len(part)
> + content += 'Content-Range: bytes %d-%d/%d\r\n' % (start,
> + start+plen-1,
> + file_size)
> + content += '\r\n'
> + content += part
> + content += bline
> +
> + data = ''.join(content)
> + f = response.RangeFile('Multiple_ranges_file', StringIO(data))
> + # Ranges are set by decoding the headers
> + f.set_boundary(boundary)
> + return f
Why bother with the content variable in this helper method? It's just a test
helper, micro-optimisations are less important than clarity. Just start with
“data = ''” rather than “content = []” and directly append strings to it,
rather than onto a list you later have to join. That makes one less variable
for the reader to understand.
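Concretely, I'd expect the helper to read more like this (untested sketch,
reusing the names from your patch; wrapping the result in a RangeFile and
calling set_boundary is elided here):

```python
# Untested sketch: the same multipart body as _file_multi_ranges, built by
# plain string concatenation rather than joining a list.
def multi_range_body(alpha='abcdefghijklmnopqrstuvwxyz'):
    boundary = 'separation'
    bline = '--' + boundary + '\r\n'
    file_size = 200
    data = bline
    for start, part in [(10, alpha), (100, alpha), (126, alpha.upper())]:
        # Each part: its Content-Range header, a blank line, the data,
        # then the boundary line closing the part.
        data += ('Content-Range: bytes %d-%d/%d\r\n'
                 % (start, start + len(part) - 1, file_size))
        data += '\r\n'
        data += part
        data += bline
    return data
```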
> + def _check_accesses_inside_range(self, f, start=0):
> + self.assertEquals(start, f.tell())
> + self.assertEquals('abc', f.read(3))
> + self.assertEquals('def', f.read(3))
> + self.assertEquals(start + 6, f.tell())
> + f.seek(start + 10)
> + self.assertEquals('klm', f.read(3))
> + self.assertEquals('no', f.read(2))
> + self.assertEquals(start + 15, f.tell())
> + # Unbounded read, should not cross range
> + self.assertEquals('pqrstuvwxyz', f.read())
This looks like a custom assertion method to me, so you ought simply name it
“assert...” rather than “_check...”.
I don't really understand what this is for. For example, why is it interesting
to assert that f.read(3) + f.read(3) gives 'abcdef', rather than simply
f.read(6)? Is it trying to make sure various edge cases of tell, read and seek
work? If so, it would be good to spell out exactly what those edge cases are
(like you do for the final assertion in the method).
And for that matter, it'd perhaps be good to say if there are any conditions
you're intentionally not trying to test (for instance, why not test seeking
backwards?). I guess I'm not sure what the intent here is, exactly. You appear
to be asserting that “f” is a file-like object with contents of self.alpha, but
it's not clear just exactly how close to an ordinary file you want to demand. A
statement like “check accesses inside range” (or even “assert that f is
file-like”) seems like an open-ended thing, rather than a concrete claim.
> def test_valid_accesses(self):
[...]
> + """Test valid accesses: inside one or more ranges"""
> + alpha = 'abcdefghijklmnopqrstuvwxyz'
> +
> + for builder, start in self.files[:3]:
> + self._check_accesses_inside_range(builder(), start)
> +
> + f = self._file_multi_ranges()
> + self._check_accesses_inside_range(f, start=10)
> + f.seek(100) # Will trigger the decoding and setting of the second range
> + self._check_accesses_inside_range(f, 100)
> +
> + f = self._file_multi_ranges()
> + f.seek(10)
> + # Seeking to a point between two ranges is possible (only once) but
> + # reading there is forbidden
> + f.seek(40)
> + # We crossed a range boundary, so now the file is positioned at the
> + # start of the new range (i.e. trying to seek below 100 will error out)
> + f.seek(100)
> + f.seek(125)
> +
> + f = self._file_multi_ranges()
> + self.assertEquals(self.alpha, f.read()) # Read first range
> + f.seek(100)
> + self.assertEquals(self.alpha, f.read()) # Read second range
> + self.assertEquals(126, f.tell())
> + f.seek(126) # Start of third range which is also the current pos !
> + self.assertEquals('A', f.read(1))
This looks like several different tests squished into a single method. I think
it'd be clearer as several smaller test methods.
First you check that RangeFiles of a whole file with unknown size, a whole file
with known size, and a single range of a file conform to whatever specification
_check_accesses_inside_range is testing for.
(Hmm, it's not clear to me why _file_single_range would pass
_check_accesses_inside_range either; doesn't starting at byte 10 mean the first
bytes should not be 'abc'?)
Then you do a much more complicated test of a much more complicated RangeFile,
the one built by _file_multi_ranges. This is inherently more involved, because
you want to exercise multiple ranges, and should definitely be a separate test.
I'd be tempted to simply make four test methods out of this, and remove the loop
entirely.
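i.e. something along these lines (untested; the creation methods and the
shared assertion are stubbed out here, the point is just the test structure
and the "assert..." naming):

```python
import unittest

# Untested sketch of the split I mean.  The creation methods stand in for
# the real ones and just return (fixture, start); assertAccessesInsideRange
# stands in for the real range-access checks.
class TestRangeFileAccess(unittest.TestCase):

    def _file_size_unknown(self):
        return 'size-unknown', 0

    def _file_size_known(self):
        return 'size-known', 0

    def _file_single_range(self):
        return 'single-range', 10

    def _file_multi_ranges(self):
        return 'multi-ranges', 10

    def assertAccessesInsideRange(self, f, start):
        # Stand-in for the real _check_accesses_inside_range assertions.
        self.assertEqual(start, f[1])

    def test_read_size_unknown(self):
        self.assertAccessesInsideRange(self._file_size_unknown(), start=0)

    def test_read_size_known(self):
        self.assertAccessesInsideRange(self._file_size_known(), start=0)

    def test_read_single_range(self):
        self.assertAccessesInsideRange(self._file_single_range(), start=10)

    def test_read_multi_ranges(self):
        # The multi-range case is inherently more involved and deserves
        # its own dedicated test(s) rather than sharing a loop.
        self.assertAccessesInsideRange(self._file_multi_ranges(), start=10)
```

Each method names its fixture explicitly, so adding a new fixture never
silently changes what an existing test covers.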
> + def _check_file_boundaries(self, f, start=0):
> + f.seek(start)
> + self.assertRaises(errors.InvalidRange, f.read, 27)
> + # Will seek past the range and then errors out
> + self.assertRaises(errors.InvalidRange, f.seek, start + 27)
> +
> + def _check_beyond_range(self, builder, start):
> + f = builder()
> + f.seek(start + 20)
> + # Will try to read past the end of the range
> + self.assertRaises(errors.InvalidRange, f.read, 10)
> +
> + def _check_seek_backwards(self, f, start=0):
> + f.read(start + 12)
> + # Can't seek backwards
> + self.assertRaises(errors.InvalidRange, f.seek, start + 5)
Again, I'd prefer custom assertion methods to be called “assert...”, following
the existing convention.
> def test_invalid_accesses(self):
[...]
> + """Test errors triggered by invalid accesses."""
> +
> + f = self._file_size_unknown()
You have an excess space after the equals sign.
> + self.assertRaises(errors.InvalidRange, f.seek, -1, 2)
> +
> + for builder, start in self.files:
> + self._check_seek_backwards(builder(), start)
> +
> + for builder, start in self.files[1:3]:
> + self._check_file_boundaries(builder(), start)
Here's some of that opaque slicing. It doesn't really make for readable tests,
I don't want to have to memorize the precise contents of self.files to be able
to understand these tests. I think the way our existing *_implementations tests
work, with explicit parameterisation and the occasional conditional “raise
TestNotApplicable” makes for a much clearer test suite than this ad hoc method.
I think the main issue really is long test methods that try to exercise a whole
class of conditions, rather than having small, focused methods that exercise one
condition each. I find that makes tests much easier to understand. If you want
to group tests together by type of problem (e.g. “invalid accesses”) maybe
grouping them as methods in a TestCase would be better than in one big test
method.
> class TestHandleResponse(TestCase):
> -
> +
> + def _build_HTTPMessage(self, raw_headers):
> + status_and_headers = StringIO(raw_headers)
> + # Get read of the status line
> + status_and_headers.readline()
I think you mean “Get rid of” here?
> === modified file 'bzrlib/transport/http/__init__.py'
[...]
> class HttpTransportBase(ConnectedTransport, medium.SmartClientMedium):
> """Base class for http implementations.
>
> @@ -175,7 +132,12 @@
> :param relpath: The relative path to the file
> """
> code, response_file = self._get(relpath, None)
> - return response_file
> + # FIXME: some callers want an iterable... One step forward, three steps
> + # backwards :-/ And not only an iterable, but an iterable that can be
> + # seeked backwards, so we will never be able to do that. One such
> + # known client is bzrlib.bundle.serializer.v4.get_bundle_reader. At the
> + # time of this writing it's even the only known client -- vila20071203
> + return StringIO(response_file.read())
>
> def _get(self, relpath, ranges, tail_amount=0):
> """Get a file, or part of a file.
> @@ -230,7 +192,7 @@
> try_again = False
> try:
> code, f = self._get(relpath, ranges)
> - except errors.InvalidRange, e:
> + except errors.InvalidHttpRange, e:
> if exc_info is None:
> exc_info = sys.exc_info()
> self._degrade_range_hint(relpath, ranges, exc_info)
> @@ -249,7 +211,8 @@
> # specified.
> _bytes_to_read_before_seek = 128
> # No limit on the offset number that get combined into one, we are trying
> - # to avoid downloading the whole file.
> + # to avoid downloading the whole file. But see _pycurl.py for a different
> + # use.
> _max_readv_combine = 0
> # By default Apache has a limit of ~400 ranges before replying with a 400
> # Bad Request. So we go underneath that amount to be safe.
> @@ -300,7 +263,14 @@
> if data_len != size:
> raise errors.ShortReadvError(relpath, start, size,
> actual=data_len)
> - data_map[(start, size)] = data
> + if (start, size) == cur_offset_and_size:
> + # The offset requested are sorted as the coalesced
> + # ones, not need to cache. Win !
Nitpick: the grammar seems slightly wrong in that comment. “as in the” and “no
need to” perhaps?
[...]
> - # Note that the following may raise errors.InvalidRange. It's the
> - # caller responsability to decide how to retry since it may provide
> - # different coalesced offsets.
> + # Note that the following may raise errors.InvalidHttpRange. It's
> + # the caller responsability to decide how to retry since it may
> + # provide different coalesced offsets.
Another nitpick: If you like, you can fix the spelling and grammar to say “the
caller's responsibility” while you're there :)
> === modified file 'bzrlib/transport/http/_pycurl.py'
[...]
>
> + # The mother class use 0 to minimize the requests, but since we can't
The normal term is “parent class” or “base class”. For some reason “mother
class” makes me think of motherships...
> + # exploit the results as soon as they are received (pycurl limitation) we'd
> + # better issue more requests and provide a more responsive UI do the cost
> + # of more latency costs.
> + # If you modify this think about modifying the comment in http/__init__.py
> + # too.
> + _max_readv_combine = 25
Given the numbers that John reported, I wonder if we've gotten the balance right
here? Roundtrips are expensive!
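Back-of-envelope, with all numbers invented for illustration:

```python
# With N coalesced offsets in one readv and _max_readv_combine = k, we
# issue ceil(N / k) requests, each paying one round-trip of latency.
def requests_needed(n_offsets, max_combine):
    if max_combine <= 0:
        return 1  # 0 means 'no limit': everything goes in one request
    return -(-n_offsets // max_combine)  # ceiling division

n = 200     # hypothetical number of coalesced offsets
rtt = 0.1   # hypothetical 100ms round-trip
for k in (0, 25, 100):
    print('combine=%3d -> %d requests, ~%.1fs of pure latency'
          % (k, requests_needed(n, k), requests_needed(n, k) * rtt))
```

At combine=25 that's 8 requests (~0.8s of latency) against 1 request for
the unlimited case, so the responsiveness win has a real price.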
> + def _parse_headers(self, status_and_headers):
> + """Transform the headers provided by curl into an HTTPMessage"""
> + status_and_headers.seek(0)
> + # Ignore status line
> + status_and_headers.readline()
> + msg = httplib.HTTPMessage(status_and_headers)
> + return msg
Interesting that you've duplicated this helper in the tests too. I wonder if
there's a better place for this? It's short enough that I'm not too bothered by
it.
> elif e[0] == CURLE_PARTIAL_FILE:
> - # Pycurl itself has detected a short read. We do
> - # not have all the information for the
> - # ShortReadvError, but that should be enough
> + # Pycurl itself has detected a short read. We do not have all
> + # the information for the ShortReadvError, but that should be
> + # enough
> raise errors.ShortReadvError(url,
> offset='unknown', length='unknown',
> actual='unknown',
> extra='Server aborted the request')
(Ah, so that's why I saw lots of "unknowns" in an error a while ago.)
> === modified file 'bzrlib/transport/http/_urllib2_wrappers.py'
> --- bzrlib/transport/http/_urllib2_wrappers.py 2007-11-30 09:51:22 +0000
> +++ bzrlib/transport/http/_urllib2_wrappers.py 2007-12-03 13:52:36 +0000
> @@ -1,4 +1,4 @@
> -# Copyright (C) 2006 Canonical Ltd
> +# Copyright (C) 2006,2007 Canonical Ltd
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License as published by
> @@ -20,7 +20,7 @@
>
> For instance, we create a new HTTPConnection and HTTPSConnection that inherit
> from the original urllib2.HTTP(s)Connection objects, but also have a new base
> -which implements a custom getresponse and fake_close handlers.
> +which implements a custom getresponse and cleanup_pipe handlers.
>
> And then we implement custom HTTPHandler and HTTPSHandler classes, that use
> the custom HTTPConnection classes.
> @@ -123,6 +123,20 @@
> # below we keep the socket with the server opened.
> self.will_close = False
>
> + def finish(self):
> + """Finish reading the bdy.
Typo: “body”
> +
> + In some cases, the client may have leave some bytes to read in the
Nitpick: “leave” -> “left”
> + body. That will block the next request to succeed if we use a
> + persistent connection. If we don't use a persitent connection, well,
Typo: “persitent” -> “persistent”
> === modified file 'bzrlib/transport/http/response.py'
> +import httplib
> +
> +from bzrlib import (
> + errors,
> + trace,
> + )
> +
> +
> +# A RangeFile should respect the following grammar (simplified to outline the
> +# assumptions we rely upon).
A RangeFile reads the response to an HTTP request AIUI, so “respect” seems like
an odd verb here, as I'd associate that more with generating conformant output.
Do you mean “A RangeFile expects the following grammar ...”?
> +
> +# file: whole_file
> +# | single_range
> +# | multiple_range
> +
> +# whole_file: [content_length_header] data
> +
> +# single_range: content_range_header data
> +
> +# multiple_range: boundary_header boundary (content_range_header data boundary)+
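(To make sure I'm reading the grammar right, here's the sort of body I
understand multiple_range to describe; boundary and sizes invented, and I've
mirrored your test helper in not emitting a final '--boundary--' terminator:)

```python
# Illustrative body for the multiple_range production: a boundary line,
# then (Content-Range header, blank line, data, boundary line) repeated.
body = (
    '--sep\r\n'
    'Content-Range: bytes 0-4/100\r\n'
    '\r\n'
    'abcde'
    '--sep\r\n'
    'Content-Range: bytes 90-94/100\r\n'
    '\r\n'
    'vwxyz'
    '--sep\r\n'
)
```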
>
> class RangeFile(object):
> """File-like object that allow access to partial available data.
>
> - Specified by a set of ranges.
> + All accesses should happen sequentially since the acquisition occurs during
> + an http response reception (as sockets can't be seeked, we simulate the
> + seek by just reading and discarding the data).
> +
> + The access pattern is defined by a set of ranges discovered as reading
> + progress. Only one range is available at a given time, so all accesses
> + should happen with monotonically increasing offsets.
> """
>
> - def __init__(self, path, input_file):
> + def __init__(self, path, infile):
> + """Constructor.
> +
> + :param path: File url, for error reports.
> + :param infile: File-like socket set at body start.
> + """
> self._path = path
> - self._pos = 0
> - self._len = 0
> - self._ranges = []
> - self._data = input_file.read()
> -
> - def _add_range(self, ent_start, ent_end, data_start):
> - """Add an entity range.
> -
> - :param ent_start: Start offset of entity
> - :param ent_end: End offset of entity (inclusive)
> - :param data_start: Start offset of data in data stream.
> - """
> - self._ranges.append(ResponseRange(ent_start, ent_end, data_start))
> - self._len = max(self._len, ent_end)
> -
> - def _finish_ranges(self):
> - self._ranges.sort()
> -
> - def read(self, size):
> + self._file = infile
> + self._boundary = None
> + # Default to the whole file of unspecified size
> + self.set_range(0, -1)
> +
> + def set_range(self, start, size):
> + """Change the range mapping"""
> + self._start = start
> + self._size = size
> + # Set the new _pos since that's what we want to expose
> + self._pos = self._start
> +
> + def set_boundary(self, boundary):
> + """Define the boundary used in a multi parts message.
> +
> + The file should be at the beggining of the body, the first range
Typo: “beginning”
> + definition is read and taken into account.
> + """
> + self._boundary = boundary
> + # Decode the headers and setup the first range
> + self.read_boundary()
> + self.read_range_definition()
> +
> + def read_boundary(self):
> + """Read the boundary headers defining a new range"""
> + boundary_line = '\r\n'
> + while boundary_line == '\r\n':
> + # RFC2616 19.2 Additional CRLFs msy preceed the first boundary
Typo: “may”
> + # string entity.
> + # To be on the safe side we allow it before any boundary line
> + boundary_line = self._file.readline()
> + if boundary_line != '--' + self._boundary + '\r\n':
> + raise errors.InvalidHttpResponse(
> + self._path,
> + "Expected a boundary (%s) line, got '%s'" % (self._boundary,
> + boundary_line))
> +
> + def read_range_definition(self):
> + """Read a new range definition in a multi parts message.
> +
> + Parse the headers including the empty line following them so that we
> + are ready to read the data itself.
> + """
> + msg = httplib.HTTPMessage(self._file, seekable=0)
> + # Extract the range definition
> + content_range = msg.getheader('content-range', None)
> + if content_range is None:
> + raise errors.InvalidHttpResponse(
> + self._path,
> + 'Content-Range header missing in a multi-part response')
> + self.set_range_from_header(content_range)
> +
> + def set_range_from_header(self, content_range):
> + """Create a new range from its description in the headers"""
> + try:
> + rtype, values = content_range.split()
> + except:
> + raise errors.InvalidHttpRange(self._path, content_range,
> + "Malformed Content-Range header '%s'"
> + % content_range)
> + if rtype != 'bytes':
PEP-8: Excess whitespace before “!=”.
[...]
> + if self._size > 0:
> + # Don't read past the range definition
> + limited = self._start + self._size - self._pos
> + if size > 0:
> + limited = min(limited, size)
> + data = self._file.read(limited)
> + else:
> + data = self._file.read(size)
If size is <= 0, why would calling self._file.read(size) be sensible? That
looks odd to me.
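For reference, the stdlib file protocol gives non-positive sizes special
meaning, which may be what's being leaned on here; but on a socket-backed
file, "read to EOF" is exactly the implicit slurp we're trying to avoid.
A quick demonstration with an in-memory file:

```python
# Stdlib read() semantics for non-positive sizes, shown on an in-memory
# file; a real socket-backed file would block or slurp the body instead.
from io import BytesIO

f = BytesIO(b'abcdef')
assert f.read(-1) == b'abcdef'   # negative size: read everything to EOF
f.seek(0)
assert f.read(0) == b''          # zero: read nothing
```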
-Andrew.