[merge] http multirange support
Martin Pool
mbp at canonical.com
Tue Jul 18 09:16:08 BST 2006
On 13 Jul 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> The attached diff is an implementation of multi range support for http
> requests. It enables them on both http and urllib. And then goes one
> step further to re-use the curl connections to enable keep-alive support.
>
> Thanks go to Michael Ellerman for writing the multi-range support, and
> Johan Rydberg for some of the test suite. I finished up the test suite,
> and have been abuse testing it for a little while.
>
> The branch is available from here:
> http://bzr.arbash-meinel.com/branches/bzr/http/
Well, that's a very nice performance improvement to see. On the other
hand, failing to follow redirects is a substantial regression,
particularly as people may still be trying to pull from bazaar-ng.org,
which currently does a redirect. (I suppose we could, and possibly
should, change it so that /bzr just passes through, since they're still
the same machine.)
I presume there must be some reason why you had to turn off curl's
FOLLOWLOCATION, but what was it? If that won't work, perhaps we can
raise a special exception class on the redirect and redo it in the
Transport?
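If FOLLOWLOCATION really can't be used, the Transport-level retry might look
something like this sketch. Note that `RedirectRequested` and the `fetch`
callable are hypothetical names for illustration, not existing bzrlib API:

```python
class RedirectRequested(Exception):
    """Raised when the server answers with a 3xx redirect."""

    def __init__(self, source, target):
        Exception.__init__(self, '%s redirects to %s' % (source, target))
        self.source = source
        self.target = target


def get_with_redirects(fetch, url, limit=5):
    """Call fetch(url), following up to `limit` redirects.

    fetch() is expected to raise RedirectRequested instead of silently
    following the Location header itself.
    """
    for _ in range(limit):
        try:
            return fetch(url)
        except RedirectRequested as redirect:
            url = redirect.target
    raise OSError('too many redirects starting from %s' % url)
```

The point of the exception-based approach is that the Transport gets to see
the new location, so it can record it or re-root itself, instead of curl
hiding the redirect entirely.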
Anyhow I've read the patch and it looks reasonable, with a few comments.
It looks like it would put us in a good position to do a fast pure
python transport. I'll check it out on the network and against thttpd.
I do have some concerns about putting it in 0.9 with redirects broken.
> +def handle_response(url, code, headers, data):
> + """Interpret the code & headers and return a HTTP response.
> +
> + This is a factory method which returns an appropriate HTTP response
> + based on the code & headers it's given.
> +
> + :param url: The url being processed. Mostly for error reporting
> + :param code: The integer HTTP response code
> + :param headers: A dict-like object that contains the HTTP response headers
> + :param data: A file-like object that can be read() to get the
> + requested data
> + :return: A file-like object that can seek()+read() the
> + ranges indicated by the headers.
> + """
> +
> + if code == 206:
> + try:
> + content_type = headers['Content-Type']
> + except KeyError:
> + raise errors.InvalidHttpContentType(url, '',
> + msg = 'Missing Content-Type')
There shouldn't be any space around the equals sign in keyword
arguments.
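For reference, the PEP 8 form, shown on a stand-in function since
`errors.InvalidHttpContentType` isn't importable here:

```python
def invalid_content_type(url, content_type, msg='unknown reason'):
    # PEP 8: no spaces around '=' for keyword arguments or default values.
    return 'Invalid Content-Type for %s: %s' % (url, msg)

# The keyword call is written the same way, without spaces:
message = invalid_content_type('http://example.com/f', '', msg='Missing Content-Type')
```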
> +
> + if _is_multipart(content_type):
> + # Full fledged multipart response
> + return HttpMultipartRangeResponse(url, content_type, data)
> + else:
> + # A response to a range request, but not multipart
> + try:
> + content_range = headers['Content-Range']
> + except KeyError:
> + raise errors.InvalidHttpResponse(url,
> + 'Missing the Content-Range header in a 206 range response')
> + return HttpRangeResponse(url, content_range, data)
> + elif code == 200:
> + # A regular non-range response, unfortunately the result from
> + # urllib doesn't support seek, so we wrap it in a StringIO
> + tell = getattr(data, 'tell', None)
> + if tell is None:
> + return StringIO(data.read())
> + return data
Why not just do 'if not hasattr'...?
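The hasattr form would read like this (sketched with `io.BytesIO` standing in
for the StringIO used in the patch, and a made-up helper name):

```python
from io import BytesIO

def ensure_seekable(data):
    # If the response object cannot tell()/seek(), buffer it in memory.
    if not hasattr(data, 'tell'):
        return BytesIO(data.read())
    return data
```

One argument for the getattr form is that it performs only a single attribute
lookup, but hasattr states the intent more directly.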
> + elif code == 404:
> + raise errors.NoSuchFile(url)
> +
> + # TODO: jam 20060713 Properly handle redirects (302 Found, etc)
> + # The '_get' code says to follow redirects, we probably
> + # should actually handle the return values
> +
> + raise errors.InvalidHttpResponse(url, "Unknown response code %s" % (code,))
> +
>
Just for obviousness, I would suggest putting the 'raise' in an else block
immediately after the last elif...
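The shape being suggested, with the real return values and error classes
reduced to placeholders for brevity:

```python
def handle_response(url, code):
    # Every branch of the status dispatch is explicit; the final 'else'
    # makes it obvious that unknown codes always raise.
    if code == 206:
        return 'partial response'
    elif code == 200:
        return 'full response'
    elif code == 404:
        raise FileNotFoundError(url)
    else:
        raise ValueError('Unknown response code %s for %s' % (code, url))
```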
> def has(self, relpath):
> - curl = pycurl.Curl()
> + """See Transport.has()"""
> + # We set NO BODY=0 in _get_full, so it should be safe
> + # to re-use the non-range curl object
> + curl = self._base_curl
> abspath = self._real_abspath(relpath)
> curl.setopt(pycurl.URL, abspath)
> - curl.setopt(pycurl.FOLLOWLOCATION, 1) # follow redirect responses
> self._set_curl_options(curl)
> # don't want the body - ie just do a HEAD request
> + # This means "NO BODY" not 'nobody'
> curl.setopt(pycurl.NOBODY, 1)
> self._curl_perform(curl)
> code = curl.getinfo(pycurl.HTTP_CODE)
> @@ -92,38 +104,80 @@
> return False
> elif code in (200, 302): # "ok", "found"
> return True
> - elif code == 0:
> - self._raise_curl_connection_error(curl)
> else:
> self._raise_curl_http_error(curl)
I see you now call _raise_curl_connection_error from other places - is
it really no longer required from here? There are some other places
too.
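For context, pycurl's getinfo(HTTP_CODE) returns 0 when no HTTP status line
was received at all, so dropping the code == 0 branch silently reclassifies
connection failures as HTTP errors. A status check probably still wants
something like the following (plain exceptions stand in for the bzrlib
error-raising helpers):

```python
def check_status(code, url):
    # pycurl reports HTTP_CODE == 0 when the connection itself failed,
    # i.e. the server never sent an HTTP response.
    if code == 0:
        raise ConnectionError('no response from %s' % url)
    elif code == 404:
        return False
    elif code in (200, 302):  # "ok", "found"
        return True
    else:
        raise OSError('HTTP error %s for %s' % (code, url))
```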
> + def _get_full(self, relpath):
> + """Make a request for the entire file"""
> + curl = self._base_curl
> + abspath, data, header = self._setup_get_request(curl, relpath)
> self._curl_perform(curl)
> +
> code = curl.getinfo(pycurl.HTTP_CODE)
> + data.seek(0)
> +
> if code == 404:
> raise NoSuchFile(abspath)
> - elif code == 200:
> - sio.seek(0)
> - return code, sio
> - elif code == 206 and (ranges is not None):
> - sio.seek(0)
> - return code, sio
> - elif code == 0:
> - self._raise_curl_connection_error(curl)
> - else:
> - self._raise_curl_http_error(curl)
> + if code != 200:
> + raise errors.InvalidHttpResponse(abspath,
> + 'Expected a 200 or 404 response, not: %s' % code)
> +
> + return code, data
Raising directly from here seems like a regression from calling
_raise_curl_http_error, which centralizes the error handling and
includes the eventual url; that is often useful.
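A centralized raiser keeps the url in every message. The names below are
stand-ins for the existing _raise_curl_http_error, with plain exceptions in
place of the bzrlib error classes:

```python
def raise_http_error(url, code, msg=None):
    # One place builds every HTTP error message, so the url is never lost.
    detail = ': %s' % msg if msg else ''
    raise OSError('Invalid HTTP response %s for %s%s' % (code, url, detail))

def check_get_full(url, code):
    """Validate the status of a whole-file GET before returning data."""
    if code == 404:
        raise FileNotFoundError(url)
    if code != 200:
        raise_http_error(url, code, 'Expected a 200 or 404 response')
```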
--
Martin
More information about the bazaar mailing list