[merge] http multirange support
Martin Pool
mbp at canonical.com
Tue Jul 18 09:16:08 BST 2006
On 13 Jul 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> The attached diff is an implementation of multi range support for http
> requests. It enables them on both http and urllib. And then goes one
> step further to re-use the curl connections to enable keep-alive support.
>
> Thanks go to Michael Ellerman for writing the multi-range support, and
> Johan Rydberg for some of the test suite. I finished up the test suite,
> and have been abuse testing it for a little while.
>
> The branch is available from here:
> http://bzr.arbash-meinel.com/branches/bzr/http/
Well, that's a very nice performance improvement to see. On the other
hand, failing to follow redirects is a substantial regression,
particularly as people may still be trying to pull from bazaar-ng.org,
which currently does a redirect. (I suppose we could, and possibly
should, change it so that /bzr just passes through, since they're still
the same machine.)
I presume there must be some reason why you had to turn off curl's
FOLLOWLOCATION, but what was it? If that won't work, perhaps we can
raise a special exception class on the redirect and redo it in the
Transport?
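If FOLLOWLOCATION really can't be used, the Transport-level retry might look
something like this sketch. Note that `RedirectRequested` and the `fetch`
callable are hypothetical names for illustration, not existing bzrlib API:

```python
class RedirectRequested(Exception):
    """Raised when the server answers with a 3xx redirect."""

    def __init__(self, source, target):
        Exception.__init__(self, '%s redirects to %s' % (source, target))
        self.source = source
        self.target = target


def get_with_redirects(fetch, url, limit=5):
    """Call fetch(url), following up to `limit` redirects.

    fetch() is expected to raise RedirectRequested instead of silently
    following the Location header itself.
    """
    for _ in range(limit):
        try:
            return fetch(url)
        except RedirectRequested as redirect:
            url = redirect.target
    raise OSError('too many redirects starting from %s' % url)
```

The point of the exception-based approach is that the Transport gets to see
the new location, so it can record it or re-root itself, instead of curl
hiding the redirect entirely.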
Anyhow I've read the patch and it looks reasonable, with a few comments.
It looks like it would put us in a good position to do a fast pure
python transport. I'll check it out on the network and against thttpd.
I do have some concerns about putting it in 0.9 with redirects broken.
> +def handle_response(url, code, headers, data):
> + """Interpret the code & headers and return a HTTP response.
> +
> + This is a factory method which returns an appropriate HTTP response
> + based on the code & headers it's given.
> +
> + :param url: The url being processed. Mostly for error reporting
> + :param code: The integer HTTP response code
> + :param headers: A dict-like object that contains the HTTP response headers
> + :param data: A file-like object that can be read() to get the
> + requested data
> + :return: A file-like object that can seek()+read() the
> + ranges indicated by the headers.
> + """
> +
> + if code == 206:
> + try:
> + content_type = headers['Content-Type']
> + except KeyError:
> + raise errors.InvalidHttpContentType(url, '',
> + msg = 'Missing Content-Type')
There shouldn't be any space around the equals sign in keyword
arguments.
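For reference, the PEP 8 form, shown on a stand-in function since
`errors.InvalidHttpContentType` isn't importable here:

```python
def invalid_content_type(url, content_type, msg='unknown reason'):
    # PEP 8: no spaces around '=' for keyword arguments or default values.
    return 'Invalid Content-Type for %s: %s' % (url, msg)

# The keyword call is written the same way, without spaces:
message = invalid_content_type('http://example.com/f', '', msg='Missing Content-Type')
```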
> +
> + if _is_multipart(content_type):
> + # Full fledged multipart response
> + return HttpMultipartRangeResponse(url, content_type, data)
> + else:
> + # A response to a range request, but not multipart
> + try:
> + content_range = headers['Content-Range']
> + except KeyError:
> + raise errors.InvalidHttpResponse(url,
> + 'Missing the Content-Range header in a 206 range response')
> + return HttpRangeResponse(url, content_range, data)
> + elif code == 200:
> + # A regular non-range response, unfortunately the result from
> + # urllib doesn't support seek, so we wrap it in a StringIO
> + tell = getattr(data, 'tell', None)
> + if tell is None:
> + return StringIO(data.read())
> + return data
Why not just do 'if not hasattr'...?
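The hasattr form would read like this (sketched with `io.BytesIO` standing in
for the StringIO used in the patch, and a made-up helper name):

```python
from io import BytesIO

def ensure_seekable(data):
    # If the response object cannot tell()/seek(), buffer it in memory.
    if not hasattr(data, 'tell'):
        return BytesIO(data.read())
    return data
```

One argument for the getattr form is that it performs only a single attribute
lookup, but hasattr states the intent more directly.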
> + elif code == 404:
> + raise errors.NoSuchFile(url)
> +
> + # TODO: jam 20060713 Properly handle redirects (302 Found, etc)
> + # The '_get' code says to follow redirects, we probably
> + # should actually handle the return values
> +
> + raise errors.InvalidHttpResponse(url, "Unknown response code %s" % (code,))
> +
>
Just for obviousness, I would suggest putting the 'raise' in an else block
immediately after the last elif...
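The shape being suggested, with the real return values and error classes
reduced to placeholders for brevity:

```python
def handle_response(url, code):
    # Every branch of the status dispatch is explicit; the final 'else'
    # makes it obvious that unknown codes always raise.
    if code == 206:
        return 'partial response'
    elif code == 200:
        return 'full response'
    elif code == 404:
        raise FileNotFoundError(url)
    else:
        raise ValueError('Unknown response code %s for %s' % (code, url))
```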
> def has(self, relpath):
> - curl = pycurl.Curl()
> + """See Transport.has()"""
> + # We set NO BODY=0 in _get_full, so it should be safe
> + # to re-use the non-range curl object
> + curl = self._base_curl
> abspath = self._real_abspath(relpath)
> curl.setopt(pycurl.URL, abspath)
> - curl.setopt(pycurl.FOLLOWLOCATION, 1) # follow redirect responses
> self._set_curl_options(curl)
> # don't want the body - ie just do a HEAD request
> + # This means "NO BODY" not 'nobody'
> curl.setopt(pycurl.NOBODY, 1)
> self._curl_perform(curl)
> code = curl.getinfo(pycurl.HTTP_CODE)
> @@ -92,38 +104,80 @@
> return False
> elif code in (200, 302): # "ok", "found"
> return True
> - elif code == 0:
> - self._raise_curl_connection_error(curl)
> else:
> self._raise_curl_http_error(curl)
I see you now call _raise_curl_connection_error from other places - is
it really no longer required from here? There are some other places
too.
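For context, pycurl's getinfo(HTTP_CODE) returns 0 when no HTTP status line
was received at all, so dropping the code == 0 branch silently reclassifies
connection failures as HTTP errors. A status check probably still wants
something like the following (plain exceptions stand in for the bzrlib
error-raising helpers):

```python
def check_status(code, url):
    # pycurl reports HTTP_CODE == 0 when the connection itself failed,
    # i.e. the server never sent an HTTP response.
    if code == 0:
        raise ConnectionError('no response from %s' % url)
    elif code == 404:
        return False
    elif code in (200, 302):  # "ok", "found"
        return True
    else:
        raise OSError('HTTP error %s for %s' % (code, url))
```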
> + def _get_full(self, relpath):
> + """Make a request for the entire file"""
> + curl = self._base_curl
> + abspath, data, header = self._setup_get_request(curl, relpath)
> self._curl_perform(curl)
> +
> code = curl.getinfo(pycurl.HTTP_CODE)
> + data.seek(0)
> +
> if code == 404:
> raise NoSuchFile(abspath)
> - elif code == 200:
> - sio.seek(0)
> - return code, sio
> - elif code == 206 and (ranges is not None):
> - sio.seek(0)
> - return code, sio
> - elif code == 0:
> - self._raise_curl_connection_error(curl)
> - else:
> - self._raise_curl_http_error(curl)
> + if code != 200:
> + raise errors.InvalidHttpResponse(abspath,
> + 'Expected a 200 or 404 response, not: %s' % code)
> +
> + return code, data
Raising directly from here seems like a regression from calling
_raise_curl_http_error, which centralizes the error handling and
includes the eventual url; that is often useful.
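A centralized raiser keeps the url in every message. The names below are
stand-ins for the existing _raise_curl_http_error, with plain exceptions in
place of the bzrlib error classes:

```python
def raise_http_error(url, code, msg=None):
    # One place builds every HTTP error message, so the url is never lost.
    detail = ': %s' % msg if msg else ''
    raise OSError('Invalid HTTP response %s for %s%s' % (code, url, detail))

def check_get_full(url, code):
    """Validate the status of a whole-file GET before returning data."""
    if code == 404:
        raise FileNotFoundError(url)
    if code != 200:
        raise_http_error(url, code, 'Expected a 200 or 404 response')
```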
--
Martin
More information about the bazaar mailing list