[MERGE] http+pycurl supports redirects

Wed Jul 19 03:02:19 BST 2006

On 7/19/06, John Arbash Meinel <john at arbash-meinel.com> wrote:
> The reason we failed to support pycurl redirects is because pycurl would
> put both the original redirect headers, and the final content request
> headers into the 'header file'.
>
> The attached patch just updates the header parser, so that it checks to
> see if there is anything after the initial headers. If there is, it
> extracts that as the real header and returns it.
>
> The attached patch contains the actual response headers from trying to
> download from bazaar-ng.org. And tests that we parse it correctly.
>
> I also verified that we can do a plain download from bazaar-ng.org.
>
> I think I'd like to file a bug that we don't *know* when we are getting
> a redirect, and alert the user that they really should update their url.
> (downloading over a redirect will cause an extra round-trip for each
> request).
>
> Either that, or we could allow some way for transports to change their
> base URL on a redirect. But it kind of complicates things, because we
> don't know about the redirect until the first request.
>
> Also, when it was proposed a while ago, it was discussed that someone
> might just redirect the branch location, but not the repository
> location. Anyway, I think we probably want a warning if we get a
> redirect. But right now we don't have a good way to extract that
> information from pycurl or urllib, because both of them auto-follow the
> redirect for us.
>
> So with this patch, I think I've satisfied all of the regressions about
> the new http support. I'll wait for reviews, but rather than waiting
> another 2 weeks, I'll merge it tomorrow if nobody complains.

Thanks for fixing it up, I haven't had the time lately.

=== modified file 'bzrlib/transport/http/__init__.py'

--- bzrlib/transport/http/__init__.py	2006-07-14 19:25:24 +0000
+++ bzrlib/transport/http/__init__.py	2006-07-18 20:57:53 +0000
@@ -75,7 +74,7 @@
     return url


-def _extract_headers(header_file, skip_first=True):
+def _extract_headers(header_file, skip_first=True, body_is_header=True):
     """Extract the mapping for an rfc822 header

     This is a helper function for the test suite, and for _pycurl.
@@ -84,12 +83,24 @@
     :param header_file: A file-like object to read
     :param skip_first: HTTP headers start with the HTTP response as
                        the first line. Skip this line while parsing
+    :param body_is_header: When pycurl gets a redirect request, it saves
+            both the redirect headers and the final headers in the header
+            file. Which means we really want the latter headers, not the
+            former.
     :return: mimetools.Message object
     """
     header_file.seek(0, 0)
     if skip_first:
         header_file.readline()
     m = mimetools.Message(header_file)
+    if body_is_header:
+        m.rewindbody()
+        remaining = header_file.read()
+        # Ignore some extra whitespace, but if we have acutal content
+        # lines, then the later content superceeds the eariler.
+        if remaining.strip() != '':
+            return _extract_headers(StringIO(remaining), skip_first=skip_first,
+                                    body_is_header=True)
     return m


I don't understand what "body_is_header" is for, we only ever want the
headers associated with the final response don't we?

I also wonder if it should be a little stricter, and look for
"\r\n\r\nHTTP" as the sign that a new set of headers is starting? I
also think it should be a loop so we can handle a double/triple
redirect.

+    def test_redirect_body_is_body(self):
+        """Check that we parse the right portion if body_is_header is True"""
+        self.use_response(_redirect_response, body_is_header=False)
+        self.assertRaises(KeyError, self.headers.__getitem__, 'Content-Range')
+        self.check_header('Content-Type', 'text/html; charset=iso-8859-1')
+        self.check_header('Location',
+            'http://bazaar-vcs.org/bzr/bzr.dev/.bzr/repository/inventory.knit')

The docstring mentions body_is_header=True, but then you pass False ?

cheers