[MERGE][bug #173010] initial branch over http: can show no ui activity for a very long time

John Arbash Meinel john at arbash-meinel.com
Tue Dec 4 03:13:42 GMT 2007


John Arbash Meinel has voted comment.
Status is now: Semi-approved
Comment:
To start with, this definitely helps the interactivity of fetching over
http. And I would really like to see this in 1.0, though the final call
is for Martin to decide.

I'm getting a bit tired now, so I'll have to defer the rest of the
review until later. This is something I would like to see in 1.0, but I
really think it is the sort of thing that needs to cook for a bit in
bzr.dev before being released. readv() is pretty much how we handle all
of our data over http, and we need to know that we will work correctly
against all of the http servers out there, including some that are a
bit broken (I'm looking at *you* Cherokee v?; they did get better at
least).

If Martin says that bzr.dev is open for future work, and he will
cherry-pick stuff into 1.0, then this is probably tweak. I need to
fine-tooth it to make sure I've gotten everything I can see. But in
general you've done the testing, the code is good, etc.

Maybe when I wake up tomorrow, I'll recognize the sheer genius of the
change, and just give you a straight "go merge it" request...




Some numbers, doing a full branch of:
   http://people.ubuntu.com/~jameinel/test/bzr.dev/
(a pack repository with my local repo copied up. So it has 15 pack
files of various sizes, and lots of plugins mixed with the bzr code.
100ms ping from here)

bzr.dev + urllib:
     608s (10m8s) peak memory 119.6MB
vila + urllib:
     580s (9m40s) peak memory  98.7MB
vila + pycurl:
     862s (14m22s) peak memory 100.7MB

So the new form slows down pycurl a little bit (but makes it *much*
more interactive). (The progress bar hangs for a little bit, then
updates, then hangs, but it used to *just* hang for a long time.)

The HTTPTransport._readv() function seems to be getting pretty large. Is
it possible to factor more of it out into helper functions?

You use 'file' as a variable name. While valid, I prefer to avoid
variable names that shadow builtins (partially because ViM will
highlight them oddly, but mostly because it is usually bad form to
shadow a builtin).
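A minimal illustration of the concern (the function names and the
StringIO usage here are made up for illustration, not from the patch):

```python
from io import StringIO

# 'file' was a builtin in Python 2, so reusing it as a parameter name
# hides the builtin for the rest of the function body.
def first_line(file):        # shadows a builtin name
    return file.readline()

# Any descriptive non-builtin name avoids the problem entirely.
def first_line_clean(fileobj):
    return fileobj.readline()

print(first_line_clean(StringIO("hello\nworld\n")))
```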


This doesn't seem right:
          self.send_header("Content-Type",
                           "multipart/byteranges; boundary=%s" % boundary)
          self.end_headers()
+        self.wfile.write("--%s\r\n" % boundary)
          for (start, end) in ranges:
-            self.wfile.write("--%s\r\n" % boundary)
              self.send_header("Content-type", 'application/octet-stream')
              self.send_header("Content-Range", "bytes %d-%d/%d" % (start,
                                                                    end,
^- Isn't each range delimited by a boundary? (I haven't checked in a
while, though.)
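For reference, a sketch of what a conforming multipart/byteranges body
looks like (RFC 2616 style; the helper name and arguments are invented
for illustration, this is not the test server's actual code): every
part, including the first, is introduced by its own boundary line, and
a closing boundary ends the body.

```python
from io import BytesIO

def write_byteranges(wfile, ranges, data, boundary, total):
    for start, end in ranges:
        # Each part starts with its own boundary line plus part headers.
        wfile.write(b"--%s\r\n" % boundary)
        wfile.write(b"Content-Type: application/octet-stream\r\n")
        wfile.write(b"Content-Range: bytes %d-%d/%d\r\n\r\n" % (start, end, total))
        wfile.write(data[start:end + 1])  # byte ranges are inclusive
        wfile.write(b"\r\n")
    # The body ends with a closing boundary ("--" on both sides).
    wfile.write(b"--%s--\r\n" % boundary)

out = BytesIO()
write_byteranges(out, [(0, 2), (5, 7)], b"abcdefgh", b"BOUND", 8)
```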


For pycurl, what actually makes more sense is to make the max number of
ranges smaller, but the max combine larger. It will decrease the Header
overhead, while leaving the same number of data bytes on the wire. (The
real fix is to also switch to a max number of bytes per request sort of
thing.)
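Roughly what I mean, as a generic sketch (this is not bzr's actual
coalescing code; the function name and the max_gap parameter are
invented for illustration):

```python
def coalesce_offsets(offsets, max_gap):
    # Merge sorted (start, length) read requests whose gap is at most
    # max_gap bytes. Fewer ranges means fewer Content-Range headers on
    # the wire, at the cost of also fetching the bytes in the gaps.
    merged = []
    for start, length in sorted(offsets):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous range
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```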


...

+        # Seeking to a point between two ranges is possible (only once)
+        # but reading there is forbidden
+        f.seek(40)

^- I'm not sure why you need it to be valid to seek to a bad spot
between ranges, and why it would fail a second time....
I would tend not to make this part of the contract (and test) unless
there is a reason.


Why did you remove this test:
@@ -604,31 +385,20 @@
          # It is a StringIO from the original data
          self.assertEqual(_full_text_response[2], out.read())

-    def test_missing_response(self):
-        self.assertRaises(errors.NoSuchFile,
-            self.get_response, _missing_response)
-
      def test_single_range(self):
          out = self.get_response(_single_range_response)
-        self.assertIsInstance(out, response.HttpRangeResponse)
-
-        self.assertRaises(errors.InvalidRange, out.read, 20)

          out.seek(100)
          self.assertEqual(_single_range_response[2], out.read(100))


And why is test_single_range not testing the exception (does it leave
the file dirty?)

You remove a few more of these InvalidRange checks.

...

-    def test_missing_no_content_type(self):
-        # Without Content-Type we should still raise NoSuchFile on a 404
-        a_response = _missing_response
-        headers = http._extract_headers(a_response[1], 'http://missing')
-        del headers['Content-Type']
-        self.assertRaises(errors.NoSuchFile,
-            response.handle_response, 'http://missing', a_response[0], headers,
-                                      StringIO(a_response[2]))

^- This test was added, because the response code originally strictly
checked Content-Type and would fail with something like a KeyError if it
wasn't present.


...

As you mentioned in your emails, you should use 'izip' rather than
'imap':
-        for content, f in zip(contents, content_f):
+        # Use itertools.imap() instead of zip() or map(), since they
+        # fully evaluate their inputs; the transport requests should be
+        # issued and handled sequentially (we don't want to force
+        # transport to buffer).
+        for content, f in itertools.imap(None, contents, content_f):
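The laziness point, sketched (in Python 2 the lazy pairing is
itertools.izip; Python 3's built-in zip already behaves this way, and
is used below so the sketch is runnable):

```python
events = []

def tracked(name, items):
    # Record when each element is actually pulled from the source.
    for item in items:
        events.append((name, item))
        yield item

# Lazy pairing consumes one element from each source per step, so the
# transport requests would be issued and handled sequentially instead
# of all up front.
pairs = zip(tracked("content", [1, 2]), tracked("file", [10, 20]))
first = next(pairs)
# At this point only the first element of each source has been consumed.
```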


....

+        try:
+            # We don't need total, but note that it may be either the
+            # file size or '*' if the server can't or doesn't want to
+            # return the file size.
+            start_end, total = values.split('/')
+            start, end = start_end.split('-')
+            start = int(start)
+            end = int(end)
+        except:
+            raise errors.InvalidHttpRange(self._path, content_range,
+                                          "Invalid range values '%s'" % values)

^- I really don't like bare excepts like this (as they even catch
KeyboardInterrupt). I think you only need to catch "ValueError".
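Something along these lines, as a sketch (InvalidRangeError here is a
stand-in for errors.InvalidHttpRange, whose real constructor takes more
arguments):

```python
class InvalidRangeError(Exception):
    """Stand-in for errors.InvalidHttpRange."""

def parse_content_range(values):
    try:
        # total may be the file size, or '*' if the server won't say.
        start_end, total = values.split('/')
        start, end = start_end.split('-')
        return int(start), int(end), total
    except ValueError:
        # All the split()/int() failures raise ValueError, so catching
        # only that covers every malformed input while KeyboardInterrupt
        # and other exceptions still propagate.
        raise InvalidRangeError("Invalid range values '%s'" % values)
```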




For details, see: 
http://bundlebuggy.aaronbentley.com/request/%3Cm2zlwrmrn5.fsf%40free.fr%3E



More information about the bazaar mailing list