[MERGE][bug #173010] initial branch over http: can show no ui activity for a very long time

Vincent Ladeuil v.ladeuil+lp at free.fr
Tue Dec 4 17:26:43 GMT 2007


>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

    john> John Arbash Meinel has voted comment.
    john> Status is now: Semi-approved
    john> Comment:
    john> To start with, this definitely helps the interactivity of fetching over
    john> http. And I would really like to see this in 1.0. Though the
    john> final call is
    john> for Martin to decide.

    john> I'm getting a bit tired now, so I'll have to defer the rest of
    john> the review
    john> until later.

Ok.

    john> This is something I would like to see in 1.0, but I
    john> really think it is the sort of thing that needs to cook
    john> for a bit in bzr.dev before being released. readv() is
    john> pretty much how we handle all of our data over
    john> http.

Yes, I had mixed feelings about that at first, but grew more
confident. There may be some holes in the test suite, but it was
a real pleasure to re-design such a crucial part with such a
safety net.

    john> And we need to know that we will work correctly against
    john> all of the http servers out there.

I'm pretty confident there too; I didn't touch any of the
assumptions made about the http protocol (regarding the headers
needed and used). And since we now use HTTPMessage instead of
parsing (some of) them ourselves, we may be even more correct.
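
For illustration only, here is a minimal sketch of letting the
stdlib parse a header block (the header values are made up, this
is not the bzrlib code):

    from StringIO import StringIO
    import httplib

    # Made-up header block, as received after the status line.
    raw_headers = StringIO(
        "Content-Type: multipart/byteranges; boundary=BOUND\r\n"
        "Content-Length: 1234\r\n"
        "\r\n")
    msg = httplib.HTTPMessage(raw_headers)
    # Header lookups are case-insensitive, one less thing to get
    # wrong when parsing by hand.
    content_type = msg.getheader('content-type')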

But yes, more real life testing will be good (I'd love to set up
a bunch of different http servers above bzr repositories and
automate tests against them, but I lack the time for that).

    john> Including some that are a bit broken (I'm looking at
    john> *you* Cherokee v?, they did get better at least).

    john> If Martin says that bzr.dev is open for future work,
    john> and he will cherry-pick stuff into 1.0, then this is
    john> probably tweak. I need to fine-tooth it to make sure
    john> I've gotten everything I can see. But in general you've
    john> done the testing, and the code is good, etc.

Thanks.

    john> Maybe when I wake up tomorrow, I'll recognize the sheer
    john> genious of the change, and just give you a straight "go
    john> merge it request"...

I'll wait for that while taking your remarks into account.

    john> Some numbers, doing a full branch of:
    john>   http://people.ubuntu.com/~jameinel/test/bzr.dev/
    john> (a pack repository with my local repo copied up. So it has 15 pack
    john> files of various sizes, and lots of plugins mixed with the bzr code.
    john> 100ms ping from here)

    john> bzr.dev + urllib:
    john>     608s (10m8s) peak memory 119.6MB
    john> vila + urllib
    john>     580s (9m40s) peak memory  98.7MB
    john> vila + pycurl
    john>     862s (14m22s) peak memory 100.7MB

I thought increasing the number of requests would increase
wall-clock time, but not by *that* much... This may need to be
run several times.

    john> So the new form slows down pycurl a little bit (but
    john> makes it *much* more interactive). (The progress bar
    john> hangs for a little bit, then updates, then hangs, but
    john> it used to *just* hang for a long time.)

Yeah, that was the point.

    john> The HTTPTransport._readv() function seems to be getting
    john> pretty large. Is it possible to factor more of it out
    john> into helper functions?

    john> You use 'file' as a variable. While valid, I prefer to
    john> avoid using variable names that shadow
    john> builtins. (partially because ViM will highlight them
    john> funny, but also it is usually bad form to shadow a
    john> builtin.)

Bad vila, no sugar.

    john> This doesn't seem right:
    john>          self.send_header("Content-Type",
    john>                           "multipart/byteranges; boundary=%s" %
    john> boundary)
    john>          self.end_headers()
    john> +        self.wfile.write("--%s\r\n" % boundary)
    john>          for (start, end) in ranges:
    john> -            self.wfile.write("--%s\r\n" % boundary)
    john>              self.send_header("Content-type",
    john>                               'application/octet-stream')
    john>              self.send_header("Content-Range", "bytes %d-%d/%d" %
    john> (start,
    john>                                                                    end,

    john> ^- Isn't each range delimited by a boundary? (I haven't checked in a
    john> while, though.)

The patch is tricking you; the code is:

        self.wfile.write("--%s\r\n" % boundary)
        for (start, end) in ranges:
            self.send_header("Content-type", 'application/octet-stream')
            self.send_header("Content-Range", "bytes %d-%d/%d" % (start,
                                                                  end,
                                                                  file_size))
            self.end_headers()
            self.send_range_content(file, start, end - start + 1)
            self.wfile.write("--%s\r\n" % boundary)

So *one* boundary line per range. The server was bogus and emitted two.

Now, I wondered why we didn't catch it earlier, but it was due to
the regexp used which, when presented with
<bline><bline><headers><content>, just ignored the first bline.

The spec is pretty clear about that; I doubt anyone has made the
same mistake in any server.
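
For the record, a hedged sketch of the canonical layout per the
spec (boundary, offsets and data are all made up; this is not the
test server code): one boundary line before each part, the part's
own headers, a blank line, the bytes, and a closing delimiter at
the very end.

    # Build a multipart/byteranges body for two made-up ranges.
    boundary = 'BOUND'
    data = 'x' * 200
    ranges = [(0, 9), (100, 109)]
    parts = []
    for start, end in ranges:
        parts.append('--%s\r\n' % boundary)
        parts.append('Content-Type: application/octet-stream\r\n')
        parts.append('Content-Range: bytes %d-%d/%d\r\n'
                     % (start, end, len(data)))
        parts.append('\r\n')
        parts.append(data[start:end + 1])
        parts.append('\r\n')
    parts.append('--%s--\r\n' % boundary)  # closing delimiter
    body = ''.join(parts)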

    john> For pycurl, what actually makes more sense is to make
    john> the max number of ranges smaller, but the max combine
    john> larger.

This one is simple.

    john> It will decrease the Header overhead, while leaving the
    john> same number of data bytes on the wire. (The real fix is
    john> to also switch to a max number of bytes per request
    john> sort of thing.)

This one is less simple, but it's the right fix. If I don't find
the time to address it, I'll file another bug.
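
A purely illustrative sketch of that "max number of bytes per
request" idea; none of these names or limits exist in bzrlib,
they are just a way to show the shape of the fix:

    def group_offsets(offsets, max_ranges=25, max_bytes=4 * 1024 * 1024):
        """Group sorted (start, length) offsets into per-request batches,
        capping both the number of ranges and the total bytes requested."""
        cur, cur_bytes = [], 0
        for start, length in offsets:
            if cur and (len(cur) >= max_ranges
                        or cur_bytes + length > max_bytes):
                yield cur
                cur, cur_bytes = [], 0
            cur.append((start, length))
            cur_bytes += length
        if cur:
            yield cur

With such a cap in place, lowering the number of ranges per
request for pycurl no longer risks a single request ballooning in
size.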

    john> +        # Seeking to a point between two ranges is possible
    john> (only once) but
    john> +        # reading there is forbidden
    john> +        f.seek(40)

    john> ^- I'm not sure why you need it to be valid to seek to
    john> a bad spot between ranges,

It doesn't need to be valid, but it doesn't need to be forbidden
either; grey area.

    john> and why it would fail a second time....

Because once you cross a range boundary, you're automagically at
the beginning of the next.
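
To make those semantics concrete, here is a toy model; it is
*not* bzrlib's RangeFile, the ranges and data are made up, it
only illustrates why the seek is tolerated while a read at that
spot is not:

    class ToyRangeFile(object):
        """Forward-only reader over sorted, non-overlapping ranges."""

        def __init__(self, ranges):
            self._ranges = ranges  # list of (start, end, data)
            self._pos = 0

        def seek(self, pos):
            if pos < self._pos:
                raise ValueError('forward-only: cannot seek backwards')
            # Only record the position; validity is checked at read time.
            self._pos = pos

        def read(self, size):
            for start, end, data in self._ranges:
                if start <= self._pos < end:
                    offset = self._pos - start
                    chunk = data[offset:offset + size]
                    self._pos += len(chunk)
                    return chunk
            raise ValueError('read at %d falls between ranges' % self._pos)

    f = ToyRangeFile([(0, 20, 'a' * 20), (50, 70, 'b' * 20)])
    f.seek(40)    # tolerated: nothing is checked yet
    # f.read(5)   # would fail: 40 sits in the gap between the ranges
    f.seek(55)    # fine, but the gap is now unreachable (forward-only)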

    john> I would tend not to make this part of the contract (and
    john> test) unless there is a reason.

Well, I wanted the tests to precisely reflect that point so it
gets caught if it changes.

    john> Why did you remove this test:
    john> @@ -604,31 +385,20 @@
    john>          # It is a StringIO from the original data
    john>          self.assertEqual(_full_text_response[2], out.read())

    john> -    def test_missing_response(self):
    john> -        self.assertRaises(errors.NoSuchFile,
    john> -            self.get_response, _missing_response)
    john> -


Because handle_response does not catch errors anymore; clients do
that now.

    john>      def test_single_range(self):
    john>          out = self.get_response(_single_range_response)
    john> -        self.assertIsInstance(out, response.HttpRangeResponse)
    john> -
    john> -        self.assertRaises(errors.InvalidRange, out.read, 20)

HttpRangeResponse required seeking to the right place before
reading; that's not true for RangeFile: when there is a single
range, you can read it right away.

    john>          out.seek(100)
    john>          self.assertEqual(_single_range_response[2], out.read(100))


    john> And why is test_single_range not testing the exception
    john> (does it leave the file dirty?)

Quite the opposite.

    john> You remove a few more of these InvalidRange checks.

I carefully checked that every case tested in the old
implementation was covered in the new one, but don't take my word
for it, you're the reviewer ;-)

    john> ...

    john> -    def test_missing_no_content_type(self):
    john> -        # Without Content-Type we should still raise NoSuchFile
    john> on a 404
    john> -        a_response = _missing_response
    john> -        headers = http._extract_headers(a_response[1],
    john> 'http://missing')
    john> -        del headers['Content-Type']
    john> -        self.assertRaises(errors.NoSuchFile,
    john> -            response.handle_response, 'http://missing',
    john> a_response[0], headers,
    john> -                                      StringIO(a_response[2]))

    john> ^- This test was added, because the response code
    john> originally strictly checked Content-Type and would fail
    john> with something like a KeyError if it wasn't present.

Two things were mixed up: content-type handling and error codes.
Errors are now handled by callers; handle_response should only
receive 200 and 206 and errors out on anything else. So I deleted
the test. But there is test_full_text_no_content_type for the
content-type checking.
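
For illustration, a hedged sketch of that division of labour (the
helper and its exact error mapping are hypothetical, only the
status codes are meaningful here):

    from bzrlib import errors

    def check_http_status(url, code):
        """Hypothetical caller-side check run before handle_response."""
        if code in (200, 206):
            return  # only these bodies are handed to handle_response
        if code == 404:
            raise errors.NoSuchFile(url)
        raise errors.InvalidHttpResponse(url, 'unexpected status %d' % code)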

    john> As you mentioned in your emails, you should use 'izip' rather
    john> than 'imap'
    john> -        for content, f in zip(contents, content_f):
    john> +        # Use itertools.imap() instead of use zip() or map(),
    john> since they fully
    john> +        # evaluate their inputs, the transport requests should
    john> be issued and
    john> +        # handled sequentially (we don't want to force transport
    john> to buffer).
    john> +        for content, f in itertools.imap(None, contents, content_f):

Done.
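
A small illustration of the difference with made-up generators:
zip (and map(None, ...)) exhaust their inputs before the loop
body runs at all, while izip interleaves them:

    import itertools

    def requests():
        for i in range(3):
            print 'issuing request', i
            yield i

    def responses():
        for i in range(3):
            print 'handling response', i
            yield i

    # zip(requests(), responses()) would drive both generators to the end
    # before yielding a single pair; izip pulls one element at a time.
    for req, resp in itertools.izip(requests(), responses()):
        print 'pair', req, resp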

    john> ....

    john> +        try:
    john> +            # We don't need total, but note that it may be
    john> either the file size
    john> +            # or '*' if the server can't or doesn't want to
    john> return the file
    john> +            # size.
    john> +            start_end, total = values.split('/')
    john> +            start, end = start_end.split('-')
    john> +            start = int(start)
    john> +            end = int(end)
    john> +        except:
    john> +            raise errors.InvalidHttpRange(self._path, content_range,
    john> +                                          "Invalid range values
    john> %s'" % values)

    john> ^- I really don't like bare excepts like this (as they even catch
    john> KeyboardInterrupt). I think you only need to catch "ValueError".

Indeed, fixed in the two relevant places.
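
For reference, a minimal sketch of that parse with the narrower
except clause; the names are approximations, not the exact bzrlib
code:

    class InvalidRangeHeader(Exception):
        """Stand-in for bzrlib's InvalidHttpRange in this sketch."""

    def parse_content_range(values):
        # 'values' is the 'start-end/total' part of a Content-Range
        # header; total may be the file size or '*', we don't need it here.
        try:
            start_end, total = values.split('/')
            start, end = start_end.split('-')
            return int(start), int(end)
        except ValueError:
            # Only malformed values land here; a bare except would also
            # have swallowed KeyboardInterrupt and friends.
            raise InvalidRangeHeader('invalid range values %r' % values)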

        Vincent


