[MERGE][bug #173010] initial branch over http: can show no ui activity for a very long time
Vincent Ladeuil
v.ladeuil+lp at free.fr
Tue Dec 4 17:26:43 GMT 2007
>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:
john> John Arbash Meinel has voted comment.
john> Status is now: Semi-approved
john> Comment:
john> To start with, this definitely helps the interactivity of fetching over
john> http. And I would really like to see this in 1.0. Though the
john> final call is
john> for Martin to decide.
john> I'm getting a bit tired now, so I'll have to defer the rest of
john> the review
john> until later.
Ok.
john> This is something I would like to see in 1.0, but I
john> really think it is the sort of thing that needs to cook
john> for a bit in bzr.dev before being released. readv() is
john> pretty much how we handle all of our data over
john> http.
Yes, I had mixed feelings about that at first, but grew more
confident. There may be some holes in the test suite, but it was
a real pleasure to re-design such a crucial part with such a
safety net.
john> And we need to know that we will work correctly against
john> all of the http servers out there.
I'm pretty confident there too; I didn't touch any of the
assumptions made about the http protocol (regarding the headers
needed and used). And since we now use HTTPMessage instead of
parsing (some of) them ourselves, we may be even more correct.
But yes, more real-life testing will be good (I'd love to set up
a bunch of different http servers above bzr repositories and
automate tests against them, but I lack the time for that).
john> Including some that are a bit broken (I'm looking at
john> *you* Cherokee v?, they did get better at least).
john> If Martin says that bzr.dev is open for future work,
john> and he will cherry-pick stuff into 1.0, then this is
john> probably tweak. I need to fine-tooth it to make sure
john> I've gotten everything I can see. But in general you've
john> done the testing, and the code is good, etc.
Thanks.
john> Maybe when I wake up tomorrow, I'll recognize the sheer
john> genious of the change, and just give you a straight "go
john> merge it request"...
Waiting for that while taking your remarks into account.
john> Some numbers, doing a full branch of:
john> http://people.ubuntu.com/~jameinel/test/bzr.dev/
john> (a pack repository with my local repo copied up. So it has 15 pack
john> files of various sizes, and lots of plugins mixed with the bzr code.
john> 100ms ping from here)
john> bzr.dev + urllib:
john> 608s (10m8s) peak memory 119.6MB
john> vila + urllib
john> 580s (9m40s) peak memory 98.7MB
john> vila + pycurl
john> 862s (14m22s) peak memory 100.7MB
I thought increasing the number of requests would increase
wall-clock time, but not by *that* much... Maybe this needs to be
run several times.
john> So the new form slows down pycurl a little bit (but
john> makes it *much* more interactive). (The progress bar
john> hangs for a little bit, then updates, then hangs, but
john> it used to *just* hang for a long time.)
Yeah, that was the point.
john> The HTTPTransport._readv() function seems to be getting
john> pretty large. Is it possible to factor more of it out
john> into helper functions?
john> You use 'file' as a variable. While valid, I prefer to
john> avoid using variable names that shadow
john> builtins. (partially because ViM will highlight them
john> funny, but also it is usually bad form to shadow a
john> builtin.)
Bad vila, no sugar.
john> This doesn't seem right:
john> self.send_header("Content-Type",
john> "multipart/byteranges; boundary=%s" %
john> boundary)
john> self.end_headers()
john> + self.wfile.write("--%s\r\n" % boundary)
john> for (start, end) in ranges:
john> - self.wfile.write("--%s\r\n" % boundary)
john> self.send_header("Content-type",
john> 'application/octet-stream')
john> self.send_header("Content-Range", "bytes %d-%d/%d" %
john> (start,
john> end,
john> ^- Isn't each range delimited by a boundary? (I haven't checked in a
john> while, though.)
The patch is tricking you; the actual code is:
        self.wfile.write("--%s\r\n" % boundary)
        for (start, end) in ranges:
            self.send_header("Content-type", 'application/octet-stream')
            self.send_header("Content-Range", "bytes %d-%d/%d" % (start,
                                                                  end,
                                                                  file_size))
            self.end_headers()
            self.send_range_content(file, start, end - start + 1)
            self.wfile.write("--%s\r\n" % boundary)
So *one* boundary line per range. The server was bogus and wrote
two. Now, I wondered why we didn't catch it earlier, but it was
due to the regexp used which, when presented with
<bline><bline><headers><content>, just ignored the first bline.
The spec is pretty clear about that; I doubt anyone has made the
same mistake in any server.
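To illustrate that failure mode (with a hypothetical pattern, not
bzrlib's actual regexp): a lenient search silently skips the
duplicated boundary line, while a strict match rejects it.

```python
import re

boundary = "3d6b6a416f9b5"
# The bogus server wrote the boundary line twice before the headers:
body = ("--%s\r\n--%s\r\n"
        "Content-Range: bytes 0-9/100\r\n\r\n" % (boundary, boundary))

# A lenient search (illustrative only) just finds the *last* boundary
# preceding the headers, so the duplicate goes unnoticed:
lenient = re.search(r'--%s\r\n(Content-Range: [^\r]+)\r\n' % boundary, body)
print(lenient.group(1))  # Content-Range: bytes 0-9/100

# Strict parsing insists headers directly follow the boundary, so the
# doubled boundary line is detected:
strict = re.match(r'--%s\r\nContent-Range: ' % boundary, body)
print(strict)  # None
```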
john> For pycurl, what actually makes more sense is to make
john> the max number of ranges smaller, but the max combine
john> larger.
This one is simple.
john> It will decrease the Header overhead, while leaving the
john> same number of data bytes on the wire. (The real fix is
john> to also switch to a max number of bytes per request
john> sort of thing.)
This one is less simple, but it's the right one. If I don't find
the time to address it I'll file another bug.
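The "max number of bytes per request" idea could look roughly like
this (a hypothetical sketch; coalesce_offsets and its parameters are
illustrative names, not bzrlib's actual coalescing code): nearby
read requests are merged into one range, and a range is flushed once
it would grow past the byte cap.

```python
def coalesce_offsets(offsets, max_gap=200, max_bytes=4 * 1024 * 1024):
    """Group (start, length) read requests into larger ranges.

    Offsets closer together than max_gap are merged; a merged range
    is emitted once extending it would exceed max_bytes.
    """
    cur_start = cur_end = None
    for start, length in sorted(offsets):
        end = start + length
        if cur_start is None:
            cur_start, cur_end = start, end
        elif (start - cur_end <= max_gap
              and end - cur_start <= max_bytes):
            cur_end = max(cur_end, end)
        else:
            yield (cur_start, cur_end - cur_start)
            cur_start, cur_end = start, end
    if cur_start is not None:
        yield (cur_start, cur_end - cur_start)

print(list(coalesce_offsets([(0, 10), (15, 10), (4000, 10)], max_gap=100)))
# [(0, 25), (4000, 10)]
```

Fewer, larger ranges cut the per-range header overhead while keeping
the same data bytes on the wire, which is exactly the trade-off
discussed above for pycurl.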
john> + # Seeking to a point between two ranges is possible
john> (only once) but
john> + # reading there is forbidden
john> + f.seek(40)
john> ^- I'm not sure why you need it to be valid to seek to
john> a bad spot between ranges,
It doesn't need to be valid, but it doesn't need to be forbidden
either; grey area.
john> and why it would fail a second time....
Because once you cross a range boundary, you're automagically at
the beginning of the next.
john> I would tend not to make this part of the contract (and
john> test) unless there is a reason.
Well, I wanted the tests to precisely reflect that point so it
gets caught if it changes.
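A toy model of those semantics (illustrative only, not bzrlib's
actual RangeFile): reading to the end of one range lands you at the
start of the next, and a seek into the gap between ranges is
tolerated while reading there raises.

```python
class MiniRangeFile:
    """Toy model of multi-range read semantics (not bzrlib's RangeFile)."""

    def __init__(self, ranges):
        # ranges: sorted list of (start, data) pairs, one per HTTP range
        self._ranges = list(ranges)
        self._index = 0
        self._pos = self._ranges[0][0]

    def seek(self, pos):
        # Seeking into the gap between two ranges is tolerated (the
        # grey area above); only *reading* there is an error.
        self._pos = pos

    def read(self, size):
        start, data = self._ranges[self._index]
        if not (start <= self._pos < start + len(data)):
            raise IOError("read at %d is outside the current range"
                          % self._pos)
        offset = self._pos - start
        chunk = data[offset:offset + size]
        self._pos += len(chunk)
        if (self._pos == start + len(data)
                and self._index + 1 < len(self._ranges)):
            # Crossing a range boundary lands at the start of the
            # next range, as described above.
            self._index += 1
            self._pos = self._ranges[self._index][0]
        return chunk

f = MiniRangeFile([(0, b"0123456789"), (100, b"abcdefghij")])
print(f.read(10))  # b'0123456789'
print(f.read(5))   # b'abcde' (position jumped to the second range)
f.seek(40)         # allowed: grey area between ranges
# f.read(1) would now raise IOError
```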
john> Why did you remove this test:
john> @@ -604,31 +385,20 @@
john> # It is a StringIO from the original data
john> self.assertEqual(_full_text_response[2], out.read())
john> - def test_missing_response(self):
john> - self.assertRaises(errors.NoSuchFile,
john> - self.get_response, _missing_response)
john> -
Because handle_response does not catch errors anymore; clients do
that now.
john> def test_single_range(self):
john> out = self.get_response(_single_range_response)
john> - self.assertIsInstance(out, response.HttpRangeResponse)
john> -
john> - self.assertRaises(errors.InvalidRange, out.read, 20)
HttpRangeResponse required seeking to the right place before
reading; that's not true for RangeFile: when there is a single
range, you can read it right away.
john> out.seek(100)
john> self.assertEqual(_single_range_response[2], out.read(100))
john> And why is test_single_range not testing the exception
john> (does it leave the file dirty?)
Quite the opposite.
john> You remove a few more of these InvalidRange checks.
I carefully checked that every case tested in the old
implementation was covered in the new, but don't take my word on
it, you're the reviewer ;-)
john> ...
john> - def test_missing_no_content_type(self):
john> - # Without Content-Type we should still raise NoSuchFile
john> on a 404
john> - a_response = _missing_response
john> - headers = http._extract_headers(a_response[1],
john> 'http://missing')
john> - del headers['Content-Type']
john> - self.assertRaises(errors.NoSuchFile,
john> - response.handle_response, 'http://missing',
john> a_response[0], headers,
john> - StringIO(a_response[2]))
john> ^- This test was added, because the response code
john> originally strictly checked Content-Type and would fail
john> with something like a KeyError if it wasn't present.
Two things were mixed up: content-type handling and error codes.
Errors are now handled by callers; handle_response should only
receive 200 and 206 and errors out on anything else. So I deleted
the test. But there is test_full_text_no_content_type for the
content-type checking.
john> As you mentioned in your emails, you should use 'izip' rather
john> than 'imap'
john> - for content, f in zip(contents, content_f):
john> + # Use itertools.imap() instead of use zip() or map(),
john> since they fully
john> + # evaluate their inputs, the transport requests should
john> be issued and
john> + # handled sequentially (we don't want to force transport
john> to buffer).
john> + for content, f in itertools.imap(None, contents, content_f):
Done.
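For reference, the laziness distinction the quoted comment relies on
can be demonstrated directly (in Python 2, bzr's target at the time,
zip() and map() are eager and itertools.izip/imap are the lazy
variants; Python 3's zip() is itself lazy, which this sketch uses):

```python
consumed = []

def transport_responses():
    # Hypothetical generator standing in for the sequential transport
    # requests: each yield represents one request actually hitting
    # the wire.
    for i in range(3):
        consumed.append(i)
        yield "response-%d" % i

contents = ["a", "b", "c"]

# An eager pairing (Python 2 zip()/map()) would exhaust the generator
# up front, forcing the transport to buffer every response.  Lazy
# pairing issues each request only when its pair is consumed:
pairs = zip(contents, transport_responses())
assert consumed == []             # no request issued yet
content, f = next(pairs)
assert (content, f) == ("a", "response-0")
assert consumed == [0]            # exactly one request issued
```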
john> ....
john> + try:
john> + # We don't need total, but note that it may be
john> either the file size
john> + # or '*' if the server can't or doesn't want to
john> return the file
john> + # size.
john> + start_end, total = values.split('/')
john> + start, end = start_end.split('-')
john> + start = int(start)
john> + end = int(end)
john> + except:
john> + raise errors.InvalidHttpRange(self._path, content_range,
john> + "Invalid range values
john> %s'" % values)
john> ^- I really don't like bare excepts like this (as they even catch
john> KeyboardInterrupt). I think you only need to catch "ValueError".
Indeed, fixed in the two relevant places.
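With the bare except narrowed to ValueError, the parsing above might
be sketched as follows (parse_content_range is an illustrative name,
not bzrlib's actual helper):

```python
def parse_content_range(content_range):
    """Parse a Content-Range header value like 'bytes 0-99/1234'.

    Catches ValueError only, so KeyboardInterrupt and other
    exceptions propagate as they should.
    """
    scheme, values = content_range.split(None, 1)
    if scheme != 'bytes':
        raise ValueError("Unsupported range scheme: %r" % scheme)
    try:
        # total is unused by range bookkeeping, but note that it may
        # be either the file size or '*' if the server can't or
        # doesn't want to return the file size.
        start_end, total = values.split('/')
        start, end = start_end.split('-')
        start = int(start)
        end = int(end)
    except ValueError:
        raise ValueError("Invalid range values '%s'" % values)
    return start, end, total

print(parse_content_range('bytes 0-99/1234'))  # (0, 99, '1234')
```

A malformed value such as 'bytes garbage' fails the inner unpacking
and is re-raised as a ValueError with a readable message, rather
than swallowing unrelated exceptions the way a bare except would.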
Vincent