[UGLY HACK] Proof of concept multipart/byteranges support and connection sharing

John Arbash Meinel john at arbash-meinel.com
Mon May 22 14:32:37 BST 2006


Michael Ellerman wrote:
> On Sat, 2006-05-20 at 20:57 +1000, Robert Collins wrote:
>> On Fri, 2006-05-19 at 19:15 +1000, Michael Ellerman wrote:
>>> Hi guys,
>>
>>> I didn't really believe it made this much difference, but I've run these
>>> a few times, and I think I'm not going crazy. (this is just "time bzr
>>> branch foo bar").
>> Its real. Latency bites badly.
> 
> Yeah, I never doubted it, but it's nice to have numbers to show just how
> bad it is.
> 
>>> [2] Unfortunately we're creating about 15 PyCurlTransport() objects,
>>> so to see much improvement we have to share the Curl() object
>>> globally. Yuck. Also, it seems (??) you can't unset
>>> pycurl.RANGE/NOBODY, so we have to have three Curl() objects: one for
>>> GET, one for HEAD, and one for GET + Range.
>> I'm not sure why we have to share the curl objects specially - unless
>> you mean for the connection sharing. If that's so, I'd introduce an
>> HttpClient object or something that is shared between the
>> PyCurlTransports. It could then hold the 3 curl objects needed. I'm
>> suggesting that get_transport(http://...) would always have the effect
>> of making a new one of these.
> 
> Yeah, the Curl() stuff is purely for connection sharing. I don't know
> how the transport code works, but from a quick glance, if we create a
> new HttpClient() for each get_transport(..), isn't that equivalent to
> having one Curl() per PyCurlTransport()? If so, that doesn't get us as
> big a win, because we create lots of transports.

That is how the current code works. But both ftp and sftp cache their
connections, so it seems reasonable that HTTP could do the same thing.
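
For the record, the sort of shared cache Robert is suggesting might look
roughly like this (an untested sketch in modern Python; the names here
are mine, not anything that exists in bzrlib):

    import threading
    import pycurl

    # Reusable Curl handles, keyed by host and request shape.  One
    # handle each for GET, HEAD, and ranged GET, since (as noted above)
    # RANGE/NOBODY apparently can't be cleanly unset once applied.
    _curl_lock = threading.Lock()
    _curl_handles = {}   # (scheme, host, kind) -> pycurl.Curl

    def get_curl(scheme, host, kind):
        """Return a cached Curl handle for this host and request kind.

        Reusing the same handle lets libcurl keep the TCP connection
        alive across requests, which is where the latency win is.
        """
        key = (scheme, host, kind)
        with _curl_lock:
            curl = _curl_handles.get(key)
            if curl is None:
                curl = pycurl.Curl()
                if kind == 'HEAD':
                    curl.setopt(pycurl.NOBODY, 1)
                _curl_handles[key] = curl
            return curl

Every PyCurlTransport for the same host would then pull its handles from
this one cache, rather than each transport owning its own Curl() objects.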

> 
> As another data point, doing the byterange stuff without the connection
> sharing gets me these rough numbers:
> 
> real    29m8.436s
> real    31m13.397s
> real    29m5.429s
> 
> So it's definitely helping, although the bulk of the improvement is the
> byterange stuff.

Well, what I saw was roughly this:

  current urllib:       50m
  pycurl:               50m
  byterange:            30m
  byterange + sharing:  20m

Is that about right?

> 
>> Just having good range support would rock. It's been on my TODO for a
>> bit. Things to watch out for in the multipart response: web servers may
>> return the full object, or may return combined ranges - IIRC the bytes
>> can't be reordered from the requested range, though. You'll need to
>> check RFC 2616 on that.
> 
> Yeah, there are lots of corner cases. I think I already handle the
> full-result case, as long as we get code 200 back rather than 206.
> Reordering would break that code in a jiffy. Before writing a proper
> version I'd like to check what Twisted does, and/or any other
> implementations.
> 
> cheers
> 
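
For reference, parsing the multipart/byteranges body of a 206 response
might look something like this (again an untested sketch in modern
Python, names mine; see RFC 2616 sec. 19.2 for the wire format):

    import re

    def parse_byteranges(content_type, body):
        """Yield (start, end, data) for each part of a
        multipart/byteranges body.  `content_type` is the Content-Type
        header value of the 206 response, `body` the raw payload bytes.
        """
        m = re.search(r'boundary=("?)([^";]+)\1', content_type)
        if m is None:
            raise ValueError('no boundary in %r' % content_type)
        boundary = b'--' + m.group(2).encode('ascii')
        # Drop the preamble before the first delimiter and the epilogue
        # after the closing "--" delimiter.
        for part in body.split(boundary)[1:-1]:
            headers, sep, data = part.lstrip(b'\r\n').partition(b'\r\n\r\n')
            if not sep:
                continue
            cr = re.search(rb'Content-Range:\s*bytes (\d+)-(\d+)/(\d+|\*)',
                           headers, re.IGNORECASE)
            if cr is None:
                raise ValueError('part without a Content-Range header')
            start, end = int(cr.group(1)), int(cr.group(2))
            # Slice to the declared length instead of stripping the
            # trailing CRLF, in case the data itself ends in those bytes.
            yield start, end, data[:end - start + 1]

Matching each part by its Content-Range, rather than by the order the
ranges were requested in, should also cover the case where the server
coalesces overlapping ranges. A plain 200 still means the server ignored
the Range header, and a single-range 206 carries one Content-Range
header with no multipart wrapper, so both of those need separate
handling.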

In general, thanks for looking into this. There looks to be a lot of
potential here. (And we still haven't had to do anything custom on the
server side.)

John
=:->
