[RFC] Multipart support for _urllib_

John Arbash Meinel john at arbash-meinel.com
Sat Jun 17 17:26:09 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Michael Ellerman wrote:

> Anyway you can use 'setopt(HEADERFUNCTION, sio.write)' which will let
> you read in the header values.
> On top of that, you can then use the mimetools package to parse
> everything into a dictionary like object.
> This makes it quite nice.
> 
> I haven't figured out exactly how to interface it with your changes, but
> my http branch now supports getting any headers out of curl. (It
> probably does too much right now, but I'll live with it for the moment).
> 
> http://bzr.arbash-meinel.com/branches/bzr/http/
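
For reference, the HEADERFUNCTION trick described above can be sketched like this. (The response lines here are made up; mimetools was the Python 2 module, email.parser is its modern equivalent. The pycurl call is shown only in a comment.)

```python
# Sketch of collecting response headers via a write callback, then
# parsing them into a dict-like object.  With pycurl the callback is
# registered as:  curl.setopt(pycurl.HEADERFUNCTION, header_buf.write)
# and curl calls it once per received header line.
import io
from email.parser import Parser

header_buf = io.StringIO()

def simulate_response(write):
    # Stand-in for curl invoking the HEADERFUNCTION callback; these
    # header lines are illustrative, not real output.
    for line in (
        "HTTP/1.1 206 Partial Content\r\n",
        "Content-Type: multipart/byteranges; boundary=BOUND\r\n",
        "Content-Length: 1234\r\n",
        "\r\n",
    ):
        write(line)

simulate_response(header_buf.write)

# Drop the status line, then parse the rest into a Message, which
# behaves like a case-insensitive dictionary of headers.
raw = header_buf.getvalue()
status_line, _, header_text = raw.partition("\r\n")
headers = Parser().parsestr(header_text, headersonly=True)

print(status_line)             # HTTP/1.1 206 Partial Content
print(headers["content-type"])
```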


So I worked out the rest of the details: I refactored the _urllib stuff
so that it lives in the base class, and reworked the pycurl stuff so
that it uses the new readv and _handle_response functions.

Anyway, let's recap my previous timing tests:

urllib get:         18m11s
wget -r -np:        10m54s
pycurl get:         18m38s
modified pycurl:    11m17s
Ellerman's urllib:  10m14s ***


Now my max download speed is 160KB/s, so theoretically I could download
all 30MB of bzr.dev in about 3 minutes. So we still have some room for
improvement. But after my combined changes we now have:

pycurl w/ multirange support:	7m24s

So we are now down to 40% of the original time (about 2.5x faster).
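
The back-of-the-envelope numbers above, worked out explicitly (160KB/s
and 30MB are the figures quoted earlier; the timings are from the table):

```python
# Theoretical best-case download time, given the link speed.
link_kb_s = 160.0   # max download speed, KB/s
tree_mb = 30.0      # size of bzr.dev, MB

theoretical_s = tree_mb * 1024 / link_kb_s
print("theoretical best: %.1f minutes" % (theoretical_s / 60))  # 3.2

# Speedup of multirange pycurl over the original pycurl get.
before = 18 * 60 + 38   # pycurl get, seconds
after = 7 * 60 + 24     # pycurl w/ multirange support, seconds
print("fraction of old time: %.0f%%" % (100.0 * after / before))  # 40%
print("speedup: %.1fx" % (before / after))                        # 2.5x
```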

And we are now utilizing 50% of the available bandwidth, which is (IMO)
very good. A 'pull' probably doesn't fare as well as an initial 'get',
but getting this good at the initial get is probably as good as we can
do. Heck, we're faster than 'wget', and that's while parsing everything
to figure out what we need to fetch next.

Now, with a theoretical smart server, which can recompress things on the
fly:

current size of .bzr => 26MB
expanded size of .bzr => 103MB
(zcat all of the .knit files, cat everything else)
bzip2 of expanded texts => 7.9MB

So a smart server could theoretically send only one third the number of
bytes over the wire, which could speed us up some more. (That gives us a
maximum headroom of approximately 6x over the current dumb protocol
code: 2x from bandwidth, 3x from compression. Recompressing on the fly
would make things slower on a local network, though.)
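
Making that headroom estimate explicit (the sizes in MB are the measured
figures given above):

```python
# Smart-server headroom: how much less data could go over the wire,
# and the combined gain on top of the bandwidth improvement.
current_mb = 26.0       # .bzr as stored (compressed knits)
recompressed_mb = 7.9   # bzip2 of the fully expanded texts

compression_gain = current_mb / recompressed_mb   # ~3.3x fewer bytes
bandwidth_gain = 2.0    # we currently use ~50% of the link

print("over the wire: ~1/%.0f of today's bytes" % compression_gain)
print("max combined headroom: ~%.1fx" % (compression_gain * bandwidth_gain))
```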

Now my branch could probably be cleaned up a little bit, and we would
want to write some tests for the range stuff. But the selftests pass,
and we can do a 'get' of bzr.dev.

I'm not sure how we want to write multirange support into
SimpleHTTPServer, but we might consider merging this anyway.
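
For anyone wanting to add that to SimpleHTTPServer: a multi-range
request like "Range: bytes=0-9,20-29" gets a multipart/byteranges
response body. A minimal sketch of building one (the boundary string
and data are made up; this is not bzr's implementation):

```python
# Build a multipart/byteranges body for a list of inclusive byte
# ranges, as an HTTP server would for a 206 Partial Content response.
def multipart_byteranges(data, ranges, boundary="BOUNDARY"):
    total = len(data)
    parts = []
    for start, end in ranges:   # inclusive byte offsets
        parts.append(
            "--%s\r\n"
            "Content-Type: application/octet-stream\r\n"
            "Content-Range: bytes %d-%d/%d\r\n"
            "\r\n" % (boundary, start, end, total)
        )
        parts.append(data[start:end + 1] + "\r\n")
    parts.append("--%s--\r\n" % boundary)   # closing boundary
    return "".join(parts)

body = multipart_byteranges("0123456789" * 3, [(0, 9), (20, 29)])
print(body)
```

The Content-Type header of the whole response would then carry the same
boundary, e.g. "multipart/byteranges; boundary=BOUNDARY".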

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFElC0hJdeBCYSNAAMRAqypAKCpLAm739oaTvCGpxSpVNEH+rNQLACfc34E
wwSifCRGg09MD9fqnGnaXqE=
=BJ/8
-----END PGP SIGNATURE-----
-------------- next part --------------
Attachment: pycurl-multirange.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20060617/d8cedd9e/attachment.diff

