[RFC] Multipart support for _urllib_

Sat Jun 17 14:23:58 BST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Ellerman wrote:
> On 6/17/06, Michael Ellerman <michael at ellerman.id.au> wrote:
>> Hi guys,
>>
>> I've cleaned up the work I did to do multipart HTTP. Actually I
>> started from scratch, I think it's reasonably readable now.
> 
> As an FYI, while testing this I got curious about where our bottleneck is.
> 
> With my code I can branch bzr.dev from bazaar-vcs.org in around 21
> minutes. That's somewhere around 30 MB of data.
> 
> I can pull the inventory.knit (7MB) in ~40s, which would suggest I
> could pull a 30MB blob in about 170 seconds, or a bit under 3 minutes.
> 
> If I capture all the GETs we're doing and ask wget to do them instead
> (with -i, so one connection), it takes ~11 minutes.
> 
> So we could still do a bunch of work to speed up branching, but it
> seems that having to do so many individual requests imposes a limit on
> us.
> 
> cheers

Thanks for doing this. I had some questions about alternate download
methods:

Since an initial get is really just downloading everything, how long
does it take if you just do a recursive wget of
http://bazaar-vcs.org/bzr/bzr.dev/.bzr/

I also assume that you are using a local caching DNS server, since that
can be one of the real banes of urllib.

This is what my timing tests turned up:

urllib get:         18m11s
wget -r -np:        10m54s
pycurl get:         18m38s
modified pycurl:    11m17s
Ellerman's urllib:  10m14s ***
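
To make the comparison easier to read, here is a small sketch that
converts the timings above to seconds and expresses each one relative to
the plain urllib run. The helper name to_seconds is made up for this
example; the numbers are just the ones from the table.

```python
def to_seconds(stamp):
    """Convert a 'XXmYYs' string such as '18m11s' to whole seconds."""
    minutes, _, rest = stamp.partition("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


# Timings copied from the table above.
timings = {
    "urllib get": "18m11s",
    "wget -r -np": "10m54s",
    "pycurl get": "18m38s",
    "modified pycurl": "11m17s",
    "Ellerman's urllib": "10m14s",
}

baseline = to_seconds(timings["urllib get"])
for name, stamp in timings.items():
    secs = to_seconds(stamp)
    # Ratio > 1 means faster than the plain urllib run.
    print("%-18s %5ds  %.2fx" % (name, secs, baseline / secs))
```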

Now, I should have turned off directories for wget, but otherwise it
should be okay.
My modified version of pycurl uses 2 Curl objects: one for range
requests, and one for full requests. It does this because with my
version of pycurl (7.13.1), trying to use setopt(RANGE, None) raises an
'invalid parameter' error, and setopt(RANGE, '') raises an HTTP 416
error (range cannot be satisfied).
There is also the function 'unsetopt()', but that doesn't seem to be
supported for the RANGE entry.
I only have 7.13.1, but I downloaded the source code to 7.15.2, and the
two seem to be very similar (at least in this respect).
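
The two-object workaround can be sketched roughly like this. It is not
the code from my branch, just an illustration of the idea: the class
name HandlePair and its prepare() method are invented for the example,
and the pycurl module is passed in as an argument so the selection logic
can be exercised without libcurl. Only Curl(), setopt(), RANGE and URL
are used from pycurl.

```python
class HandlePair:
    """Two reusable handles: one that always has RANGE set, one that never does.

    Hypothetical helper; `curl_module` is assumed to be the imported
    `pycurl` module (or a stand-in exposing Curl, RANGE and URL).
    """

    def __init__(self, curl_module):
        self._m = curl_module
        self._full = curl_module.Curl()    # never gets a RANGE option
        self._ranged = curl_module.Curl()  # RANGE is overwritten on every use

    def prepare(self, url, byte_range=None):
        """Return the handle that should be used to perform this request."""
        if byte_range is None:
            # Full download: use the handle whose RANGE was never set,
            # so there is nothing to clear.
            handle = self._full
        else:
            handle = self._ranged
            first, last = byte_range
            # libcurl expects the range as "first-last" (inclusive offsets).
            handle.setopt(self._m.RANGE, "%d-%d" % (first, last))
        handle.setopt(self._m.URL, url)
        return handle
```

With the real library this would be created as HandlePair(pycurl), and
the caller would still have to set the output options and call perform()
on the returned handle. Both handles stay alive between requests, which
is where the connection re-use comes from.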

Michael mentioned that he didn't see any way to get the Content-range
header out. The only thing I've seen so far is to use WRITEDATA and
WRITEHEADER functions, rather than using WRITEFUNCTION (WRITEFUNCTION
is incompatible with WRITEHEADER).
That should give us 2 streams, one with the header information, and one
with the content itself.
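
Once the header stream is captured separately, pulling Content-Range out
of it is straightforward. Below is a minimal sketch that assumes the
header stream has already been collected into a str or bytes buffer; it
does not touch pycurl at all, and the function name
content_range_from_headers is made up for the example.

```python
import re

# Matches e.g. "bytes 100-199/7340032" or "bytes 0-0/*".
_CONTENT_RANGE = re.compile(r"bytes\s+(\d+)-(\d+)/(\d+|\*)", re.IGNORECASE)


def content_range_from_headers(raw_headers):
    """Return (first, last, total) from a raw header block, or None.

    `total` is None when the server reports the length as '*'.
    """
    if isinstance(raw_headers, bytes):
        # Header bytes are decoded as latin-1 so that no byte can fail.
        raw_headers = raw_headers.decode("iso-8859-1")
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-range":
            match = _CONTENT_RANGE.match(value.strip())
            if match:
                total = None if match.group(3) == "*" else int(match.group(3))
                return int(match.group(1)), int(match.group(2)), total
    return None
```

The status line has no colon, so it is skipped by the same check that
skips blank lines.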

The working branch for my changes (integrated with Michael's) is at:
http://bzr.arbash-meinel.com/branches/bzr/http/

Considering that pycurl with re-used objects is almost as fast as a
recursive wget, and that with Ellerman's work a plain 'urllib' set of
requests is actually faster than wget, we have some good stuff to
look into.

I think if we update pycurl so that we can use multipart range requests,
we'll have gotten as far as we can on plain http speed, without async or
a smart server. That is why I think the next step is to get a version
of bundles which can send the knit chunks directly, and a smart server
which can serve these up as a bzip2 stream. The simplest server would be
'ssh host bzr bundle --raw --base-id=X --local-path=Y' | bzr apply-bundle
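
For reference, the multipart range request mentioned above boils down to
asking for several byte ranges in a single Range header. Here is a
minimal sketch using Python's standard urllib.request. This is not
Michael's implementation; the function name multi_range_request is
invented for the example, and it only builds the request without sending
it.

```python
from urllib.request import Request


def multi_range_request(url, ranges):
    """Build a GET that asks for several byte ranges in one request.

    `ranges` is an iterable of (first, last) inclusive byte offsets.
    """
    spec = ",".join("%d-%d" % (first, last) for first, last in ranges)
    return Request(url, headers={"Range": "bytes=" + spec})


# Example with a placeholder URL and arbitrary offsets.
req = multi_range_request("http://example.com/inventory.knit",
                          [(0, 99), (500, 599)])
print(req.get_header("Range"))
```

A server that honours the header should answer 206 with a
multipart/byteranges body, but it is also allowed to ignore the header
and send the whole file with a 200, so the client has to handle both.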

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFElAJuJdeBCYSNAAMRAmyPAJ47AlDpjwuyxF2zZRY5iasaLQH4mQCfaoY1
mHS4fyAQgfDZZGNAyk6IbSA=
=nukG
-----END PGP SIGNATURE-----