[RFC] Multipart support for _urllib_

Michael Ellerman michael at ellerman.id.au
Wed Jun 21 00:30:03 BST 2006


On 6/21/06, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Martin Pool wrote:
> > On 18/06/2006, at 2:26 AM, John Arbash Meinel wrote:
> >>
> >> Anyway, lets recap my previous timing tests:
> >>
> >> urllib get:    18m11s
> >> wget -r -np:    10m54s
> >> pycurl get:    18m38s
> >> modified pycurl:11m17s
> >> Ellerman's urllib:10m14s ***
> >>
> >>
> >> Now my max download speed is 160KB/s, so theoretically I can download
> >> all 30MB of bzr.dev in 3 minutes. So we still have some room for
> >> improvement. But after my combined changes we now have:
> >>
> >> pycurl w/ multirange support:    7m24s
> >>
> >> So we are now down to 40% (2.5x faster).
> >
> > That is indeed impressive.  I wonder if we can get down towards that
> > number by just progressively replacing things in urllib and not
> > depending on pycurl?
>
> I think pycurl gives us connection sharing/keep alive. Which becomes
> even more important over SSL.
> I'm a little curious if performance would be better if we shared the
> range-request object with the full-request object, but I have the
> feeling we tend to use either readv() or get(). We probably don't mix
> them around much.

It'd be nice if we didn't need pycurl. It has some nasty failure
modes, ie. you can't ctrl-c bzr in certain places 'cause it's stuck in
pycurl. I tend to see that a bit at home 'cause my router has
bodgy/slow dns. Connection sharing's pretty important though.

> >
> > It looks reasonable to come in but it does reduce add some untested
> > code.  One option would be to add range support to the server as you say
> > but perhaps instead we could just have unit tests for each aspect:
> > generating and parsing multipart bodies and so on.
>
> Well, that shouldn't be too bad. How much testing do you want in place
> before it is considered worthy of merging? (I would like to get some
> testing as well, but I have some other major focuses this week, and
> would like to see it merged soonish).

I'll try and get some tests written this week .. sometime. When I last
looked there wasn't much testing of the readv() implementations that I
could see. I think we want to have a couple of decent blackbox tests
that do pull/branch of regular and repo branches over the various
transports.

I think it'd be also good if we could collect some stock responses
from various HTTP servers for various range requests, and test against
those. My expectation is Apache's the only one that will actually give
us back ranges, and others might even break, so we'll need some
fallback mechanism.

> >> I'm not sure how we want to write multirange support into
> >> SimpleHTTPServer, but we might consider merging this anyway.

I think that should be doable, a naive implementation at least.

> >> +        raise BzrError("HTTP couldn't handle code %s", response.code)
> >
> > Perhaps this should be a TransportError instead, and include the status
> > text from the response?
>
> Either TransportError or a PathError.

I'm not sure what kind of error, but for testing I actually had code
there to dump the response to a file. It might be better to drop it in
.bzr.log than onto the user though I think.

> >> +    def _offsets_to_ranges(self, offsets):
> >
> > It's not clear from the docstring what the difference is between
> > "offsets" and "byte ranges" in this context or whether there is a
> > difference.  It looks like you just mean to join up adjacent ranges (?)
> > - if so, say so.
>
> Requests to readv are given in 'start, length', but ranges are from
> 'start, end'. It collapses overlapping/adjacent ranges (I could argue we
> could go one better and collapse if they are 'close enough').
> Probably an example makes it the most obvious (and we can add this to
> the doc string):
>
> Examples:
>   [(5, 3)] => [(5, 8)]
>   [(5, 3), (6, 2)] = [(5,8)]
>   [(5, 3), (8, 2)] = [(5,8), (8,10)]
>   [(5, 3), (2, 1)] = [(2,3), (5,8)]

Yep. "Offsets" are the readv format of (start, length), "Ranges" is my
contraction of "Byte-ranges" as the HTTP spec defines them, ie.
(start, end) inclusive.

I thought about collapsing "close enough" ranges, but the million
dollar question is how close is "enough"? On a fast link it might be a
matter of MB, whereas on a modem it might be a matter of bytes. We
might be able to come to a compromise though. In practice I was seeing
most readv()s collapse into 1 range anyway.

cheers




More information about the bazaar mailing list