[RFC] Multipart support for _urllib_
John Arbash Meinel
john at arbash-meinel.com
Tue Jun 20 15:25:04 BST 2006
Martin Pool wrote:
> On 18/06/2006, at 2:26 AM, John Arbash Meinel wrote:
>>
>> Anyway, let's recap my previous timing tests:
>>
>> urllib get: 18m11s
>> wget -r -np: 10m54s
>> pycurl get: 18m38s
>> modified pycurl: 11m17s
>> Ellerman's urllib: 10m14s ***
>>
>>
>> Now my max download speed is 160KB/s, so theoretically I can download
>> all 30MB of bzr.dev in 3 minutes. So we still have some room for
>> improvement. But after my combined changes we now have:
>>
>> pycurl w/ multirange support: 7m24s
>>
>> So we are now down to 40% of the original time (2.5x faster).
>
> That is indeed impressive. I wonder if we can get down towards that
> number by just progressively replacing things in urllib and not
> depending on pycurl?
I think pycurl gives us connection sharing/keep-alive, which becomes
even more important over SSL.
I'm a little curious if performance would be better if we shared the
range-request object with the full-request object, but I have the
feeling we tend to use either readv() or get(). We probably don't mix
them around much.
>
>> Now, with a theoretical smart server, which can recompress things on the
>> fly:
>>
>> current size of .bzr => 26MB
>> expanded size of .bzr => 103MB
>> (zcat all of the .knit files, cat everything else)
>> bzip2 of expanded texts => 7.9MB
>>
>> So a smart server could theoretically send only about a third of the
>> bytes over the wire, which could speed us up some more. (That gives us
>> a maximum headroom of roughly 6x over the current dumb-protocol code:
>> 2x from bandwidth and 3x from compression, though recompressing would
>> make things slower on a local network.)
>
> I've started on one; I'll post it later on.
>
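As a quick sanity check of the figures quoted above, here is the arithmetic as a small sketch. The numbers are the ones from the thread; the only assumption added is 1 MB = 1024 KB.

```python
# Figures quoted in the thread; only the arithmetic is new.
link_kb_per_s = 160          # stated maximum download speed
branch_mb = 30               # stated size of bzr.dev
current_s = 7 * 60 + 24      # pycurl w/ multirange support: 7m24s

# Time to move the branch at full line rate (assumes 1 MB = 1024 KB).
ideal_s = branch_mb * 1024 / link_kb_per_s
bandwidth_headroom = current_s / ideal_s

stored_mb = 26               # current size of .bzr
bzip2_mb = 7.9               # bzip2 of the expanded texts
compression_gain = stored_mb / bzip2_mb

print(f"ideal transfer: {ideal_s / 60:.1f} min")
print(f"bandwidth headroom: {bandwidth_headroom:.1f}x")
print(f"compression gain: {compression_gain:.1f}x")
```

That comes out at roughly 2x from bandwidth and 3x from compression, which is where the "approximately 6x" estimate comes from.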
>> Now my branch could probably be cleaned up a little bit, and we would
>> want to write some tests for the range stuff. But the selftests pass,
>> and we can do a 'get' of bzr.dev.
>
> It looks reasonable to come in but it does add some untested
> code. One option would be to add range support to the server as you say
> but perhaps instead we could just have unit tests for each aspect:
> generating and parsing multipart bodies and so on.
Well, that shouldn't be too bad. How much testing do you want in place
before it is considered worthy of merging? (I would like to get some
testing as well, but I have some other major focuses this week, and
would like to see it merged soonish).
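To give an idea of what a unit test for "generating and parsing multipart bodies" might look like without a range-capable server, here is a minimal sketch. It is not the code in my branch: `build_byteranges`, `parse_byteranges` and the boundary string are made-up names for illustration, and the parsing leans on the standard library's email parser as a stand-in. For binary knit data a hand-written parser may be the safer choice.

```python
from email import message_from_bytes


def build_byteranges(data, ranges, boundary="sketch-boundary"):
    """Build a multipart/byteranges body (hypothetical test helper).

    ranges holds inclusive (first, last) byte positions, as HTTP uses.
    """
    chunks = []
    for first, last in ranges:
        header = f"Content-Range: bytes {first}-{last}/{len(data)}\r\n\r\n"
        chunks.append(b"--" + boundary.encode("ascii") + b"\r\n"
                      + header.encode("ascii")
                      + data[first:last + 1] + b"\r\n")
    chunks.append(b"--" + boundary.encode("ascii") + b"--\r\n")
    content_type = f"multipart/byteranges; boundary={boundary}"
    return content_type, b"".join(chunks)


def parse_byteranges(content_type, body):
    """Return [(first, last, data)] from a multipart/byteranges body."""
    # Prepend the Content-Type header so the parser can find the boundary.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    msg = message_from_bytes(raw)
    if not msg.is_multipart():
        raise ValueError("not a multipart body")
    parts = []
    for part in msg.get_payload():
        # Content-Range looks like "bytes 5-7/100".
        _unit, _, spec = part["Content-Range"].partition(" ")
        span, _, _total = spec.partition("/")
        first, _, last = span.partition("-")
        parts.append((int(first), int(last), part.get_payload(decode=True)))
    return parts


data = b"abcdefghijklmnopqrstuvwxyz"
ctype, body = build_byteranges(data, [(0, 4), (10, 12)])
print(parse_byteranges(ctype, body))
```

A round trip like this exercises both directions in memory, so it would not depend on SimpleHTTPServer growing range support first.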
>
>> I'm not sure how we want to write multirange support into
>> SimpleHTTPServer, but we might consider merging this anyway.
>
...
>> + raise BzrError("HTTP couldn't handle code %s", response.code)
>
> Perhaps this should be a TransportError instead, and include the status
> text from the response?
Either TransportError or a PathError.
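Something along these lines, as a rough sketch only. The `TransportError` class here is a local stand-in so the snippet runs on its own, `check_response` is a made-up helper name, and treating only 200 and 206 as success is an assumption for the example.

```python
class TransportError(Exception):
    """Stand-in error class so this sketch is self-contained."""


def check_response(code, reason, path):
    """Raise for unexpected HTTP status codes, keeping the status text.

    code and reason are the numeric status and its text, e.g.
    206 and "Partial Content". (Hypothetical helper.)
    """
    # 200 is a full response, 206 is a partial (range) response.
    if code in (200, 206):
        return
    raise TransportError(f"HTTP error for {path}: {code} {reason}")


try:
    check_response(404, "Not Found", "inventory.knit")
except TransportError as e:
    print(e)
```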
>
>> def put(self, relpath, f, mode=None):
>> """Copy the file-like or string object into the location.
>> @@ -341,6 +329,51 @@
>> else:
>> return self.__class__(self.abspath(offset))
>>
>> + def _offsets_to_ranges(self, offsets):
>> + """Turn a list of offsets and sizes into a list of byte ranges.
>> +
>> + :param offsets: A list of tuples of (start, size).
>> + An empty list is not accepted.
>> +
>> + :return: a list of byte ranges (start, end). Adjacent ranges will
>> + be combined in the result.
>> + """
>> + # We need a copy of the offsets, as the caller might expect it to
>> + # remain unsorted. This doesn't seem expensive for memory at least.
>> + offsets = sorted(offsets)
>
> It's not clear from the docstring what the difference is between
> "offsets" and "byte ranges" in this context or whether there is a
> difference. It looks like you just mean to join up adjacent ranges (?)
> - if so, say so.
>
> --Martin
Requests to readv() are given as (start, length), but byte ranges are
(start, end). The function also collapses overlapping/adjacent ranges (I
could argue we could go one better and collapse them if they are 'close
enough').
An example probably makes it most obvious (and we can add this to
the docstring):
Examples:
[(5, 3)] => [(5, 8)]
[(5, 3), (6, 2)] => [(5, 8)]
[(5, 3), (8, 2)] => [(5, 10)]
[(5, 3), (2, 1)] => [(2, 3), (5, 8)]
Would that be better?
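For reference, the conversion described above can be sketched roughly like this. It is a simplified illustration, not the code from the patch: the `max_gap` parameter is a hypothetical way to express the 'close enough' idea, and `range_header` is a made-up helper showing how exclusive ends would map onto HTTP's inclusive byte ranges.

```python
def offsets_to_ranges(offsets, max_gap=0):
    """Turn (start, size) pairs into merged (start, end) ranges.

    end is exclusive here, matching the examples in this mail.
    max_gap is hypothetical: ranges separated by at most that many
    bytes are merged as well ("close enough").
    """
    if not offsets:
        raise ValueError("an empty list is not accepted")
    ranges = []
    # sorted() returns a copy, so the caller's list stays untouched.
    for start, size in sorted(offsets):
        end = start + size
        if ranges and start <= ranges[-1][1] + max_gap:
            # Overlapping or adjacent: extend the previous range.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [tuple(r) for r in ranges]


def range_header(ranges):
    """Format a Range header value; HTTP byte ranges are inclusive."""
    return "bytes=" + ",".join(f"{s}-{e - 1}" for s, e in ranges)


print(offsets_to_ranges([(5, 3), (2, 1)]))
print(range_header(offsets_to_ranges([(5, 3), (2, 1)])))
```

With exclusive ends, (5, 3) and (8, 2) touch at byte 8, so this sketch merges them into (5, 10). Passing max_gap=2 would also merge (2, 1) and (5, 3) into (2, 8), at the cost of fetching two bytes nobody asked for.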
John
=:->