[RFC] Multipart support for _urllib_
John Arbash Meinel
john at arbash-meinel.com
Sat Jun 17 13:47:14 BST 2006
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Michael Ellerman wrote:
> Hi guys,
>
> I've cleaned up the work I did to do multipart HTTP. Actually I
> started from scratch, I think it's reasonably readable now.
>
> For the moment it doesn't handle the case of sending too many ranges
> to the server. Now that we sort the offsets, we get much better
> combining. I'm seeing at most 16 ranges get sent, which should be ok
> for most servers you'd hope. To be robust we should probably support
> falling back to a full request if we get a 400 for a range request.
>
> Currently I've hooked it into urllib, because I can't for the life of
> me work out how to get the "Content-range" header out of pycurl. I
> don't think it can be done in fact. So hooking this up to pycurl will
> require a kludge, if we get back a non-multipart 206 we'll just have
> to assume it contains the union of all the ranges we requested plus
> any gaps.
>
> It doesn't do connection sharing or any of that, we can sort that out
> later.
>
> I haven't written any tests for this yet, that requires writing a
> python server that does multipart responses, sigh. It seems to work ok
> though, I can branch bzr.dev :)
>
> Also at http://michael.ellerman.id.au/bzr/branches/http
>
> cheers
>
> ps. Don't forget you need "http+urllib://foo" urls to test this!
>
I'm very interested in this. I've already started doing some of my own
benchmarks, so we'll see where we can get. Just a few comments.
1) You sent a bundle, but not against anything else. This is okay, but
it means we can't merge you (I'm missing
michael at ellerman.id.au-20060608162722-df63ead19775ce65)
When you really want bundles, you should use "bzr bundle $path/bzr.dev"
That way we have everything needed to apply your ancestry against bzr.dev.
(You may already know this and just chose to send a simplified bundle)
...
Why did you decide to do this as a 'seek + read(size)' rather than
readv()? Just so it would be more 'file-like'?
> + def read(self, size):
> + """Read size bytes from the current position in the file.
> +
> + Reading across ranges is not supported.
> + """
> + # find the last range which has a start <= pos
> + i = bisect(self._ranges, self._pos) - 1
> +
> + if i < 0 or self._pos > self._ranges[i]._ent_end:
> + raise TransportError("Range response does not contain any data "
> + "at offset %d for %s!" % (self._pos, self._path))
> +
> + r = self._ranges[i]
> +
> + mutter('found range %s %s for pos %s', i, self._ranges[i], self._pos)
> +
> + if (self._pos + size - 1) > r._ent_end:
> + raise TransportError("Read past end of range (%s) at %d size "
> + "%d for %s!" % (r, self._pos,
> + size, self._path))
> +
> + start = r._data_start + (self._pos - r._ent_start)
> + end = start + size
> + mutter("range read %d bytes at %d == %d-%d", size, self._pos,
> + start, end)
> + return self._data[start:end]
> +
> + def seek(self, offset, whence=0):
> + if whence == 0:
> + self._pos = offset
> + elif whence == 1:
> + self._pos += offset
> + elif whence == 2:
> + self._pos = self._len + offset
> + else:
> + raise ValueError("Invalid value %s for whence." % whence)
> +
> + if self._pos < 0:
> + self._pos = 0
You need another space here.
> +
> +class HttpRangeResponse(RangeFile):
> + """A single-range HTTP response."""
...
> + def _offsets_to_ranges(self, offsets):
> + """Turn a list of offsets and sizes into a list of byte ranges.
> +
> + :param offsets: A list of tuples of (start, size).
> + An empty list is not accepted.
> +
> + :return: a list of byte ranges (start, end). Adjacent ranges will
> + be combined in the result.
> + """
You can use the python2.4 idioms:
import operator
offsets = sorted(offsets, key=operator.itemgetter(0))
Though if you sort a tuple with (start, end), it seems like you might as
well just call:
offsets = sorted(offsets)
That way you guarantee that the one with the further back 'end' comes
second.
> + # We need a copy of the offsets, as the caller might expect it to
> + # remain unsorted. This doesn't seem expensive for memory at least.
> + offsets = offsets[:]
> + offsets.sort(key=lambda i: i[0])
> +
> + start, size = offsets[0]
> + prev_end = start + size - 1
> + combined = [[start, prev_end]]
...
> + def readv(self, relpath, offsets):
> + """Get parts of the file at the given relative path.
> +
> + :param offsets: A list of (offset, size) tuples.
> + :param return: A list or generator of (offset, data) tuples
> + """
> + mutter('readv of %s [%s]', relpath, offsets)
> + ranges = self._offsets_to_ranges(offsets)
> + code, f = self._get(relpath, ranges)
> + for start, size in offsets:
> + f.seek(start, 0)
> + data = f.read(size)
> + assert len(data) == size
> + yield start, data
This is where it would seem to make more sense to just make the readv
call into the RangeFile rather than a bunch of seek + read calls.
> +
> + def _is_multipart(self, content_type):
> + return content_type.startswith('multipart/byteranges;')
> +
> + def _handle_response(self, path, response):
> + """Interpret the code & headers and return a HTTP response.
> +
> + This is a factory method which returns an appropriate HTTP response
> + based on the code & headers it's given.
> + """
> + content_type = response.headers['Content-Type']
> + mutter('handling response code %s ctype %s', response.code,
> + content_type)
> +
> + if response.code == 206 and self._is_multipart(content_type):
> + # Full fledged multipart response
> + return HttpMultipartRangeResponse(path, content_type, response)
> + elif response.code == 206:
> + # A response to a range request, but not multipart
> + content_range = response.headers['Content-Range']
> + return HttpRangeResponse(path, content_range, response)
> + elif response.code == 200:
> + # A regular non-range response, unfortunately the result from
> + # urllib doesn't support seek, so we wrap it in a StringIO
> + return StringIO(response.read())
> + elif response.code == 404:
> + raise NoSuchFile(path)
^- maybe it was this code path (since you return a StringIO object). But
it seems like you could just wrap whatever object was returned.
It probably doesn't matter much. It just seemed like you took a lot of
effort to conform to a plain file api. To do something that isn't really
supported by the file api (readv()).
As I said, I'm doing some performance testing, and I'll let you know how
it goes.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFEk/nSJdeBCYSNAAMRAgb5AKCIWGiyXK+BYEdCW9ookg8M+3LqGACfcbTo
oW+gWO6v/IAdCzRNn1z5BVo=
=CQLo
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list