[RFC] Multipart support for _urllib_

Sat Jun 17 13:47:14 BST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Ellerman wrote:
> Hi guys,
> 
> I've cleaned up the work I did to do multipart HTTP. Actually I
> started from scratch, I think it's reasonably readable now.
> 
> For the moment it doesn't handle the case of sending too many ranges
> to the server. Now that we sort the offsets, we get much better
> combining. I'm seeing at most 16 ranges get sent, which should be ok
> for most servers you'd hope. To be robust we should probably support
> falling back to a full request if we get a 400 for a range request.
> 
> Currently I've hooked it into urllib, because I can't for the life of
> me work out how to get the "Content-range" header out of pycurl. I
> don't think it can be done in fact. So hooking this up to pycurl will
> require a kludge, if we get back a non-multipart 206 we'll just have
> to assume it contains the union of all the ranges we requested plus
> any gaps.
> 
> It doesn't do connection sharing or any of that, we can sort that out
> later.
> 
> I haven't written any tests for this yet, that requires writing a
> python server that does multipart responses, sigh. It seems to work ok
> though, I can branch bzr.dev :)
> 
> Also at http://michael.ellerman.id.au/bzr/branches/http
> 
> cheers
> 
> ps. Don't forget you need "http+urllib://foo" urls to test this!
> 

I'm very interested in this. I've already started doing some of my own
benchmarks, so we'll see where we can get. Just a few comments.

1) You sent a bundle, but it wasn't made against a base branch. That's
okay, but it means we can't merge it (I'm missing revision
michael@ellerman.id.au-20060608162722-df63ead19775ce65).
When you want a mergeable bundle, you should use "bzr bundle $path/bzr.dev".
That way we have everything needed to apply your ancestry against bzr.dev.
(You may already know this and just chose to send a simplified bundle.)

...

Why did you decide to do this as a 'seek + read(size)' rather than
readv()? Just so it would be more 'file-like'?

> +    def read(self, size):
> +        """Read size bytes from the current position in the file.
> +
> +        Reading across ranges is not supported.
> +        """
> +        # find the last range which has a start <= pos
> +        i = bisect(self._ranges, self._pos) - 1
> +
> +        if i < 0 or self._pos > self._ranges[i]._ent_end:
> +            raise TransportError("Range response does not contain any data "
> +                   "at offset %d for %s!" % (self._pos, self._path))
> +
> +        r = self._ranges[i]
> +
> +        mutter('found range %s %s for pos %s', i, self._ranges[i], self._pos)
> +
> +        if (self._pos + size - 1) > r._ent_end:
> +            raise TransportError("Read past end of range (%s) at %d size "
> +                                 "%d for %s!" % (r, self._pos,
> +                                 size, self._path))
> +
> +        start = r._data_start + (self._pos - r._ent_start)
> +        end   = start + size
> +        mutter("range read %d bytes at %d == %d-%d", size, self._pos,
> +                start, end)
> +        return self._data[start:end]
> +
> +    def seek(self, offset, whence=0):
> +        if whence == 0:
> +            self._pos = offset
> +        elif whence == 1:
> +            self._pos += offset
> +        elif whence == 2:
> +            self._pos = self._len + offset
> +        else:
> +            raise ValueError("Invalid value %s for whence." % whence)
> +
> +        if self._pos < 0:
> +            self._pos = 0

You need another blank line here, before the class definition.

> +
> +class HttpRangeResponse(RangeFile):
> +    """A single-range HTTP response."""

...

> +    def _offsets_to_ranges(self, offsets):
> +        """Turn a list of offsets and sizes into a list of byte ranges.
> +
> +        :param offsets: A list of tuples of (start, size).
> +        An empty list is not accepted.
> +
> +        :return: a list of byte ranges (start, end). Adjacent ranges will
> +        be combined in the result.
> +        """

You can use the Python 2.4 idiom:
import operator
offsets = sorted(offsets, key=operator.itemgetter(0))

Though since you're sorting (start, size) tuples, it seems like you might
as well just call:
offsets = sorted(offsets)

That way, when two entries have the same start, you guarantee that the
one that ends later comes second.
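
To make the tie-breaking point concrete, here is a rough sketch of the
sort-then-combine step. It is not the code from the patch: the helper name
combine_offsets is made up for illustration, and it assumes (start, size)
input and inclusive (start, end) output as described in the docstring above.

```python
import operator


def combine_offsets(offsets):
    """Coalesce (start, size) tuples into inclusive (start, end) ranges.

    Hypothetical helper for illustration; not the patch's
    _offsets_to_ranges.
    """
    if not offsets:
        raise ValueError("an empty list is not accepted")
    combined = []
    # sorted() returns a new list, so the caller's list keeps its order.
    for start, size in sorted(offsets):
        end = start + size - 1
        if combined and start <= combined[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            combined[-1][1] = max(combined[-1][1], end)
        else:
            combined.append([start, end])
    return [tuple(r) for r in combined]


offsets = [(10, 10), (0, 10), (0, 5), (30, 5)]

# Sorting on the first item only is stable, so ties keep input order.
by_start = sorted(offsets, key=operator.itemgetter(0))
# Sorting whole tuples puts the shorter entry first when starts are equal.
by_tuple = sorted(offsets)

ranges = combine_offsets(offsets)
```

Because this sketch takes max() when extending a range, the order of ties
does not change its result. Code that only tracks the end of the most
recent entry is where the plain sorted(offsets) ordering matters.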

> +        # We need a copy of the offsets, as the caller might expect it to
> +        # remain unsorted. This doesn't seem expensive for memory at least.
> +        offsets = offsets[:]
> +        offsets.sort(key=lambda i: i[0])
> +
> +        start, size = offsets[0]
> +        prev_end = start + size - 1
> +        combined = [[start, prev_end]]

...

> +    def readv(self, relpath, offsets):
> +        """Get parts of the file at the given relative path.
> +
> +        :param offsets: A list of (offset, size) tuples.
> +        :param return: A list or generator of (offset, data) tuples
> +        """
> +        mutter('readv of %s [%s]', relpath, offsets)
> +        ranges = self._offsets_to_ranges(offsets)
> +        code, f = self._get(relpath, ranges)
> +        for start, size in offsets:
> +            f.seek(start, 0)
> +            data = f.read(size)
> +            assert len(data) == size
> +            yield start, data

This is where it would seem to make more sense to pass the readv call
straight through to the RangeFile, rather than making a bunch of
seek + read calls.
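
Something like the following is what I have in mind. It is only a sketch:
the class name RangeData and its chunk layout are invented for illustration
and are not the RangeFile from the patch. It assumes each range is held as
(ent_start, data), where ent_start is the offset within the remote file.

```python
from bisect import bisect_right


class RangeData:
    """Hypothetical stand-in for a range response; names are illustrative.

    Holds (ent_start, data) chunks sorted by ent_start.
    """

    def __init__(self, chunks):
        self._chunks = sorted(chunks)
        self._starts = [start for start, _ in self._chunks]

    def readv(self, offsets):
        """Yield (offset, data) for each requested (offset, size) tuple."""
        for offset, size in offsets:
            # Find the last chunk whose start is <= offset.
            i = bisect_right(self._starts, offset) - 1
            if i < 0:
                raise ValueError("no data at offset %d" % offset)
            ent_start, data = self._chunks[i]
            begin = offset - ent_start
            if begin + size > len(data):
                # Reading across ranges is not supported here either.
                raise ValueError("read past end of range at offset %d "
                                 "size %d" % (offset, size))
            yield offset, data[begin:begin + size]


rf = RangeData([(100, b"abcdefghij"), (0, b"0123456789")])
parts = list(rf.readv([(102, 3), (0, 4)]))
```

The caller then iterates over the result directly, with no position state
to keep in sync between seek and read.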

> +
> +    def _is_multipart(self, content_type):
> +        return content_type.startswith('multipart/byteranges;')
> +
> +    def _handle_response(self, path, response):
> +        """Interpret the code & headers and return a HTTP response.
> +
> +        This is a factory method which returns an appropriate HTTP response
> +        based on the code & headers it's given.
> +        """
> +        content_type = response.headers['Content-Type']
> +        mutter('handling response code %s ctype %s', response.code,
> +            content_type)
> +
> +        if response.code == 206 and self._is_multipart(content_type):
> +            # Full fledged multipart response
> +            return HttpMultipartRangeResponse(path, content_type, response)
> +        elif response.code == 206:
> +            # A response to a range request, but not multipart
> +            content_range = response.headers['Content-Range']
> +            return HttpRangeResponse(path, content_range, response)
> +        elif response.code == 200:
> +            # A regular non-range response, unfortunately the result from
> +            # urllib doesn't support seek, so we wrap it in a StringIO
> +            return StringIO(response.read())
> +        elif response.code == 404:
> +            raise NoSuchFile(path)

^- Maybe this code path is why you went with seek + read (since you
return a StringIO object). But it seems like you could just wrap whatever
object was returned.
It probably doesn't matter much. It just seemed like you put a lot of
effort into conforming to a plain file API in order to do something that
the file API doesn't really support (readv()).
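
By wrapping I mean roughly this: a small adapter that gives readv
behaviour to any seekable file-like object, so the full-body 200 case can
share an interface with the range responses. This is a sketch only. The
function name readv_from_file is hypothetical and not part of the patch,
and it uses io.BytesIO where the patch uses StringIO.

```python
import io


def readv_from_file(f, offsets):
    """Yield (offset, data) tuples from any seekable file-like object.

    Hypothetical adapter for illustration; not part of the patch.
    """
    for offset, size in offsets:
        f.seek(offset, 0)
        data = f.read(size)
        if len(data) != size:
            raise ValueError("short read at offset %d: wanted %d bytes, "
                             "got %d" % (offset, size, len(data)))
        yield offset, data


# A full (non-range) response body, wrapped so it supports seek.
body = io.BytesIO(b"0123456789abcdef")
parts = list(readv_from_file(body, [(10, 3), (0, 4)]))
```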

As I said, I'm doing some performance testing, and I'll let you know how
it goes.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEk/nSJdeBCYSNAAMRAgb5AKCIWGiyXK+BYEdCW9ookg8M+3LqGACfcbTo
oW+gWO6v/IAdCzRNn1z5BVo=
=CQLo
-----END PGP SIGNATURE-----