bzr log http+urllib does not work, http+pycurl is too slow

Vincent Ladeuil v.ladeuil+lp at free.fr
Wed Dec 12 16:37:13 GMT 2007


>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

    john> Vincent Ladeuil wrote:
    john> ...

    >> That, my friend, is some very bad news.... Well, the good news is
    >> that the patch fixes the bug at least...
    >> 
    >> And sorry for asking again, but can you do that one more time
    >> with -Dhttp so that I can diagnose more easily. There may be
    >> several GET requests for one readv (that should not be the case
    >> here, but seeing the ranges requested may help evaluate the
    >> wasted bandwidth :/ ).
    >> 
    >> We have a bad situation here, because even if the http transport
    >> can reuse the whole file transferred inside one readv, it will
    >> not be able to reuse that file *across* several readv if we don't
    >> add a local cache (which we want to avoid for several reasons).
    >> 
    >> Vincent

    john> Did you see my earlier comment about readv() returning extra data?

Not sure what you're referring to... but if you summarized it
below, yes, I was aware of that.

What I suspected first in bialix's case (bzr log) was that somehow
the readv() caller was leaving a lot of unprocessed ranges, but the
proposed fix should be robust enough to handle that too. And
anyway, that was not the case.
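
To make the "reuse inside one readv" point concrete, roughly
something like this (a toy sketch with made-up names, not the bzrlib
transport code): when the server sends the whole file anyway, a
single body can still serve every range of that one readv() call,
but nothing survives to the next call:

    # Hypothetical, simplified sketch -- not the bzrlib transport code.
    import urllib.request

    def readv_whole_file(url, offsets):
        """Yield (offset, data) for each requested (offset, length).

        The whole file is fetched once and every range of this readv()
        call is sliced from that single body.  Nothing is kept around,
        so the next readv() call pays for the full transfer again.
        """
        with urllib.request.urlopen(url) as response:
            body = response.read()          # one full-file transfer
        for offset, length in offsets:
            yield offset, body[offset:offset + length]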

    john> At least for the index reads, Packs are designed to
    john> allow the readv() request to buffer more data than
    john> requested. This is to make it so that HTTP requests
    john> buffer 64k pages, while local requests only buffer
    john> 4-8k. (The whole latency versus bandwidth idea.)

Yes. But the amount of data added here is not enough to make the
network buffers explode; what triggered it in bialix's case was
far more than that.
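
To picture the latency-versus-bandwidth trade-off, a hedged toy
helper (name and page sizes assumed, not the actual index code)
could round requested offsets out to transport-sized pages, so a
slow link pays for fewer round trips at the cost of some extra
bytes:

    # Illustration only; names and page sizes are assumed, not bzrlib's.
    def expand_to_pages(offsets, page_size):
        """Round (offset, length) requests out to page-aligned ranges."""
        pages = set()
        for offset, length in offsets:
            first = offset // page_size
            last = (offset + length - 1) // page_size
            pages.update(range(first, last + 1))
        return [(page * page_size, page_size) for page in sorted(pages)]

    # e.g. a remote index read buffers 64k pages, a local one only 4k:
    remote = expand_to_pages([(10, 100), (70000, 50)], 64 * 1024)
    local = expand_to_pages([(10, 100), (70000, 50)], 4 * 1024)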

    john> If we want to handle HTTP servers that cannot do
    john> ranges, we could detect it and write to a local cache
    john> that gets cleaned up (something in $TMP).


That's the idea, yes; maybe a plugin would be more appropriate for
that.

    john> Then we don't have to buffer in memory. I don't really
    john> like having to shutil.rmtree($TMP/foo) as part of an
    john> HTTP teardown, though.

Yeah, especially since there is *no* HTTP teardown so far ;-) But
using mkstemp() may be enough.
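
Roughly what I have in mind, as a plain-stdlib sketch (assumed
names, not a proposed patch): write the 200 response into a
mkstemp() file and unlink the name right away, so the cache
disappears when the file object is closed and no explicit teardown
is needed:

    # Hypothetical sketch; POSIX-only trick (unlinking an open file).
    import os
    import shutil
    import tempfile
    import urllib.request

    def cache_whole_response(url):
        """Download a full file into an anonymous temp file, return it open."""
        fd, path = tempfile.mkstemp(prefix='bzr-http-cache-')
        cache = os.fdopen(fd, 'w+b')
        try:
            os.unlink(path)     # anonymous from now on, cleaned up on close()
            with urllib.request.urlopen(url) as response:
                shutil.copyfileobj(response, cache)
            cache.seek(0)
            return cache        # caller seeks/reads the ranges, then close()
        except Exception:
            cache.close()
            raise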

    john> But yes, it is known that the plain Python HTTP server
    john> cannot do ranges. It was one of the test cases in the
    john> past, to make sure we at least keep working in those
    john> situations.

It's still there: it's called TestNoRangeRequestServer and it
ignores the ranges. But it can't be used to reproduce the bug.
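
For the record, the shape of such a server is more or less this toy
(an illustration only, not the actual TestNoRangeRequestServer):
drop the Range header and answer every GET with 200 and the whole
file:

    # Toy no-range server for illustration; not the bzrlib test fixture.
    import http.server

    class NoRangeHandler(http.server.SimpleHTTPRequestHandler):
        def do_GET(self):
            # Ignore any Range header: the base class then replies
            # 200 with the complete file, never 206 with a range.
            if 'Range' in self.headers:
                del self.headers['Range']
            super().do_GET()

    if __name__ == '__main__':
        http.server.HTTPServer(('localhost', 8000), NoRangeHandler).serve_forever()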

    john> I believe HTTP says that "if ranges were requested, it
    john> is valid to return the whole file". We can certainly
    john> detect that (a return code of 200 versus 206).

Yes.
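
Concretely, the detection is just a status check on the client side
(hedged example, URL made up):

    # Illustration only: tell a ranged reply (206) from a whole file (200).
    import urllib.request

    url = 'http://example.com/some.pack'    # hypothetical
    request = urllib.request.Request(url, headers={'Range': 'bytes=0-1023'})
    with urllib.request.urlopen(request) as response:
        if response.status == 206:
            data = response.read()          # server honoured the range
        else:                               # 200: the whole file came back
            data = response.read()[:1024]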

    john> We do need to access packs a few times. We have 1
    john> request for inventories, and then another to requests
    john> the texts referenced by those inventories. (And I think
    john> one more for the revisions, but we could bundle that
    john> with the request for the texts.)

And one for the signatures.

    john> What about this compromise... Write a plugin which
    john> provides an HTTP transport that does the caching. Don't
    john> bring it into the core, but mention it if people really
    john> need to access servers that don't support range
    john> requests.

+1 ;)

I'd really prefer to have that outside of the core.

I had to do part of the work for the transportstats plugin
anyway.

    john> Oh, and get Vincent's great urllib2 work merged into
    john> upstream, so that they can have a real HTTP server
    john> built-in :).

Nice try, but you want http_server.py, not _urllib2_wrappers ;-)
And wait a bit more, because a true HTTP/1.1 server is around the
corner...

        Vincent


