bzr log http+urllib does not work, http+pycurl is too slow
Vincent Ladeuil
v.ladeuil+lp at free.fr
Wed Dec 12 16:37:13 GMT 2007
>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:
john> Vincent Ladeuil wrote:
john> ...
>> That, my friend, is some very bad news.... Well, the good news is
>> that the patch fixes the bug at least...
>>
>> And sorry for asking again, but can you do that one more time
>> with -Dhttp so that I can diagnose more easily. There may be
>> several GET requests for one readv (that should not be the case
>> here, but seeing the ranges requested may help evaluate the
>> wasted bandwidth :/ ).
>>
>> We have a bad situation here, because even if the http transport
>> can reuse the whole file transferred inside one readv, it will
>> not be able to reuse that file *across* several readv if we don't
>> add a local cache (which we want to avoid for several reasons).
>>
>> Vincent
john> Did you see my earlier comment about readv() returning extra data?
Not sure what you're referring to... but if you summarized it
below, yes, I was aware of that.
What I suspected first in bialix's case (bzr log) was that somehow
the readv() caller was leaving a lot of unprocessed ranges, but the
proposed fix should be robust enough to handle that too. And
anyway, that was not the case.
john> At least for the index reads, Packs are designed to
john> allow the readv() request to buffer more data than
john> requested. This is to make it so that HTTP requests
john> buffer 64k pages, while local requests only buffer
john> 4-8k. (The whole latency versus bandwidth idea.)
Yes. But the amount of data added here is not enough to make
the network buffers explode; what triggered it in bialix's case
was far more than that.
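The buffering John describes (coalescing nearby ranges into larger
requests, with a bigger fudge factor over HTTP than locally) can be
sketched roughly like this; names and numbers are illustrative, not
bzrlib's:

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) pairs whose gap is at most max_gap bytes.

    A larger max_gap (say 64k over HTTP versus 4-8k locally) trades a
    little extra bandwidth for fewer round trips.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            new_end = max(prev_off + prev_len, offset + length)
            merged[-1] = (prev_off, new_end - prev_off)
        else:
            merged.append((offset, length))
    return merged
```

With max_gap=64, two ranges 5 bytes apart become one request while a
range 1000 bytes away stays separate.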
john> If we want to handle HTTP servers that cannot do
john> ranges, we could detect it and write to a local cache
john> that gets cleaned up (something in $TMP).
That's the idea, yes; maybe a plugin would be more appropriate for
that.
john> Then we don't have to buffer in memory. I don't really
john> like having to shutil.rmtree($TMP/foo) as part of an
john> HTTP teardown, though.
Yeah, especially since there is *no* HTTP teardown so far ;-) But
using mkstemp() may be enough.
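For the record, a temp-file cache along those lines, using mkstemp()
so there is no directory to rmtree at teardown; a sketch only, not
bzrlib code (the immediate-unlink trick is POSIX-only):

```python
import os
import tempfile


class CachedResponse(object):
    """Spool a whole-file HTTP response to an anonymous temp file.

    mkstemp() gives a single file we can unlink right after opening
    (on POSIX), so no cleanup pass is needed: the data disappears
    when the file object is closed.
    """

    def __init__(self, data):
        fd, path = tempfile.mkstemp(prefix='bzr-http-cache-')
        self._file = os.fdopen(fd, 'w+b')
        os.unlink(path)  # anonymous: space reclaimed on close
        self._file.write(data)

    def read_range(self, offset, length):
        """Serve one readv range out of the cached body."""
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()
```

On Windows the unlink of an open file fails, so a real implementation
would need a different cleanup strategy there.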
john> But yes, it is known that the plain Python HTTP server
john> cannot do ranges. It was one of the test cases in the
john> past, to make sure we at least keep working in those
john> situations.
It's still there: it's called TestNoRangeRequestServer and it
ignores the ranges. But it can't be used to reproduce the bug.
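The behaviour that server exercises (ignoring the Range header and
always answering 200 with the whole body) boils down to something
like this; a sketch using the modern http.server module, not the
actual test code:

```python
from http.server import BaseHTTPRequestHandler


class NoRangeHandler(BaseHTTPRequestHandler):
    """Serve the whole file on every GET, ignoring any Range header.

    Illustrative stand-in for a range-oblivious server: it never
    answers 206 Partial Content, only 200 with the full body.
    """

    content = b'x' * 1000  # stand-in for the file being served

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', str(len(self.content)))
        self.end_headers()
        self.wfile.write(self.content)

    def log_message(self, *args):
        pass  # keep the test server quiet
```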
john> I believe HTTP says that "if ranges were requested, it
john> is valid to return the whole file". We can certainly
john> detect that (a return code of 200 versus 206).
Yes.
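Client-side, detecting that fallback is just a status-code check; a
sketch of the idea with urllib, not the _urllib2_wrappers code:

```python
import urllib.request


def fetch_ranges(url, ranges):
    """Request byte ranges and report whether the server honoured them.

    Returns (honoured, body): honoured is True on 206 Partial Content,
    False on 200, where the server sent the whole file instead, which
    HTTP explicitly allows.
    """
    spec = ','.join('%d-%d' % (off, off + length - 1)
                    for off, length in ranges)
    req = urllib.request.Request(url, headers={'Range': 'bytes=' + spec})
    with urllib.request.urlopen(req) as resp:
        return resp.status == 206, resp.read()
```

On a 200 the caller can still slice the requested ranges out of the
full body, which is exactly the reuse question discussed above.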
john> We do need to access packs a few times. We have 1
john> request for inventories, and then another to requests
john> the texts referenced by those inventories. (And I think
john> one more for the revisions, but we could bundle that
john> with the request for the texts.)
And one for the signatures.
john> What about this compromise... Write a plugin which
john> provides an HTTP transport that does the caching. Don't
john> bring it into the core, but mention it if people really
john> need to access servers that don't support range
john> requests.
+1 ;)
I really prefer to have that outside of the core.
I had to do part of the work for the transportstats plugin
anyway.
john> Oh, and get Vincent's great urllib2 work merged into
john> upstream, so that they can have a real HTTP server
john> built-in :).
Nice try, but you want http_server.py, not _urllib2_wrappers ;-)
And wait a bit more, because a true HTTP 1.1 server is around the
corner...
Vincent