Network glitches costing 15 minutes a pop

Fri Aug 15 15:54:42 BST 2008

>>>>> "Mark" == Mark Hammond <mhammond at skippinet.com.au> writes:

<snip/>
    Mark> 937.223  Exception ShortReadvError(): readv() read unknown bytes rather than
    Mark> unknown bytes at unknown for

As you noticed in a later mail, this comes from bzrlib/transport/http/_pycurl.py:

            elif e[0] == CURLE_PARTIAL_FILE:
                # Pycurl itself has detected a short read.  We do not have all
                # the information for the ShortReadvError, but that should be
                # enough
                raise errors.ShortReadvError(url,
                                             offset='unknown', length='unknown',
                                             actual='unknown',
                                             extra='Server aborted the request')
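
To make the link with the logged message explicit, here is a minimal
stand-alone sketch (Python 3, not the actual bzrlib code). The
ShortReadvError class below is a stand-in whose message template is an
assumption guessed from the trace quoted above, and
translate_curl_error() is a hypothetical helper. The only hard fact is
that libcurl's CURLE_PARTIAL_FILE is error number 18, which is the
number in the 'got pycurl error: 18' messages.

```python
# Stand-alone sketch: neither bzrlib nor pycurl is imported.
CURLE_PARTIAL_FILE = 18  # libcurl: the transfer ended before all data arrived


class ShortReadvError(Exception):
    """Stand-in for the bzrlib error; the template is an assumption."""

    _fmt = ('readv() read %(actual)s bytes rather than %(length)s bytes '
            'at %(offset)s for "%(path)s": %(extra)s')

    def __init__(self, path, offset, length, actual, extra=''):
        fields = dict(path=path, offset=offset, length=length,
                      actual=actual, extra=extra)
        Exception.__init__(self, self._fmt % fields)


def translate_curl_error(url, errno):
    """Hypothetical helper: map a libcurl error number to an exception."""
    if errno == CURLE_PARTIAL_FILE:
        # pycurl only tells us the read was short, not where or by how much,
        # so every numeric field is filled with a placeholder.
        return ShortReadvError(url, offset='unknown', length='unknown',
                               actual='unknown',
                               extra='Server aborted the request')
    return None  # other error numbers are out of scope for this sketch
```

With the placeholders filled in, the message necessarily reads "read
unknown bytes rather than unknown bytes at unknown", which is why the
trace carries no usable offsets.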

<snip/>

    Mark> Note the 15 minute gap before the 'got pycurl error: 18' messages.

As Robert noticed, this sounds like a timeout.

I've often run into 15-minute network timeouts *as a user*, but
I've never been able to find where they come from :-(

    Mark> Off the top of my head, I see at least 1 such error
    Mark> every 3rd time pulling from Launchpad.

Then you're welcome to provide some Wireshark traces (I
understand that they can be hard for you to get) :-/

    Mark> It seems to me that little network glitches aren't
    Mark> particularly unexpected - but waiting 15 minutes when
    Mark> it happens isn't that friendly.

    Mark> Is this something specific to Windows?  Specific to
    Mark> pycurl?

Little is known about it: AFAIK you're the first one to report
that behavior with such a high frequency. It may be pycurl, it
may be Windows; I'd prefer to avoid guessing without more data.

    Mark> Any suggestions about what we can do to make such
    Mark> errors have less of an impact?

Yes.

Since you:
- don't use a proxy,
- don't need NTLM authentication,
- don't need to verify https certificates,

try urllib instead.

Either by using http+urllib: instead of plain http: or by using
the following plugin:

,----
| from bzrlib import transport
| 
| 
| transport.register_lazy_transport('http://', 'bzrlib.transport.http._urllib',
|                                   'HttpTransport_urllib')
| transport.register_lazy_transport('https://', 'bzrlib.transport.http._urllib',
|                                   'HttpTransport_urllib')
`----

which will make the urllib implementation become the default
instead of pycurl for http.
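
For the first option, the change is purely a matter of rewriting the
URL scheme. A small sketch of that rewrite, assuming a hypothetical
helper named force_urllib() (it is not part of bzrlib) and handling
only the http+urllib: form mentioned above:

```python
def force_urllib(url):
    """Hypothetical helper: turn http://... into http+urllib://...

    Any URL that does not start with plain http:// is returned unchanged.
    """
    prefix = 'http://'
    if url.startswith(prefix):
        return 'http+urllib://' + url[len(prefix):]
    return url
```

The resulting URL is what you would pass to bzr in place of the
original one; the host name in any example URL is of course a
placeholder.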

As shown above, pycurl doesn't give us precise enough information
about *when* this is occurring; urllib will at least be more
precise.

As John mentioned in a later mail, we also have a strange
select/poll error on Linux with pycurl.

I call that one a "Loch Ness Monster" bug: some claim they have
seen it but nobody has proof (i.e. a recipe to reproduce it).

Somehow, sometimes, bzr as an http client is waiting for a packet
while the server is waiting for some ack before sending another
packet.

It may well be that you're seeing a slightly different symptom
of the same cause: client and server are out of sync and,
depending on a yet-to-be-identified cause, either the client or
the server aborts the connection before the other does.
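
The "client waits for a packet that never comes" half of that picture
can be reproduced in isolation. The sketch below is only an
illustration with plain sockets (Python 3); it makes no claim about
what bzr, pycurl or urllib actually do, and read_with_deadline() is a
hypothetical name. It shows how an explicit socket timeout bounds such
a wait instead of leaving it to whatever default produces the 15
minutes.

```python
import socket


def read_with_deadline(sock, nbytes, seconds):
    """Hypothetical helper: read up to nbytes, giving up after `seconds`.

    Returns the bytes received, or None if the peer stayed silent.
    """
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None


if __name__ == '__main__':
    client, server = socket.socketpair()
    # The "server" end sends nothing, so the read gives up after 0.5s.
    print(read_with_deadline(client, 16, 0.5))
    client.close()
    server.close()
```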

       Vincent