[MERGE] Ensure failed operations leave remote transport in clean state

Tue Sep 4 07:04:56 BST 2007

BB:resubmit

Aaron Bentley wrote:
[...]
> Since transports are cached, it seems logical to me that failed
> operations must never leave them in a bad state.  If failed operations
> left a transport in a bad state, further operations could select that
> transport from the cache, causing it to fail in bizarre ways when
> perhaps it shouldn't fail at all.
> 
> This is not a theoretical issue.  I encountered it with multi-pull.
> 
> This patch corrects the behavior of smart transports so that if they
> fail due to DNS resolution problems, they can still be reused (likely to
> fail again, of course).

This is really the start of a fix rather than a fix.  There are a lot of cases
you don't cover (as reflected by the minimal test coverage).

So I'm in two minds about this.  It might be a good incremental step towards a
comprehensive fix, but at the moment I'm not sure.

I'm curious to know more about the problem you saw using multi-pull.  Was the
problem there intermittent DNS failures or something else? 

> === modified file 'bzrlib/smart/medium.py'
> --- bzrlib/smart/medium.py	2007-07-31 21:20:02 +0000
> +++ bzrlib/smart/medium.py	2007-08-31 18:43:39 +0000
> @@ -268,7 +268,12 @@
>          """
>          if self._state != "writing":
>              raise errors.WritingCompleted(self)
> -        self._accept_bytes(bytes)
> +        try:
> +            self._accept_bytes(bytes)
> +        except:
> +            self.finished_writing()
> +            self.finished_reading()
> +            raise

This is in SmartClientMediumRequest.accept_bytes.  But that's not the only point
that might encounter a temporary connection error.  What about
SmartClientMediumRequest.read_bytes?  If this change is correct, then it would
need a similar change too I think.  In fact, so would finished_writing: as
described in the docstrings some implementations of SmartClientMediumRequest
will just buffer bytes in accept_bytes and not try to send them until
finished_writing (e.g. SmartClientHTTPMediumRequest).

So I think this is an incomplete fix.

> === modified file 'bzrlib/smart/protocol.py'
> --- bzrlib/smart/protocol.py	2007-08-08 14:34:56 +0000
> +++ bzrlib/smart/protocol.py	2007-08-31 18:43:39 +0000
> @@ -400,7 +400,13 @@
>              # TODO: this is inefficient - but tuples are short.
>              new_char = self._request.read_bytes(1)
>              line += new_char
> -            assert new_char != '', "end of file reading from server."
> +            try:
> +                assert new_char != '', "end of file reading from server."
> +            except:
> +                self._request._medium.disconnect()
> +                self._request.finished_reading()
> +                raise
> +
>          return line

I agree with Martin's comment.  There's no reason for this to be a try/except
rather than an if.

Also it should be an explicit raise rather than an assert statement.  This is a
problem even without your change, but again your fix isn't complete unless
addresses this.  Tests should pass under “python -O” after all.  A simple “raise
BzrError(...)” would be ok with me.

And like SmartClientMediumRequest, _recv_line isn't the only place that should
be changed.  For instance, what about disconnections while reading the body?
For SmartClientHTTPMediumRequest at least (which is connectionless) that should
not victimise later requests.

I wonder if there's a better point we can intercept interruptions and reset
things... perhaps SmartClientRequestProtocolOne should have a _recv_bytes method
that it calls instead of self._request.read_bytes, so that it can detect
end-of-file in just one place?

> === modified file 'bzrlib/tests/test_smart_transport.py'
> --- bzrlib/tests/test_smart_transport.py	2007-08-02 06:40:58 +0000
> +++ bzrlib/tests/test_smart_transport.py	2007-08-31 18:43:39 +0000
> @@ -538,6 +538,27 @@
>          self.assertIsInstance(client_medium, medium.SmartClientMedium)
>  
>  
> +class TestNoHost(tests.TestCase):
> +
> +    def fail_cleanly(self, url):
> +        """Ensure that when an operation fails, the transport can be reused"""
> +        t = get_transport(url)
> +        from socket import gaierror
> +        self.assertRaises((gaierror, errors.ConnectionError, AssertionError),
> +                          t.get, 'foo')
> +        self.assertRaises((gaierror, errors.ConnectionError, AssertionError),
> +                          t.get, 'foo')

The convention for naming custom assertion methods is “assert...”.

> +
> +    def test_bzr_fails_cleanly(self):
> +        self.fail_cleanly('bzr://hosts cannot have spaces')
> +
> +    def test_http_fails_cleanly(self):
> +        self.fail_cleanly('bzr+http://hosts cannot have spaces')
> +
> +    def test_ssh_fails_cleanly(self):
> +        self.fail_cleanly('bzr+ssh://hosts cannot have spaces')

I don't particularly like this test.  It doesn't really assert that it “fails
cleanly”.  It asserts that it'll get the same error repeatedly, but that's a
pretty indirect and not entirely convincing way to test that a temporary
connection problem doesn't leave the transport in a bad state.  It doesn't show
that a subsequent call can succeed after an error during a previous call.

I worry a little that a sufficiently wacky system configuration could lead to a
DNS lookup of “hosts cannot have spaces” succeeding.  Most of the test suite is
pretty well insulated from system (mis)configuration, and it'd be nice to keep
that.  Not a big deal though.

It's also a bit ugly that so many different exceptions might be raised, although
that's a problem with the code being tested rather than the test.

Probably my biggest concern is that this is testing at the wrong layer.  I'm ok
with this test as a simple smoketest for RemoteTransport, but I think there
should be some more direct unit tests for the changes to the
SmartClientMediumRequest and SmartClientRequestProtocolOne.  There are lots of
cases that should be tested (and fixed) that are best tested directly on the
medium/protocol.  For instance it would be unnecessarily complex to test what
happens during disconnection during reading a response body over HTTP via the
transport API vs. directly testing the logic in the medium/protocol.

-Andrew.