[xBBxMERGE] Fix test suite regression on OSX in TestReadMergeableFromUrl.test_smart_server_connection_reset.

Fri Oct 3 08:25:45 BST 2008

>>>>> "Andrew" == Andrew Bennetts <andrew at canonical.com> writes:

    Andrew> Vincent Ladeuil wrote:
    >> >>>>> "Andrew" == Andrew Bennetts <andrew at canonical.com> writes:
    >> 
    Andrew> Andrew Bennetts has voted tweak.
    Andrew> Status is now: Conditionally approved
    Andrew> Comment:
    Andrew> A test for this would be nice.
    >> 
    >> Reminder: this test failed on OS X and prompted me to propose
    >> this fix :-)

    Andrew> Sure.  I should have said (as I did on IRC) that a
    Andrew> test that fails reliably on all/most platforms would
    Andrew> be nice.

Sorry, Freudian slip here, I dream about such a test myself,
since... well basically since I first encounter a different
behavior between linux and OSX regarding sockets from python
(i.e. the very first timeS I ran bzr selftest).

But these differences seem to be in the way cycles are dispatched
between threads and waaaay out of my control and I can't think of
a good way to emulate that.

So I think I just censored the "reliably on all/most platforms"
part, because I know for years that trying to do that is not
worth the effort :)

    Andrew> Or at least, on the platform I use ;)

    Andrew> A test that fails intermittently on one platform is
    Andrew> better than nothing, though.

But after sending that reply an old idea came back haunting me:
what we really need is more test servers where we can control and
simulate errors at any point in time (TestWallServer,
TestBadStatusServer, TestInvalidStatusServer,
TestBadProtocolServer, TestForbiddenServer and
_DisconnectingTCPServer being some simple examples of that).

    Andrew> I think you could probably make a good one using the
    Andrew> _DisconnectingTCPServer in test_bundle.
    >> 
    >> I think so too :)
    >> 
    >> But as said on IRC, I will look into doing the same for other
    >> smart mediums.
    >> 
    >> ...
    >> 
    >> Well, that doesn't work as expected to say the least. Roughly
    >> speaking the model you used for _DisconnectingTCPServer doesn't
    >> work for http (even if test_http.WallServer existed and provides
    >> the same service than _DisconnectingTCPServer), since errors are
    >> handled in two different phases:
    >> - the request headers handling (i.e. during get()),
    >> - the body reading handling (i.e during subsequent read()).

    Andrew> Right, this is why I was saying on IRC that these
    Andrew> tests will need to be unique to each medium
    Andrew> implementation, because the details of what
    Andrew> “connection interrupted/finished” means is very
    Andrew> dependent on the medium.

I agree.

    >> The equivalent phases are handled the same way for bzr over TCP
    >> since all bytes are read via read_bytes.
    >> 
    >> But for http, the connection reset is already either a
    >> ConnectionError or an InvalidHttpResponse in the headers phase
    >> where read_bytes is not involved at all.
    >> 
    >> So I'm back at my earlier question (in what circumstances is the
    >> smart server involved in reading a *bundle*) but waiting for a
    >> higher level answer to understand why it is used that way.

    Andrew> Short answer: It's involved in reading a bundle when
    Andrew> the user asks bzr to read one from a smart server :)

Sorry, wrong formulation again, I understand the mean, what I
want to understand is *why* someone would want a bundle from a
smart server ?

Any dumb server may provide that, I'm not against allowing it,
but when I want a newspaper I don't go to the bakery, that's
nice if the newspaper is available there, but yet...

<snip/>

    >> Because it seems to me that special casing ConnectionReset like
    >> you did will lead to more collisions down the road.
    >> 
    >> I'll merge the previous fix to avoid OSX support bitrot but I
    >> think we should keep discussing and complete that fix.
    >> 
    >> I think there is more than appears behind this collision between
    >> hpss and http:
    >> 
    >> - http lacks some error handling in the body reading phase (but
    >> since it provides a file-like for the body to upper layers, it
    >> need help from upper layers),

    Andrew> Right; any request that returns results incrementally
    Andrew> can fail not just on the first part of the response
    Andrew> but any part of the response.  Returning a file-like
    Andrew> object is an example of this.

Yes. That's where I say we need support from higher levels if we
want to be able to retry something (in whole or partially as
http.readv[1] does).

    >> - we may need to define more specific (regarding causes) and more
    >> generic (regarding the medium involved, be it http, ssh, or raw
    >> TCP) exceptions to unify the error handling (the except clauses
    >> in bundle.read_mergeable_from_transport start to become
    >> impressive, speaking of which jam made a comment there
    >> regarding sftp which is close to the problem I just encountered
    >> (exception during get() or exception during read()).

    Andrew> So I think the three basic events that matter for
    Andrew> HPSS media, at least for stream-based media, are:

    Andrew>   * connection could not be established
    Andrew>   * connection was interrupted (i.e. closed prematurely)
    Andrew>   * connection finished cleanly

For transports we have:

* connection could not be established
* connection was interrupted:
  - during the request phase
  - during the reading phase

There is no end for the connection :)

BUT, there is some retry mechanism if the connection is lost
during the request phase (that mechanism has been lost for some
parts of the reading phase in the http case when streaming was
introduced).

So I think that making a distinction between 'can't connect' and
'connection lost' allows different behaviors and should be kept
(and reconnection full support should be restored in a distant
future, say... when we have time for that :)

    Andrew> We currently treat both of the first two as
    Andrew> ConnectionReset, and I think that's a reasonable
    Andrew> simplification: either way the connection failed, and
    Andrew> we should give up and tell the user about it.  Do you
    Andrew> think that is inadequate?

Inadequate is too strong, the fact is we already distinguish the
first two in some places (http at least) and that makes it harder
to reconcile both.

    Andrew> On the topic of the except clauses in
    Andrew> read_mergeable_from_transport, perhaps the problem
    Andrew> there is that ConnectionReset is in the wrong place
    Andrew> in our exception hierarchy.  If it wasn't a
    Andrew> TransportError, then it wouldn't need special
    Andrew> handling.

That's a step in the right direction then :)

I don't really know how to continue from there, except by trying
to find an agreement on a model compatible with all the
requirements for transports and mediums.

The connection sharing mechanism is designed so that any
transport can reconnect if the connection is lost. Only http does
that, yet, ssh seems reliable enough to not require it, but many
users will just love to have it available for ftp (I have plans
for thathttps://bugs.launchpad.net/bzr/+bug/127164 , but no time
so far).

You said that user should be informed instead of reconnecting in
the smart server case. There are valid reasons to do so, but I
think there are also valid reasons to reconnect. Some users have
sub optimal internet connections (say wifi) and reconnecting
really help them.

Yet, reconnecting also has some drawbacks and some bugs has been
harder to diagnose because of that... 

We may need to find better ways to recognize connection losses
due to "our" bugs and those due to network problems.

       Vincent

[1]: When I made readv streamed its results I fail to realize
that I was escaping the exception handling that implements the
retry mechanism for some parts of the processing.