[bug] sftp cleanup failures

John Arbash Meinel john at arbash-meinel.com
Mon Nov 6 20:02:15 GMT 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I've been digging into the sftp cleanup bugs a little bit deeper, and
this is what I think I've found.

First, a design overview of how we are using the sftp system in the test
suite.

When we have a test that expects to need to connect over sftp, we
generally create an instance of SFTPServerWithoutSSH. This class
provides an sftp server on a given loopback port. So we don't have to do
ssh handshaking, etc. Anyway, this server class becomes assigned to
TestCase.transport_server.

When the test thinks it wants to connect, it calls something like
'self.get_url()' which calls self.get_server(). And if the server hasn't
been started yet, it calls server = TestCase.transport_server(), and
server.setUp(), and sets up server.tearDown() to be called when the test
is finished.

So far, so good. We have a way to start up the server on demand, and a
way to make sure Server.tearDown() is always called, whether the test
passes or fails.

So what does SFTPServerWithoutSSH do once it is created, and .setUp() is
called?

The first thing it does is create a thread called SocketListener,
telling it that when it gets a connection it should call
SFTPServer._run_server_entry, and then starts the thread.

SocketListener creates a blocking port when it is instantiated along
with an event it uses to detect if it should stop running.

The current bzr.dev code creates the SocketListener in 'daemon' mode, so
that the interpreter can exit before the thread quits. It turns out,
this isn't necessary, which I'll get to later.

Once the SocketListener thread is started, it selects() over the socket,
every 0.1 seconds waiting for a connection. This seems to work quite
well, as I can see it waking up every 0.1 seconds, to check if there is
any connection waiting, or if the stop event has been triggered.

It is okay that the port is blocking, because it uses select() to check
if there is any data yet.

When it gets a connection, SocketListener then spawns *another* thread
to process _run_server_entry. It says the following:

  # because the loopback socket is inline, and transports are
  # never explicitly closed, best to launch a new thread.
  thread = threading.Thread(target=self._callback, args=(s,)).start()

And now we have 1 thread which is spinning waiting for connections, and
another thread dedicated to a single connection.


I think that gives the basic overview (maybe a little too detailed :)

So I tried playing with some things, and this is what I found:

1) You can make the SocketListener thread be non-daemon, because
TestCase is good to call server.tearDown() which sets the _stop_event,
which tells SocketListener to close down.

2) Doing the above causes 'bzr selftest' to hang after cleaning up. This
is because there are 2 _run_server_entry threads that are hung. Setting
them to daemon mode causes the test suite to finish, but gives the ugly
tracebacks about failing while cleaning up. (Because they don't go to
finally until the python interpreter is shutting down).

3) The problem in (2) is because the sftp subsystem is making a direct x
= self.sock.recv(n) call in a loop which blocks until 'n' characters are
read. And in the specific tests which are causing problems, we never
actually connect to sftp, because we expect it to fail. The a couple
specific tests that fail are:
...ndleFromURL.test_read_bundle_from_url(SFTPAbsoluteServer)
....TestReadBundleFromURL.test_read_fail(SFTPAbsoluteServer)

I haven't been able to figure out why these tests are dying early. The
'test_read_fail' makes sense, since it probably isn't requesting
anything, but the 'test_read_bundle_from_url()' should be making a full
request, and reading back all of the data.

4) I tried to change the socket to have a timeout, so that we could have
the SFTP subprocess stop trying. The problem is you can't use
'makefile()' on a socket with a timeout according to the python
documentation:
http://www.python.org/doc/current/lib/socket-objects.html
  "A consequence of this is that file objects returned by the makefile()
   method must only be used when the socket is in blocking mode; in
   timeout or non-blocking mode file operations that cannot be completed
   immediately will fail"

So at this point I would recommend:

1) Change sftp so SocketListener runs in non-daemon mode, but the
_run_server_entry is run in daemon mode. It doesn't change much, but it
seems cleaner. (By my present count, we have approx 6 threads that are
hanging)

2) Try to figure out why certain tests are creating a SFTP connection,
but causing the SFTP server to hang waiting for a response.

3) I'd *really* like to change the socket to be non-blocking, but we
would need to figure out what code paths expect file-like objects. And
I've seen that bzrlib.transport.ssh uses makefile() to create the
read-and-write handles for SSH connections. (And it is used by the smart
server).

That seems to be the only place that calls 'makefile()' directly, though
I can say that other tests start to fail. Specifically:
...mentations.TransportTests.test_append(SFTPAbsoluteServer)

Instead of cleaning up and getting the _stop_event() flag being set, it
just spins, waiting until I set up another timeout. And when that
triggers, the whole process just hangs.

4) It might be possible to change paramiko to use select() instead of
socket.read(n), and then it could implement a timeout. Though honestly
we probably don't want to timeout anywhere but in the test suite.


Does anyone have any suggestions?
John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFT5THJdeBCYSNAAMRAtK1AJ455ZETEKm1bZ9E2NoTva1pfSu91ACeOcGt
AsCk3Y5Z3NBLT6OWTnT+BPk=
=xm9T
-----END PGP SIGNATURE-----




More information about the bazaar mailing list