[bug] sftp cleanup failures

Robey Pointer robey at lag.net
Mon Nov 13 04:27:48 GMT 2006


On 6 Nov 2006, at 12:02, John Arbash Meinel wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> So I've been digging into the sftp cleanup bugs a little bit  
> deeper, and
> this is what I think I've found.

Your description sounds right to me; i'm trimming the quote to make  
my response shorter.


> When it gets a connection, SocketListener then spawns *another* thread
> to process _run_server_entry. It says the following:
>
>   # because the loopback socket is inline, and transports are
>   # never explicitly closed, best to launch a new thread.
>   thread = threading.Thread(target=self._callback, args=(s,)).start()
>
> And now we have 1 thread which is spinning waiting for connections,  
> and
> another thread dedicated to a single connection.

The comment is a little terse.  A longer explanation:

bzrlib transport objects don't have a way to be "closed"; they're  
just collected and del'd by the GC automatically when they're  
discarded.  So the server thread could be running for a long time  
after the test is done.  That might be okay, but some tests connect  
multiple times.  So this way, the old server thread can linger around  
while the listener is free to accept another connection.


> So I tried playing with some things, and this is what I found:
>
> 1) You can make the SocketListener thread be non-daemon, because
> TestCase is good to call server.tearDown() which sets the _stop_event,
> which tells SocketListener to close down.

I just tried that and can confirm it.


> 2) Doing the above causes 'bzr selftest' to hang after cleaning up.  
> This
> is because there are 2 _run_server_entry threads that are hung.  
> Setting
> them to daemon mode causes the test suite to finish, but gives the  
> ugly
> tracebacks about failing while cleaning up. (Because they don't go to
> finally until the python interpreter is shutting down).
>
> 3) The problem in (2) is because the sftp subsystem is making a  
> direct x
> = self.sock.recv(n) call in a loop which blocks until 'n'  
> characters are
> read. And in the specific tests which are causing problems, we never
> actually connect to sftp, because we expect it to fail. The a couple
> specific tests that fail are:
> ...ndleFromURL.test_read_bundle_from_url(SFTPAbsoluteServer)
> ....TestReadBundleFromURL.test_read_fail(SFTPAbsoluteServer)
>
> I haven't been able to figure out why these tests are dying early. The
> 'test_read_fail' makes sense, since it probably isn't requesting
> anything, but the 'test_read_bundle_from_url()' should be making a  
> full
> request, and reading back all of the data.

The sftp module assumes it's talking to a paramiko Channel object,  
which has its own timeout behavior.  It looks like even if you close 
() the socket from another thread, the recv() will never return.

I can actually notice, from within the sftp module, if the given  
channel is actually a real socket, and use select() on it to give  
python a chance to notice that the file has gone away.  I think  
that's probably fair, so I'll check that in.


> 4) I tried to change the socket to have a timeout, so that we could  
> have
> the SFTP subprocess stop trying. The problem is you can't use
> 'makefile()' on a socket with a timeout according to the python
> documentation:
> http://www.python.org/doc/current/lib/socket-objects.html
>   "A consequence of this is that file objects returned by the  
> makefile()
>    method must only be used when the socket is in blocking mode; in
>    timeout or non-blocking mode file operations that cannot be  
> completed
>    immediately will fail"
>
> So at this point I would recommend:
>
> 1) Change sftp so SocketListener runs in non-daemon mode, but the
> _run_server_entry is run in daemon mode. It doesn't change much,  
> but it
> seems cleaner. (By my present count, we have approx 6 threads that are
> hanging)
>
> 2) Try to figure out why certain tests are creating a SFTP connection,
> but causing the SFTP server to hang waiting for a response.

I think this may be moot (see below).  It should be okay for a unit  
test to ask for an SFTP transport and then not actually use it.  For  
example, the test might fail for other reasons before it uses the  
transport.


> 3) I'd *really* like to change the socket to be non-blocking, but we
> would need to figure out what code paths expect file-like objects. And
> I've seen that bzrlib.transport.ssh uses makefile() to create the
> read-and-write handles for SSH connections. (And it is used by the  
> smart
> server).
>
> That seems to be the only place that calls 'makefile()' directly,  
> though
> I can say that other tests start to fail. Specifically:
> ...mentations.TransportTests.test_append(SFTPAbsoluteServer)
>
> Instead of cleaning up and getting the _stop_event() flag being  
> set, it
> just spins, waiting until I set up another timeout. And when that
> triggers, the whole process just hangs.
>
> 4) It might be possible to change paramiko to use select() instead of
> socket.read(n), and then it could implement a timeout. Though honestly
> we probably don't want to timeout anywhere but in the test suite.

I did a small patch that has the SocketListener track all the sockets  
it accepts, and then closes them when asked to shutdown.  Combined  
with the paramiko patch to use select() when given a socket, this  
lets the sftp server threads die when their sockets are closed,  
whether or not they were used -- and without setting a timeout.

Since my patch probably overlaps yours, I'll attach it here just as  
an example of what I did.

robey
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sftp-patch.txt
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20061112/d1b8c20e/attachment.txt 
-------------- next part --------------



More information about the bazaar mailing list