poll/sendmsg problem with 3.5.0-37-generic #58~precise1-Ubuntu
Sage Weil
sage at inktank.com
Tue Aug 13 04:34:56 UTC 2013
Hi,
A ceph user hit a problem with the 3.5 precise kernel with symptoms
exactly like an old poll(2) bug[1]. Basically, one end of a socket is
blocked on sendmsg(2), and the other end is blocked on poll(2) waiting for
data. 15 minutes later the poll(2) timeout triggers, we reset the
connection, and ceph recovers and continues. (For this user, the visible
ceph symptoms were stuck peering, stuck recovery, or hung requests that
*eventually* cleared themselves up.)
In this case, it doesn't look like the 3.5.0-37 kernel has the old
problematic patch (which first appeared in 3.6-rc1 and was fixed before
3.6 was released), but we see the exact same behavior (blocked writer,
blocked reader/poller, but netstat showing bytes available on the socket),
and upgrading the kernel to the current 3.8 precise package resolved the
problem. The 3.5 Ubuntu kernel does have a few sendmsg patches[2] that
(under the circumstances) appear suspicious.
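
For anyone who wants to check a live socket for the same symptom (bytes
queued for reading while poll(2) does not report the socket readable), here
is a minimal sketch. It is not part of ceph and not the reproducer for this
bug; the helper names pending_bytes and poll_disagrees_with_queue are
hypothetical, and it assumes Linux, where the FIONREAD ioctl reports the
number of unread bytes on a stream socket. The demo uses a local socketpair
rather than a real TCP connection, so on a healthy kernel the check should
simply report no disagreement.

```python
import fcntl
import select
import socket
import struct
import termios


def pending_bytes(sock):
    """Return the number of unread bytes queued on the socket (FIONREAD)."""
    buf = fcntl.ioctl(sock.fileno(), termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", buf)[0]


def poll_disagrees_with_queue(sock, timeout_ms=1000):
    """Return True if data was already queued but poll() did not report POLLIN.

    The queue is sampled before polling, so data that arrives during the
    poll cannot produce a false positive. This assumes no other thread or
    process is reading from the same socket at the same time.
    """
    queued = pending_bytes(sock)
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(timeout_ms)
    readable = any(ev & select.POLLIN for _, ev in events)
    return queued > 0 and not readable


if __name__ == "__main__":
    # Demo on a local socket pair: write on one end, check the other.
    a, b = socket.socketpair()
    a.sendall(b"ping")
    print("queued bytes:", pending_bytes(b))
    print("poll disagrees with queue:", poll_disagrees_with_queue(b))
    a.close()
    b.close()
```

A True result from a check like this would match what netstat showed in this
case (a non-empty receive queue with the reader still parked in poll), but
it says nothing about which kernel change is responsible.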
The one other detail in this case is that it seemed to only crop up on
connections involving one node in the system.
I'm not sure where to go from here, since the user is happy to now have a
working system, and I'm not sure if it is worth spending the time to
reproduce the issue. It might be simpler to just recommend users move off
the 3.5 kernel. In the meantime, though, I wanted to at least make
everyone aware of the (potential) problem.
sage
[1] http://marc.info/?l=ceph-devel&m=134540224811321&w=2
[2] https://launchpad.net/ubuntu/+source/linux-lts-quantal/3.5.0-37.58~precise1