[Bug 1798081] Re: ceph got slow request because of primary osd drop messages
Brian Murray
brian at ubuntu.com
Wed Nov 28 15:36:24 UTC 2018
Hello dongdong, or anyone else affected,
Accepted ceph into trusty-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/ceph/0.80.11-0ubuntu1.14.04.4 in a
few hours, and then in the -proposed repository.
Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us getting this
update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested and change the tag from
verification-needed-trusty to verification-done-trusty. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-trusty. In either case, without details of
your testing we will not be able to proceed.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance for helping!
N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.
** Changed in: ceph (Ubuntu Trusty)
Status: Incomplete => Fix Committed
** Tags added: verification-needed verification-needed-trusty
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1798081
Title:
ceph got slow request because of primary osd drop messages
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Trusty:
Fix Committed
Bug description:
[Impact]
Ceph from ubuntu trusty, the version is 0.80.*.
The bug is that when a message seq number has exceeds the max value of unsigned 32 bit which is 4294967295, the unsigned 64 bit seq number will be truncated to unsigned 32 bit. But the seq number is supposed to increase one by one, and if messenger got a message which has an old seq number, it will just drop it.
when it dropped one sub_op/sub_op reply message, then this operation will not be able to succeed, it will become a slow request.
cluster will suffering from slow requests when it hits this bug.
when it drops other message, might lead to other bad situation ( like droping osd_ping message might cause the osd down).
[Test case]
Deploy a trusty ceph cluster
Two ways to reproduce, A will take at least 2 months to reproduce, B will be much easier:
A.
1. ceph daemon osd.<id> set config debug_ms 10
2. wait for the seq number to increasing to max value of unsigned 32 bit value which is 4294967295, we can get the current seq number by doing
command "grep "reader got message" /var/log/ceph/ceph-osd/ceph-osd.id.log"
3. you will finally see the messenger will drop message and see below log
"reader got old message 0 <= 4294967295 0x7fa6854a3e00 osd_sub_op_reply(client.125216.1:7850657018 3.dc e24410dc/rb.0.103e.238e1f29.000000000027/head//3 [] ondisk, result = 0) v2, discarding "
B.
1. ceph daemon osd.<id> set config debug_ms 10
2. gdb attach <pid of ceph-osd>
3. set a breakpoint at function Pipe::writer and set a command "set out_seq=4294967295" to this breakpoint
4. continue the process, when it hit the breakpoint multiple times (osd number - 1), we can exit the gdb.
5. grep other osds log and see if you can find message dropping log like :
"reader got old message 0 <= 4294967295 0x7fa6854a3e00 osd_sub_op_reply(client.125216.1:7850657018 3.dc e24410dc/rb.0.103e.238e1f29.000000000027/head//3 [] ondisk, result = 0) v2, discarding "
[Regression Potential]
After this patch is applied, When we upgrade to a higher version which does not contain this patch, if the seq number has already exceeded 4294967295. it might drop messages very quickly
[Original Bug Report]
When in_seq is 0xffffffff, Pipe::reader expects the incoming message's sequence to be 0x100000000. But m->get_seq() return 0. So the incoming message gets dropped.
https://tracker.ceph.com/issues/16122
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1798081/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list