[Bug 1819437] Re: transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4
Dan Hill
1819437 at bugs.launchpad.net
Thu Feb 13 20:03:56 UTC 2020
This issue has been resolved upstream:
pr#30519 in 12.2.13
pr#30481 in 13.2.7
pr#30480 in 14.2.5
The mimic fix has been released, but be advised that upgrading from
13.2.6 -> 13.2.7 may cause OSD crashes [0]. We will be updating our
packaging to 13.2.8 to address this issue.
The 12.2.13 and 14.2.7 point releases landed upstream last week. We are
working on stable release updates (SRUs) for these packages. You can
follow and contribute to the SRU progress at [1], and [2] respectively.
[0] https://tracker.ceph.com/issues/43106
[1] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861793
[2] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861789
** Bug watch added: tracker.ceph.com/issues #43106
http://tracker.ceph.com/issues/43106
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1819437
Title:
transient mon<->osd connectivity HEALTH_WARN events don't self clear
in 13.2.4
Status in ceph package in Ubuntu:
New
Bug description:
In a recently juju-deployed 13.2.4 ceph cluster (part of an OpenStack
Rocky deploy) we experienced a non-clearing HEALTH_WARN event. It
appeared to be associated with a short, planned network outage, but it
did not clear without human intervention:
health: HEALTH_WARN
6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.
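For scale, the "blocked for 112899 sec" in that warning is roughly 31 hours. A quick conversion (plain Python, nothing ceph-specific):

```python
from datetime import timedelta

# "oldest one blocked for 112899 sec" from the HEALTH_WARN above:
blocked = timedelta(seconds=112899)
print(blocked)  # 1 day, 7:21:39 -- the warning persisted for ~31 hours
```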
We can correlate this back to a known network event, but all OSDs are
up and the cluster otherwise looks healthy:
ubuntu at juju-df624b-4-lxd-14:~$ sudo ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 7.64076 root default
-13 0.90970 host happiny
8 hdd 0.90970 osd.8 up 1.00000 1.00000
-5 0.90970 host jynx
9 hdd 0.90970 osd.9 up 1.00000 1.00000
-3 1.63739 host piplup
0 hdd 0.81870 osd.0 up 1.00000 1.00000
3 hdd 0.81870 osd.3 up 1.00000 1.00000
-9 1.63739 host raichu
5 hdd 0.81870 osd.5 up 1.00000 1.00000
6 hdd 0.81870 osd.6 up 1.00000 1.00000
-11 0.90919 host shinx
7 hdd 0.90919 osd.7 up 1.00000 1.00000
-7 1.63739 host sliggoo
1 hdd 0.81870 osd.1 up 1.00000 1.00000
4 hdd 0.81870 osd.4 up 1.00000 1.00000
ubuntu at shinx:~$ sudo ceph daemon mon.shinx ops
{
"ops": [
{
"description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
"initiated_at": "2019-03-07 00:40:43.282823",
"age": 113953.696205,
"duration": 113953.696225,
"type_data": {
"events": [
{
"time": "2019-03-07 00:40:43.282823",
"event": "initiated"
},
{
"time": "2019-03-07 00:40:43.282823",
"event": "header_read"
},
{
"time": "0.000000",
"event": "throttled"
},
{
"time": "0.000000",
"event": "all_read"
},
{
"time": "0.000000",
"event": "dispatched"
},
{
"time": "2019-03-07 00:40:43.283360",
"event": "mon:_ms_dispatch"
},
{
"time": "2019-03-07 00:40:43.283360",
"event": "mon:dispatch_op"
},
{
"time": "2019-03-07 00:40:43.283360",
"event": "psvc:dispatch"
},
{
"time": "2019-03-07 00:40:43.283370",
"event": "osdmap:preprocess_query"
},
{
"time": "2019-03-07 00:40:43.283371",
"event": "osdmap:preprocess_failure"
},
{
"time": "2019-03-07 00:40:43.283386",
"event": "osdmap:prepare_update"
},
{
"time": "2019-03-07 00:40:43.283386",
"event": "osdmap:prepare_failure"
}
],
"info": {
"seq": 48576937,
"src_is_mon": false,
"source": "osd.8 10.48.2.206:6800/1226277",
"forwarded_to_leader": false
}
}
},
{
"description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
"initiated_at": "2019-03-07 00:40:43.282997",
"age": 113953.696032,
"duration": 113953.696127,
"type_data": {
"events": [
{
"time": "2019-03-07 00:40:43.282997",
"event": "initiated"
},
{
"time": "2019-03-07 00:40:43.282997",
"event": "header_read"
},
{
"time": "0.000000",
"event": "throttled"
},
{
"time": "0.000000",
"event": "all_read"
},
{
"time": "0.000000",
"event": "dispatched"
},
{
"time": "2019-03-07 00:40:43.284394",
"event": "mon:_ms_dispatch"
},
{
"time": "2019-03-07 00:40:43.284395",
"event": "mon:dispatch_op"
},
{
"time": "2019-03-07 00:40:43.284395",
"event": "psvc:dispatch"
},
{
"time": "2019-03-07 00:40:43.284402",
"event": "osdmap:preprocess_query"
},
{
"time": "2019-03-07 00:40:43.284403",
"event": "osdmap:preprocess_failure"
},
{
"time": "2019-03-07 00:40:43.284416",
"event": "osdmap:prepare_update"
},
{
"time": "2019-03-07 00:40:43.284417",
"event": "osdmap:prepare_failure"
}
],
"info": {
"seq": 48576958,
"src_is_mon": false,
"source": "osd.8 10.48.2.206:6800/1226277",
"forwarded_to_leader": false
}
}
},
{
"description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
"initiated_at": "2019-03-07 00:41:08.839840",
"age": 113928.139188,
"duration": 113928.139359,
"type_data": {
"events": [
{
"time": "2019-03-07 00:41:08.839840",
"event": "initiated"
},
{
"time": "2019-03-07 00:41:08.839840",
"event": "header_read"
},
{
"time": "0.000000",
"event": "throttled"
},
{
"time": "0.000000",
"event": "all_read"
},
{
"time": "0.000000",
"event": "dispatched"
},
{
"time": "2019-03-07 00:41:08.840040",
"event": "mon:_ms_dispatch"
},
{
"time": "2019-03-07 00:41:08.840040",
"event": "mon:dispatch_op"
},
{
"time": "2019-03-07 00:41:08.840040",
"event": "psvc:dispatch"
},
{
"time": "2019-03-07 00:41:08.840058",
"event": "osdmap:preprocess_query"
},
{
"time": "2019-03-07 00:41:08.840060",
"event": "osdmap:preprocess_failure"
},
{
"time": "2019-03-07 00:41:08.840080",
"event": "osdmap:prepare_update"
},
{
"time": "2019-03-07 00:41:08.840081",
"event": "osdmap:prepare_failure"
}
],
"info": {
"seq": 48578207,
"src_is_mon": false,
"source": "osd.6 10.48.2.161:6800/499396",
"forwarded_to_leader": false
}
}
}
],
"num_ops": 3
}
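For anyone triaging a similar report, the stuck ops can be pulled out of the `ceph daemon mon.<id> ops` JSON programmatically rather than by eye. A minimal sketch, assuming the output shape shown above; the `ops_json` sample below is abridged from it, and the 600-second threshold and the `stuck_osd_failure_ops` helper name are arbitrary choices for illustration, not ceph defaults:

```python
import json

# Abridged sample of `ceph daemon mon.<id> ops` output (same shape as above).
ops_json = """
{
  "ops": [
    {
      "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
      "initiated_at": "2019-03-07 00:40:43.282823",
      "age": 113953.696205,
      "duration": 113953.696225
    },
    {
      "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
      "initiated_at": "2019-03-07 00:41:08.839840",
      "age": 113928.139188,
      "duration": 113928.139359
    }
  ],
  "num_ops": 2
}
"""

def stuck_osd_failure_ops(raw, max_age_sec=600):
    """Return (description, age) for osd_failure ops older than max_age_sec."""
    doc = json.loads(raw)
    return [(op["description"], op["age"])
            for op in doc["ops"]
            if op["description"].startswith("osd_failure")
            and op["age"] > max_age_sec]

for desc, age in stuck_osd_failure_ops(ops_json):
    print("%.0f sec: %s" % (age, desc))
```

In the buggy state described here, every osd_failure op reported by the affected mons had an age far above any reasonable threshold, which is what this filter surfaces.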
This looks remarkably like:
https://tracker.ceph.com/issues/24531
I restarted the two affected mons in turn; health returned to
HEALTH_OK and the issue did not recur.
Expected behaviour: ceph health should recover from a temporary
network event without user interaction.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1819437/+subscriptions