[Bug 1819437] Re: transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4

Dan Hill 1819437 at bugs.launchpad.net
Thu Feb 13 20:03:56 UTC 2020


This issue has been resolved upstream:
pr#30519 in 12.2.13
pr#30481 in 13.2.7
pr#30480 in 14.2.5

The mimic fix has been released, but be advised that upgrading from
13.2.6 -> 13.2.7 may cause OSD crashes [0]. We will be updating our
packaging to 13.2.8 to address this issue.
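
As a rough aid for checking whether an installed version already carries
the fix, here is a minimal sketch in Python. The helper names are
hypothetical, and the minimum versions are taken only from the PR list
above. It does not encode the 13.2.6 -> 13.2.7 upgrade caveat, and it
deliberately refuses to guess for release series not listed in this message.

```python
# Hypothetical helper; the minimum versions come from the PR list in this
# message (12.2.13, 13.2.7, 14.2.5), keyed by major release series.
FIXED_IN = {12: (12, 2, 13), 13: (13, 2, 7), 14: (14, 2, 5)}


def parse_version(text):
    """Turn a dotted version string such as '13.2.4' into (13, 2, 4)."""
    return tuple(int(part) for part in text.strip().split("."))


def has_slow_ops_fix(version):
    """Return True if `version` is at or past the first fixed point
    release of its series, according to FIXED_IN."""
    parsed = parse_version(version)
    fixed = FIXED_IN.get(parsed[0])
    if fixed is None:
        # Assumption: series not named in this message are out of scope.
        raise ValueError(f"no fix information for release series {parsed[0]}")
    return parsed >= fixed
```

For example, has_slow_ops_fix("13.2.4") is False for the version in the
bug title, while has_slow_ops_fix("13.2.7") is True.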

The 12.2.13 and 14.2.7 point releases landed upstream last week. We are
working on stable release updates (SRUs) for these packages. You can
follow and contribute to the SRU progress at [1] and [2], respectively.

[0] https://tracker.ceph.com/issues/43106
[1] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861793
[2] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861789


** Bug watch added: tracker.ceph.com/issues #43106
   http://tracker.ceph.com/issues/43106

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1819437

Title:
  transient mon<->osd connectivity HEALTH_WARN events don't self clear
  in 13.2.4

Status in ceph package in Ubuntu:
  New

Bug description:
  In a recently Juju-deployed 13.2.4 ceph cluster (part of an
  OpenStack Rocky deployment) we hit a HEALTH_WARN event that appeared
  to be associated with a short planned network outage, but did not
  clear without human intervention:

      health: HEALTH_WARN
              6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.

  We can correlate this back to a known network event, but all OSDs are
  up and the cluster otherwise looks healthy:

  ubuntu@juju-df624b-4-lxd-14:~$ sudo ceph osd tree
  ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF 
   -1       7.64076 root default                             
  -13       0.90970     host happiny                         
    8   hdd 0.90970         osd.8        up  1.00000 1.00000 
   -5       0.90970     host jynx                            
    9   hdd 0.90970         osd.9        up  1.00000 1.00000 
   -3       1.63739     host piplup                          
    0   hdd 0.81870         osd.0        up  1.00000 1.00000 
    3   hdd 0.81870         osd.3        up  1.00000 1.00000 
   -9       1.63739     host raichu                          
    5   hdd 0.81870         osd.5        up  1.00000 1.00000 
    6   hdd 0.81870         osd.6        up  1.00000 1.00000 
  -11       0.90919     host shinx                           
    7   hdd 0.90919         osd.7        up  1.00000 1.00000 
   -7       1.63739     host sliggoo                         
    1   hdd 0.81870         osd.1        up  1.00000 1.00000 
    4   hdd 0.81870         osd.4        up  1.00000 1.00000 

  
  ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
  {
      "ops": [
          {
              "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
              "initiated_at": "2019-03-07 00:40:43.282823",
              "age": 113953.696205,
              "duration": 113953.696225,
              "type_data": {
                  "events": [
                      {
                          "time": "2019-03-07 00:40:43.282823",
                          "event": "initiated"
                      },
                      {
                          "time": "2019-03-07 00:40:43.282823",
                          "event": "header_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "throttled"
                      },
                      {
                          "time": "0.000000",
                          "event": "all_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "dispatched"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283360",
                          "event": "mon:_ms_dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283360",
                          "event": "mon:dispatch_op"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283360",
                          "event": "psvc:dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283370",
                          "event": "osdmap:preprocess_query"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283371",
                          "event": "osdmap:preprocess_failure"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283386",
                          "event": "osdmap:prepare_update"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283386",
                          "event": "osdmap:prepare_failure"
                      }
                  ],
                  "info": {
                      "seq": 48576937,
                      "src_is_mon": false,
                      "source": "osd.8 10.48.2.206:6800/1226277",
                      "forwarded_to_leader": false
                  }
              }
          },
          {
              "description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
              "initiated_at": "2019-03-07 00:40:43.282997",
              "age": 113953.696032,
              "duration": 113953.696127,
              "type_data": {
                  "events": [
                      {
                          "time": "2019-03-07 00:40:43.282997",
                          "event": "initiated"
                      },
                      {
                          "time": "2019-03-07 00:40:43.282997",
                          "event": "header_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "throttled"
                      },
                      {
                          "time": "0.000000",
                          "event": "all_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "dispatched"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284394",
                          "event": "mon:_ms_dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284395",
                          "event": "mon:dispatch_op"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284395",
                          "event": "psvc:dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284402",
                          "event": "osdmap:preprocess_query"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284403",
                          "event": "osdmap:preprocess_failure"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284416",
                          "event": "osdmap:prepare_update"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284417",
                          "event": "osdmap:prepare_failure"
                      }
                  ],
                  "info": {
                      "seq": 48576958,
                      "src_is_mon": false,
                      "source": "osd.8 10.48.2.206:6800/1226277",
                      "forwarded_to_leader": false
                  }
              }
          },
          {
              "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
              "initiated_at": "2019-03-07 00:41:08.839840",
              "age": 113928.139188,
              "duration": 113928.139359,
              "type_data": {
                  "events": [
                      {
                          "time": "2019-03-07 00:41:08.839840",
                          "event": "initiated"
                      },
                      {
                          "time": "2019-03-07 00:41:08.839840",
                          "event": "header_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "throttled"
                      },
                      {
                          "time": "0.000000",
                          "event": "all_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "dispatched"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840040",
                          "event": "mon:_ms_dispatch"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840040",
                          "event": "mon:dispatch_op"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840040",
                          "event": "psvc:dispatch"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840058",
                          "event": "osdmap:preprocess_query"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840060",
                          "event": "osdmap:preprocess_failure"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840080",
                          "event": "osdmap:prepare_update"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840081",
                          "event": "osdmap:prepare_failure"
                      }
                  ],
                  "info": {
                      "seq": 48578207,
                      "src_is_mon": false,
                      "source": "osd.6 10.48.2.161:6800/499396",
                      "forwarded_to_leader": false
                  }
              }
          }
      ],
      "num_ops": 3
  }

  
  This looks remarkably like:

  https://tracker.ceph.com/issues/24531

  I restarted the two affected mons in turn; the cluster returned to
  HEALTH_OK and the issue did not recur.

  Expected behaviour: ceph health should recover from a temporary
  network event without user interaction.
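
The `ceph daemon mon.<id> ops` dump in the bug description can be
scanned mechanically for this symptom. The following is a minimal
sketch, assuming the JSON layout shown in the dump; the function name
and the one-hour age threshold are illustrative assumptions, not part
of Ceph.

```python
import json


def stuck_osd_failure_ops(ops_json, min_age_sec=3600.0):
    """Return (description, age, last_event) for each osd_failure op
    that has been in flight for at least `min_age_sec` seconds.

    `ops_json` is the text printed by `ceph daemon mon.<id> ops`.
    The threshold is an arbitrary assumption for illustration.
    """
    data = json.loads(ops_json)
    stuck = []
    for op in data.get("ops", []):
        desc = op.get("description", "")
        age = float(op.get("age", 0.0))
        if not desc.startswith("osd_failure(") or age < min_age_sec:
            continue
        # The last recorded event shows where the op stopped progressing.
        events = op.get("type_data", {}).get("events", [])
        last_event = events[-1]["event"] if events else None
        stuck.append((desc, age, last_event))
    return stuck
```

Fed the dump above, this would report all three ops, each with
"osdmap:prepare_failure" as its last event and an age of over 113000
seconds.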
