[Bug 1819437] Re: transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4

Dan Hill 1819437 at bugs.launchpad.net
Thu Apr 16 00:31:21 UTC 2020


** Changed in: ceph (Ubuntu Eoan)
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1819437

Title:
  transient mon<->osd connectivity HEALTH_WARN events don't self clear
  in 13.2.4

Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Xenial:
  Invalid
Status in ceph source package in Bionic:
  In Progress
Status in ceph source package in Eoan:
  Fix Released
Status in ceph source package in Focal:
  Fix Released

Bug description:
  In a recently Juju-deployed Ceph 13.2.4 cluster (part of an OpenStack
  Rocky deployment) we experienced a non-clearing HEALTH_WARN event. It
  appeared to be associated with a short planned network outage, but it
  did not clear without human intervention:

      health: HEALTH_WARN
              6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.
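
A warning like the one above can be picked up by a monitoring script. The following is a minimal sketch that assumes the health line keeps the exact wording shown in this report; other Ceph releases may phrase it differently. The function name `parse_slow_ops` is hypothetical and not part of any Ceph tooling.

```python
import re

# Pattern for the health line quoted in this report. The wording is an
# assumption taken from the 13.2.4 output above and may differ elsewhere.
_SLOW_OPS_RE = re.compile(
    r"(\d+) slow ops, oldest one blocked for (\d+) sec, "
    r"daemons \[([^\]]*)\] have slow ops"
)


def parse_slow_ops(line):
    """Extract op count, oldest age in seconds, and daemon names.

    Returns None when the line is not a slow-ops warning of this form.
    """
    match = _SLOW_OPS_RE.search(line)
    if match is None:
        return None
    daemons = [d.strip() for d in match.group(3).split(",") if d.strip()]
    return {
        "count": int(match.group(1)),
        "oldest_sec": int(match.group(2)),
        "daemons": daemons,
    }


if __name__ == "__main__":
    line = ("6 slow ops, oldest one blocked for 112899 sec, "
            "daemons [mon.shinx,mon.sliggoo] have slow ops.")
    print(parse_slow_ops(line))
```

A caller could compare `oldest_sec` against a threshold of its own choosing to decide whether the warning has outlived any plausible transient event.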

  We can correlate this back to a known network event, but all OSDs are
  up and the cluster otherwise looks healthy:

  ubuntu@juju-df624b-4-lxd-14:~$ sudo ceph osd tree
  ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF 
   -1       7.64076 root default                             
  -13       0.90970     host happiny                         
    8   hdd 0.90970         osd.8        up  1.00000 1.00000 
   -5       0.90970     host jynx                            
    9   hdd 0.90970         osd.9        up  1.00000 1.00000 
   -3       1.63739     host piplup                          
    0   hdd 0.81870         osd.0        up  1.00000 1.00000 
    3   hdd 0.81870         osd.3        up  1.00000 1.00000 
   -9       1.63739     host raichu                          
    5   hdd 0.81870         osd.5        up  1.00000 1.00000 
    6   hdd 0.81870         osd.6        up  1.00000 1.00000 
  -11       0.90919     host shinx                           
    7   hdd 0.90919         osd.7        up  1.00000 1.00000 
   -7       1.63739     host sliggoo                         
    1   hdd 0.81870         osd.1        up  1.00000 1.00000 
    4   hdd 0.81870         osd.4        up  1.00000 1.00000 

  
  ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
  {
      "ops": [
          {
              "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
              "initiated_at": "2019-03-07 00:40:43.282823",
              "age": 113953.696205,
              "duration": 113953.696225,
              "type_data": {
                  "events": [
                      {
                          "time": "2019-03-07 00:40:43.282823",
                          "event": "initiated"
                      },
                      {
                          "time": "2019-03-07 00:40:43.282823",
                          "event": "header_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "throttled"
                      },
                      {
                          "time": "0.000000",
                          "event": "all_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "dispatched"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283360",
                          "event": "mon:_ms_dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283360",
                          "event": "mon:dispatch_op"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283360",
                          "event": "psvc:dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283370",
                          "event": "osdmap:preprocess_query"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283371",
                          "event": "osdmap:preprocess_failure"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283386",
                          "event": "osdmap:prepare_update"
                      },
                      {
                          "time": "2019-03-07 00:40:43.283386",
                          "event": "osdmap:prepare_failure"
                      }
                  ],
                  "info": {
                      "seq": 48576937,
                      "src_is_mon": false,
                      "source": "osd.8 10.48.2.206:6800/1226277",
                      "forwarded_to_leader": false
                  }
              }
          },
          {
              "description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
              "initiated_at": "2019-03-07 00:40:43.282997",
              "age": 113953.696032,
              "duration": 113953.696127,
              "type_data": {
                  "events": [
                      {
                          "time": "2019-03-07 00:40:43.282997",
                          "event": "initiated"
                      },
                      {
                          "time": "2019-03-07 00:40:43.282997",
                          "event": "header_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "throttled"
                      },
                      {
                          "time": "0.000000",
                          "event": "all_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "dispatched"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284394",
                          "event": "mon:_ms_dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284395",
                          "event": "mon:dispatch_op"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284395",
                          "event": "psvc:dispatch"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284402",
                          "event": "osdmap:preprocess_query"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284403",
                          "event": "osdmap:preprocess_failure"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284416",
                          "event": "osdmap:prepare_update"
                      },
                      {
                          "time": "2019-03-07 00:40:43.284417",
                          "event": "osdmap:prepare_failure"
                      }
                  ],
                  "info": {
                      "seq": 48576958,
                      "src_is_mon": false,
                      "source": "osd.8 10.48.2.206:6800/1226277",
                      "forwarded_to_leader": false
                  }
              }
          },
          {
              "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
              "initiated_at": "2019-03-07 00:41:08.839840",
              "age": 113928.139188,
              "duration": 113928.139359,
              "type_data": {
                  "events": [
                      {
                          "time": "2019-03-07 00:41:08.839840",
                          "event": "initiated"
                      },
                      {
                          "time": "2019-03-07 00:41:08.839840",
                          "event": "header_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "throttled"
                      },
                      {
                          "time": "0.000000",
                          "event": "all_read"
                      },
                      {
                          "time": "0.000000",
                          "event": "dispatched"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840040",
                          "event": "mon:_ms_dispatch"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840040",
                          "event": "mon:dispatch_op"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840040",
                          "event": "psvc:dispatch"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840058",
                          "event": "osdmap:preprocess_query"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840060",
                          "event": "osdmap:preprocess_failure"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840080",
                          "event": "osdmap:prepare_update"
                      },
                      {
                          "time": "2019-03-07 00:41:08.840081",
                          "event": "osdmap:prepare_failure"
                      }
                  ],
                  "info": {
                      "seq": 48578207,
                      "src_is_mon": false,
                      "source": "osd.6 10.48.2.161:6800/499396",
                      "forwarded_to_leader": false
                  }
              }
          }
      ],
      "num_ops": 3
  }
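
The dump above can be summarised programmatically. The sketch below assumes the JSON layout matches the output shown in this report ("ops", "description", "age", "type_data.info.source"); the function name `stale_osd_failures` and the one-hour default threshold are hypothetical choices for illustration, not part of Ceph.

```python
import json
import re

# Matches the target OSD in descriptions such as
# "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)"
_FAILURE_RE = re.compile(r"^osd_failure\(failed \S+ (osd\.\d+) ")


def stale_osd_failures(ops_json, min_age_sec=3600.0):
    """Return (target_osd, reporter, age) for old osd_failure ops.

    ops_json is the text printed by `ceph daemon mon.<id> ops`.
    Ops younger than min_age_sec, or of another type, are skipped.
    """
    dump = json.loads(ops_json)
    stale = []
    for op in dump.get("ops", []):
        match = _FAILURE_RE.match(op.get("description", ""))
        if match is None or op.get("age", 0.0) < min_age_sec:
            continue
        source = op.get("type_data", {}).get("info", {}).get("source", "")
        # The source field looks like "osd.8 10.48.2.206:6800/1226277".
        reporter = source.split(" ", 1)[0] if source else "unknown"
        stale.append((match.group(1), reporter, op["age"]))
    return stale


if __name__ == "__main__":
    import sys
    # Example: sudo ceph daemon mon.shinx ops | python3 this_script.py
    for target, reporter, age in stale_osd_failures(sys.stdin.read()):
        print(f"{target} reported failed by {reporter}, pending {age:.0f} sec")
```

Applied to the dump in this report, every entry is an osd_failure report against an OSD that `ceph osd tree` shows as up, which is consistent with the failure reports never having been cleared after the network event.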

  
  This looks remarkably like:

  https://tracker.ceph.com/issues/24531

  I restarted the two affected mons in turn; the cluster returned to
  HEALTH_OK and the issue did not recur.

  Expected behaviour: ceph health should recover from a temporary
  network event without user interaction.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1819437/+subscriptions


