[Bug 1819437] [NEW] transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4

Gareth Woolridge gareth.woolridge at canonical.com
Mon Mar 11 10:26:55 UTC 2019


Public bug reported:

In a recently juju-deployed 13.2.4 ceph cluster (as part of an OpenStack
Rocky deploy) we experienced a non-clearing HEALTH_WARN event that
appeared to be associated with a short planned network outage, but did
not clear without human intervention:

    health: HEALTH_WARN
            6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.
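
To make the "blocked for" figure easier to compare against the time of the
network event, the warning line can be parsed into its parts. This is only a
minimal sketch: the regular expression is written against the exact wording
shown above (other Ceph releases may phrase the warning differently), and the
names SLOW_OPS_RE and parse_slow_ops are hypothetical helpers, not part of
Ceph.

```python
import re
from datetime import timedelta

# Assumption: pattern matches the wording of the 13.2.4 warning quoted above.
SLOW_OPS_RE = re.compile(
    r"(?P<count>\d+) slow ops, oldest one blocked for (?P<secs>\d+) sec, "
    r"daemons \[(?P<daemons>[^\]]*)\] have slow ops"
)


def parse_slow_ops(line):
    """Return count, blocked duration and daemon names, or None if no match."""
    m = SLOW_OPS_RE.search(line)
    if m is None:
        return None
    return {
        "count": int(m.group("count")),
        "blocked_for": timedelta(seconds=int(m.group("secs"))),
        "daemons": [d.strip() for d in m.group("daemons").split(",") if d.strip()],
    }


if __name__ == "__main__":
    line = ("6 slow ops, oldest one blocked for 112899 sec, "
            "daemons [mon.shinx,mon.sliggoo] have slow ops.")
    info = parse_slow_ops(line)
    print(info["count"], info["blocked_for"], info["daemons"])
```

For the line above, 112899 sec works out to 31 hours, 21 minutes and 39
seconds.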

We can correlate this back to a known network event, but all OSDs are up
and the cluster otherwise looks healthy:

ubuntu@juju-df624b-4-lxd-14:~$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1       7.64076 root default                             
-13       0.90970     host happiny                         
  8   hdd 0.90970         osd.8        up  1.00000 1.00000 
 -5       0.90970     host jynx                            
  9   hdd 0.90970         osd.9        up  1.00000 1.00000 
 -3       1.63739     host piplup                          
  0   hdd 0.81870         osd.0        up  1.00000 1.00000 
  3   hdd 0.81870         osd.3        up  1.00000 1.00000 
 -9       1.63739     host raichu                          
  5   hdd 0.81870         osd.5        up  1.00000 1.00000 
  6   hdd 0.81870         osd.6        up  1.00000 1.00000 
-11       0.90919     host shinx                           
  7   hdd 0.90919         osd.7        up  1.00000 1.00000 
 -7       1.63739     host sliggoo                         
  1   hdd 0.81870         osd.1        up  1.00000 1.00000 
  4   hdd 0.81870         osd.4        up  1.00000 1.00000 


ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282823",
            "age": 113953.696205,
            "duration": 113953.696225,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:40:43.282823",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:40:43.282823",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283370",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283371",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283386",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283386",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48576937,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282997",
            "age": 113953.696032,
            "duration": 113953.696127,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:40:43.282997",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:40:43.282997",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284394",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284395",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284395",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284402",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284403",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284416",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284417",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48576958,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
            "initiated_at": "2019-03-07 00:41:08.839840",
            "age": 113928.139188,
            "duration": 113928.139359,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:41:08.839840",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:41:08.839840",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840058",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840060",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840080",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840081",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48578207,
                    "src_is_mon": false,
                    "source": "osd.6 10.48.2.161:6800/499396",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 3
}
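
A dump like the one above is long, so it can help to reduce each op to its
description, age and last recorded event. The following is a rough sketch
that reads only the fields visible in the output above ("ops", "description",
"age", "type_data" / "events"); the function name summarize_slow_ops and the
30 second threshold are assumptions made for illustration, not Ceph
defaults.

```python
import json


def summarize_slow_ops(ops_json, min_age_sec=30.0):
    """Return (description, age, last_event) tuples for ops older than
    min_age_sec, oldest first. Expects the JSON text printed by
    `ceph daemon mon.<id> ops`."""
    data = json.loads(ops_json)
    stuck = []
    for op in data.get("ops", []):
        age = float(op.get("age", 0.0))
        if age < min_age_sec:
            continue
        events = op.get("type_data", {}).get("events", [])
        # The last event shows where the op stopped making progress.
        last_event = events[-1]["event"] if events else None
        stuck.append((op.get("description", ""), age, last_event))
    return sorted(stuck, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    import sys
    # Example: sudo ceph daemon mon.shinx ops | python3 this_script.py
    for desc, age, last in summarize_slow_ops(sys.stdin.read()):
        print(f"{age:12.0f}s  last={last}  {desc}")
```

Applied to the dump above, all three ops would be reported with
"osdmap:prepare_failure" as their last event.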


This looks remarkably like:

https://tracker.ceph.com/issues/24531

I restarted the 2 affected mons in turn; the cluster returned to
HEALTH_OK and the issue did not reoccur.

Expected behaviour: ceph health should recover from a temporary network
event without user interaction.

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1819437
