[Bug 1891567] Re: [SRU] ceph_osd crash in _committed_osd_maps when failed to encode first inc map

Ponnuvel Palaniyappan 1891567 at bugs.launchpad.net
Thu Aug 27 21:21:10 UTC 2020


I have tested the ussuri-proposed packages and they fix the issue.

Set up a Nautilus cluster with the following versions:

# ceph versions
{
    "mon": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 1
    },
    "mgr": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 3
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 5
    }
}

# dpkg -l | grep -i ceph
ii  ceph                             14.2.9-0ubuntu0.19.10.1~cloud0              amd64        distributed storage and file system
ii  ceph-base                        14.2.9-0ubuntu0.19.10.1~cloud0              amd64        common ceph daemon libraries and management tools
ii  ceph-common                      14.2.9-0ubuntu0.19.10.1~cloud0              amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-mgr                         14.2.9-0ubuntu0.19.10.1~cloud0              amd64        manager for the ceph distributed file system
ii  ceph-mon                         14.2.9-0ubuntu0.19.10.1~cloud0              amd64        monitor server for the ceph storage system
ii  ceph-osd                         14.2.9-0ubuntu0.19.10.1~cloud0              amd64        OSD server for the ceph storage system
ii  libcephfs2                       14.2.9-0ubuntu0.19.10.1~cloud0              amd64        Ceph distributed file system client library
ii  python3-ceph-argparse            14.2.9-0ubuntu0.19.10.1~cloud0              amd64        Python 3 utility libraries for Ceph CLI
ii  python3-cephfs                   14.2.9-0ubuntu0.19.10.1~cloud0              amd64        Python 3 libraries for the Ceph libcephfs library
ii  python3-rados                    14.2.9-0ubuntu0.19.10.1~cloud0              amd64        Python 3 libraries for the Ceph librados library

Then upgraded the cluster to:

# ceph versions
{
    "mon": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 1
    },
    "mgr": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 1
    },
    "osd": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 3
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 5
    }
}

# dpkg -l | grep -i ceph
ii  ceph                             15.2.3-0ubuntu0.20.04.2~cloud0              amd64        distributed storage and file system
ii  ceph-base                        15.2.3-0ubuntu0.20.04.2~cloud0              amd64        common ceph daemon libraries and management tools
ii  ceph-common                      15.2.3-0ubuntu0.20.04.2~cloud0              amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-mds                         15.2.3-0ubuntu0.20.04.2~cloud0              amd64        metadata server for the ceph distributed file system
ii  ceph-mgr                         15.2.3-0ubuntu0.20.04.2~cloud0              amd64        manager for the ceph distributed file system
ii  ceph-mgr-modules-core            15.2.3-0ubuntu0.20.04.2~cloud0              all          ceph manager modules which are always enabled
ii  ceph-mon                         15.2.3-0ubuntu0.20.04.2~cloud0              amd64        monitor server for the ceph storage system
ii  ceph-osd                         15.2.3-0ubuntu0.20.04.2~cloud0              amd64        OSD server for the ceph storage system
ii  libcephfs2                       15.2.3-0ubuntu0.20.04.2~cloud0              amd64        Ceph distributed file system client library
ii  python3-ceph-argparse            15.2.3-0ubuntu0.20.04.2~cloud0              amd64        Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common              15.2.3-0ubuntu0.20.04.2~cloud0              all          Python 3 utility libraries for Ceph
ii  python3-cephfs                   15.2.3-0ubuntu0.20.04.2~cloud0              amd64        Python 3 libraries for the Ceph libcephfs library
ii  python3-rados                    15.2.3-0ubuntu0.20.04.2~cloud0              amd64        Python 3 libraries for the Ceph librados library
ii  python3-rbd                      15.2.3-0ubuntu0.20.04.2~cloud0              amd64        Python 3 libraries for the Ceph librbd library


Then I tested the cluster as noted in the description (set `osd_inject_bad_map_crc_probability` to 1 on one OSD and then restarted a different OSD). No OSD crash occurred and the cluster remained healthy.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1891567

Title:
  [SRU] ceph_osd crash in _committed_osd_maps when failed to encode
  first inc map

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Invalid
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Focal:
  Fix Committed
Status in ceph source package in Groovy:
  Fix Released

Bug description:
  [Impact]
  Upstream tracker: issue#46443 [0].

  The ceph-osd service can crash when processing osd map updates.

  When the OSD encounters a CRC error while processing an incremental
  map update, it will request a full map update from its peers. In this
  code path, a recently introduced uninitialized variable is
  dereferenced, causing a crash.

  The uninitialized variable was introduced in nautilus 14.2.10, and
  octopus 15.2.1.

  [Test Case]
  # Inject osd_inject_bad_map_crc_probability = 1
  sudo ceph daemon osd.{id} config set osd_inject_bad_map_crc_probability 1

  # Trigger some osd map updates by restarting a different osd
  sudo systemctl restart ceph-osd@{diff-id}

  [Regression Potential]
  The code has been updated to leave handle_osd_maps() early if a CRC error is encountered, preventing the map commit when the failure occurs while processing an incremental map update. This makes the full map update take longer, but it prevents the crash reported in this bug. Additionally, _committed_osd_maps() now asserts that first <= last, although that code path is not expected to be reached in practice.
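
  The control flow described above can be sketched as follows. This is a
  minimal illustration, not Ceph's actual code: the names OSDMapStub,
  decode_incremental, and the return convention are all hypothetical
  stand-ins for the real handle_osd_maps()/_committed_osd_maps() logic.

  ```cpp
  #include <cassert>
  #include <optional>

  // Hypothetical stand-in for an OSD map (not Ceph's real OSDMap type).
  struct OSDMapStub { unsigned epoch = 0; };

  // Decode one incremental map; returns nothing when the CRC check fails.
  std::optional<OSDMapStub> decode_incremental(unsigned epoch, bool crc_ok) {
      if (!crc_ok)
          return std::nullopt;  // CRC mismatch: caller must fetch full maps
      return OSDMapStub{epoch};
  }

  // Sketch of the fixed control flow: bail out before committing anything
  // when an incremental map fails its CRC check, so the code that
  // dereferences `last_map` is only reached when it was actually set.
  // Returns the last committed epoch (first - 1 if nothing was committed).
  unsigned handle_osd_maps(unsigned first, unsigned last, bool crc_ok) {
      assert(first <= last);         // sanity check, as in the fix
      std::optional<OSDMapStub> last_map;
      for (unsigned e = first; e <= last; ++e) {
          auto m = decode_incremental(e, crc_ok);
          if (!m)
              return first - 1;      // early return: no commit, no crash
          last_map = *m;
      }
      assert(last_map.has_value());  // safe: every epoch decoded cleanly
      return last_map->epoch;
  }
  ```

  The point of the early return is that the pre-fix code fell through to
  the commit step even on CRC failure, dereferencing a variable that the
  failed decode had never initialized.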

  [Other Info]
  Upstream has released a fix for this issue in Nautilus 14.2.11. The SRU for this point release is being tracked by LP: #1891077

  Upstream has merged a fix for this issue in Octopus [1], but there is
  no current release target. The ceph packages in focal, groovy, and the
  ussuri cloud archive are exposed to this critical regression.

  [0] https://tracker.ceph.com/issues/46443
  [1] https://github.com/ceph/ceph/pull/36340

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1891567/+subscriptions



More information about the Ubuntu-openstack-bugs mailing list