[Bug 1969000] Re: [SRU] bail from handle_command() if _generate_command_map() fails
nikhil kshirsagar
1969000 at bugs.launchpad.net
Thu Apr 6 09:58:11 UTC 2023
This has not made it into the Ubuntu quincy point release yet.
# pull-lp-source ceph jammy
Found ceph 17.2.5-0ubuntu0.22.04.2 in jammy
Downloading ceph_17.2.5-0ubuntu0.22.04.2.dsc from ports.ubuntu.com (0.010 MiB)
[=====================================================>]100%
Good signature by James Page <james.page at ubuntu.com> (0xBFECAECBA0E7D8C3)
Downloading ceph_17.2.5.orig.tar.xz from ports.ubuntu.com (112.261 MiB)
[=====================================================>]100%
Downloading ceph_17.2.5-0ubuntu0.22.04.2.debian.tar.xz from ports.ubuntu.com (0.121 MiB)
[=====================================================>]100%
Checking whether this code has made it in
(https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1969000/+attachment/5655825/+files/focal_debdiff_octopus)
I see, in src/mon/Monitor.cc for the quincy sources:

  // Catch bad_cmd_get exception if _generate_command_map() throws it
  try {
    _generate_command_map(cmdmap, param_str_map);
  }
  catch (bad_cmd_get& e) {
    reply_command(op, -EINVAL, e.what(), 0);
  }
---
Nor has it made it into pacific yet:
# pull-uca-source ceph xena
Found ceph 16.2.11-0ubuntu0.21.10.1~cloud0 in focal
Downloading ceph_16.2.11-0ubuntu0.21.10.1~cloud0.dsc from ubuntu-cloud.archive.canonical.com (0.009 MiB)
[=====================================================>]100%
Good signature by James Page <james.page at ubuntu.com> (0xBFECAECBA0E7D8C3)
Downloading ceph_16.2.11.orig.tar.xz from ubuntu-cloud.archive.canonical.com (100.423 MiB)
[=====================================================>]100%
Downloading ceph_16.2.11-0ubuntu0.21.10.1~cloud0.debian.tar.xz from ubuntu-cloud.archive.canonical.com (0.113 MiB)
[=====================================================>]100%
  // Catch bad_cmd_get exception if _generate_command_map() throws it
  try {
    _generate_command_map(cmdmap, param_str_map);
  }
  catch (bad_cmd_get& e) {
    reply_command(op, -EINVAL, e.what(), 0);
  }
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1969000
Title:
[SRU] bail from handle_command() if _generate_command_map() fails
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive ussuri series:
New
Status in ceph package in Ubuntu:
New
Status in ceph source package in Focal:
New
Status in ceph source package in Impish:
New
Status in ceph source package in Jammy:
New
Status in ceph source package in Kinetic:
New
Bug description:
[Impact]
If improper JSON data is passed to rados via a manual curl command, or invalid JSON data via a script like the Python example shown below, it can end up crashing the mon.
[Test Plan]
Set up a ceph octopus cluster. A manual run of curl with a malformed request like this results in the crash:
curl -k -H "Authorization: Basic $TOKEN"
"https://juju-3b3d82-10-lxd-0:8003/request" -X POST -d
'{"prefix":"auth add","entity":"client.testuser02","caps":"mon
'\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}'
The request status shows the command is still in the queue if you check with:
curl -k -X GET "$endpoint/request"
[
  {
    "failed": [],
    "finished": [],
    "has_failed": false,
    "id": "140576245092648",
    "is_finished": false,
    "is_waiting": false,
    "running": [
      {
        "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
        "outb": "",
        "outs": ""
      }
    ],
    "state": "pending",
    "waiting": []
  }
]
This reproduces without restful API too.
Use this Python script to reproduce the issue. Run it on the mon node:
root at juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
#!/usr/bin/env python3
import json
import rados
c = rados.Rados(conffile='/etc/ceph/ceph.conf')
c.connect()
cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
print(c.mon_command(cmd, b''))
root at juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
cluster:
id: 6123c916-a12a-11ec-bc02-fa163e9f86e0
health: HEALTH_WARN
mon is allowing insecure global_id reclaim
1 monitors have not enabled msgr2
Reduced data availability: 69 pgs inactive
1921 daemons have recently crashed
services:
mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
osd: 3 osds: 3 up (since 22h), 3 in
data:
pools: 4 pools, 69 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
69 unknown
root at juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
^C
(note the script hangs)
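For comparison, here is a sketch of how the same request can be encoded well-formed. The caps value in the reproducer above is a single string carrying leftover shell quoting; my assumption (based on the mon command table declaring caps as a repeated string argument) is that the well-formed JSON encoding is a list of strings, which does not trip _generate_command_map():

```python
import json

# The reproducer above sends caps as one string (with leftover shell
# quoting); this is the form that crashed unpatched octopus mons.
malformed = {"prefix": "auth add",
             "entity": "client.testuser02",
             "caps": "mon '\\''allow r'\\'' osd '\\''allow rw pool=testpool01'\\''"}

# Assumption: caps is a repeated string argument in the mon command
# table, so a JSON list of strings is the well-formed encoding.
wellformed = {"prefix": "auth add",
              "entity": "client.testuser02",
              "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"]}

print(json.dumps(malformed))
print(json.dumps(wellformed))
```

Either string would be passed to rados.Rados.mon_command() exactly as in the script above; only the caps encoding differs.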
The mon logs (https://pastebin.com/Cuu9jkmu) show the crash; systemd then
restarts ceph, so ceph -s hangs for a while until we see restart
messages like:
--- end dump of recent events ---
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
While the fix to catch the exception is already part of the Octopus 15.2.17 point release (PR https://github.com/ceph/ceph/pull/45891),
we also need a cleanup fix that has now been merged upstream - https://github.com/ceph/ceph/pull/45547
The cleanup fix bails out of void Monitor::handle_command(MonOpRequestRef op)
if the exception is thrown, rather than continuing through the function
in this error situation.
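The shape of the change, shown here as a minimal Python sketch of the C++ control flow (the names generate_command_map/handle_command are simplified stand-ins, not the actual Monitor code): before the cleanup, the catch block replied with -EINVAL but fell through and kept processing; the fix adds an early return.

```python
import errno

class BadCmdGet(Exception):
    """Stand-in for the C++ bad_cmd_get exception."""

def generate_command_map(cmd):
    # Hypothetical stand-in for _generate_command_map(): reject a caps
    # value that is not in the expected list form.
    if not isinstance(cmd.get("caps"), list):
        raise BadCmdGet("invalid caps argument")
    return dict(cmd)

def handle_command(cmd, replies):
    # Mirrors the try/catch shape in Monitor::handle_command().
    try:
        cmdmap = generate_command_map(cmd)
    except BadCmdGet as e:
        replies.append((-errno.EINVAL, str(e)))
        return  # the cleanup fix: bail out instead of falling through
    # Without the early return above, execution would wrongly continue
    # here with an invalid cmdmap.
    replies.append((0, "ok"))

replies = []
handle_command({"prefix": "auth add", "caps": "mon 'allow r'"}, replies)
handle_command({"prefix": "auth add", "caps": ["mon", "allow r"]}, replies)
```

The first call exercises the error path (reply then return); the second runs the normal path.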
[Where problems could occur]
The only potential problem with this cleanup fix is if
some additional code in the void Monitor::handle_command(MonOpRequestRef op) function needs to run before bailing out. I have looked for such potential conditions and not found any.
[Other Info]
Upstream tracker - https://tracker.ceph.com/issues/57859
Fixed in ceph main through https://github.com/ceph/ceph/pull/48044
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1969000/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list