[Bug 1969000] Re: [SRU] bail from handle_command() if _generate_command_map() fails
nikhil kshirsagar
1969000 at bugs.launchpad.net
Mon Mar 6 11:32:09 UTC 2023
https://github.com/ceph/ceph/pull/48845 and
https://github.com/ceph/ceph/pull/48846 are merged upstream. Once they
are available in the Ubuntu ceph pacific and quincy point releases, this
Octopus SRU can be submitted.
(I will raise in Thursday's ceph office hours whether I should just SRU
this one-liner fix to Ubuntu P/Q if the point releases are some time
away.)
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1969000
Title:
[SRU] bail from handle_command() if _generate_command_map() fails
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive ussuri series:
New
Status in ceph package in Ubuntu:
New
Status in ceph source package in Focal:
New
Status in ceph source package in Impish:
New
Status in ceph source package in Jammy:
New
Status in ceph source package in Kinetic:
New
Bug description:
[Impact]
If improper JSON data is passed to rados using a manual curl command, or invalid JSON is sent through a script like the Python example shown below, it can end up crashing the mon.
[Test Plan]
Set up a ceph Octopus cluster. A manual run of curl with a malformed request like the following results in the crash:
curl -k -H "Authorization: Basic $TOKEN" \
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST \
  -d '{"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}'
The request status shows it is still in the queue if you check with:
curl -k -X GET "$endpoint/request"
[
  {
    "failed": [],
    "finished": [],
    "has_failed": false,
    "id": "140576245092648",
    "is_finished": false,
    "is_waiting": false,
    "running": [
      {
        "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
        "outb": "",
        "outs": ""
      }
    ],
    "state": "pending",
    "waiting": []
  }
]
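The same check can be scripted. A minimal sketch using python-requests, assuming the same endpoint and Basic-auth token as the curl commands above (ENDPOINT and TOKEN are placeholders):

#!/usr/bin/env python3
# Sketch only: poll the restful API request queue, equivalent to the
# curl check above. ENDPOINT and TOKEN are placeholders.
import requests

ENDPOINT = "https://juju-3b3d82-10-lxd-0:8003"
TOKEN = "..."  # same Basic-auth token as in the curl examples

resp = requests.get(ENDPOINT + "/request",
                    headers={"Authorization": "Basic " + TOKEN},
                    verify=False)  # matches curl -k
# Because the mon has crashed, the request never leaves "pending".
for req in resp.json():
    print(req["id"], req["state"])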
This reproduces without the restful API too. Use this Python script to
reproduce the issue; run it on the mon node:
root@juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
#!/usr/bin/env python3
import json
import rados
# Connect to the cluster via the local ceph.conf.
c = rados.Rados(conffile='/etc/ceph/ceph.conf')
c.connect()
# The caps value is deliberately malformed: it still carries the
# shell-quoting debris ('\'') from the curl command above, which is
# what _generate_command_map() chokes on in the mon.
cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
print(c.mon_command(cmd, b''))
root@juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
  cluster:
    id:     6123c916-a12a-11ec-bc02-fa163e9f86e0
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim
            1 monitors have not enabled msgr2
            Reduced data availability: 69 pgs inactive
            1921 daemons have recently crashed

  services:
    mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
    mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
    osd: 3 osds: 3 up (since 22h), 3 in

  data:
    pools:   4 pools, 69 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             69 unknown
root@juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
^C
(note the script hangs)
The mon logs (https://pastebin.com/Cuu9jkmu) show the crash; systemd
then appears to restart the ceph-mon, so ceph -s hangs for a while, and
then we see restart messages like:
--- end dump of recent events ---
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
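For contrast, the same request with a well-formed caps argument does not trigger the crash. A minimal sketch (client.testuser03 is a hypothetical, unused entity; caps is passed as the list of type/value pairs the auth commands expect, rather than one quote-mangled string):

#!/usr/bin/env python3
# Contrast case (sketch): the same "auth add" with well-formed caps.
import json
import rados

c = rados.Rados(conffile='/etc/ceph/ceph.conf')
c.connect()
cmd = json.dumps({
    "prefix": "auth add",
    "entity": "client.testuser03",  # hypothetical, unused entity
    # caps as alternating type/value strings, no stray shell quoting
    "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"],
})
ret, outbuf, outs = c.mon_command(cmd, b'')
print(ret, outs)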
While the fix to catch the exception is already part of the Octopus 15.2.17 point release (PR https://github.com/ceph/ceph/pull/45891),
we also need a cleanup fix that has now been merged upstream: https://github.com/ceph/ceph/pull/45547
The cleanup fix bails out of void Monitor::handle_command(MonOpRequestRef op)
as soon as the exception is thrown, instead of continuing through the
rest of the function in this error situation.
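In effect, the change adds an early return. A minimal, self-contained Python sketch of the control flow (the real fix is C++ in Monitor::handle_command; the function names below are illustrative stand-ins, not Ceph APIs):

#!/usr/bin/env python3
import errno
import json

def generate_command_map(cmd_json):
    # Stand-in for _generate_command_map(): raises on malformed input,
    # as the C++ version does when the supplied JSON is invalid.
    cmd = json.loads(cmd_json)
    for key, val in cmd.items():
        if not isinstance(val, (str, list)):
            raise ValueError("bad arg for %r" % key)
    return cmd

def handle_command(cmd_json):
    try:
        cmd_map = generate_command_map(cmd_json)
    except Exception as e:
        # The fix: reply with EINVAL and bail out here, instead of
        # falling through with a half-initialized command map.
        return -errno.EINVAL, "error: %s" % e
    # ...the rest of the handler only runs with a valid map...
    return 0, "handled: %s" % cmd_map.get("prefix")

print(handle_command('{"prefix": "auth add", "entity": 1}'))  # bails out
print(handle_command('{"prefix": "status"}'))                 # proceeds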
[Where problems could occur]
The only potential problem with this cleanup fix is if some additional code in the void Monitor::handle_command(MonOpRequestRef op) function needs to run before bailing out of it. I have looked for such potential conditions and not found any.
[Other Info]
Upstream tracker - https://tracker.ceph.com/issues/57859
Fixed in ceph main through https://github.com/ceph/ceph/pull/48044
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1969000/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list