[Bug 1978913] Re: [SRU] ceph-osd takes all memory at boot

Wed Apr 12 09:58:41 UTC 2023

I have retested on Octopus and Pacific, and I see the expected results.
The missing factor in the earlier testing was simply enough IO to give a
good chance of hitting the particular PG into which we were injecting
the dups. With enough IO, I see the expected results on Octopus too:

https://pastebin.canonical.com/p/Ksd6ZqxpDK/

I will remove the verification-failed-focal after confirming with
Dongdong and Dan.

https://bugs.launchpad.net/bugs/1978913

Title:
  [SRU] ceph-osd takes all memory at boot

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
  Invalid
Status in Ubuntu Cloud Archive xena series:
  Invalid
Status in Ubuntu Cloud Archive yoga series:
  Invalid
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Bionic:
  New
Status in ceph source package in Focal:
  Confirmed
Status in ceph source package in Jammy:
  Invalid
Status in ceph source package in Kinetic:
  Invalid

Bug description:
  [Impact]
  The OSD can fail to trim the PG log dup entries, which can leave millions of dup entries in a PG when there should be at most 3000 (controlled by the option osd_pg_log_dups_tracked).

  This can cause the OSD to run out of memory and crash, and it might
  not be able to start up again because it has to load millions of dup
  entries. This can happen to multiple OSDs at the same time (as also
  reported by many community users), so we may end up with a completely
  unusable cluster if we hit this issue.

  The currently known trigger for this problem is a PG split, because all dup entries are copied during the split. We did not observe this as often before because the PG autoscaler was not turned on by default; it has been on by default since Octopus.

  Note that there is also no way to check the number of dups in a PG
  online.

  [Test Plan]
  To see the problem, follow this approach on a test cluster with, e.g., 3 OSDs:

  #ps -eaf | grep osd
  root      334891       1  0 Sep21 ?        00:42:03 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
  root      335541       1  0 Sep21 ?        00:40:20 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf

  Kill all the OSDs so that they are down:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ceph -s
  2022-09-22T08:26:15.120+0000 7fa9694fe700 -1 WARNING: all dangerous and experimental features are enabled.
  2022-09-22T08:26:15.140+0000 7fa963fff700 -1 WARNING: all dangerous and experimental features are enabled.
    cluster:
      id:     9e7c0a82-8072-4c48-b697-1e6399b4fc9e
      health: HEALTH_WARN
              2 osds down
              1 host (3 osds) down
              1 root (3 osds) down
              Reduced data availability: 169 pgs stale
              Degraded data redundancy: 255/765 objects degraded (33.333%), 64 pgs degraded, 169 pgs undersized

    services:
      mon: 3 daemons, quorum a,b,c (age 3s)
      mgr: x(active, since 28h)
      mds: a:1 {0=a=up:active}
      osd: 3 osds: 0 up (since 83m), 2 in (since 91m)
      rgw: 1 daemon active (8000)

    task status:

    data:
      pools:   7 pools, 169 pgs
      objects: 255 objects, 9.5 KiB
      usage:   4.1 GiB used, 198 GiB / 202 GiB avail
      pgs:     255/765 objects degraded (33.333%)
               105 stale+active+undersized
               64  stale+active+undersized+degraded

  Then inject dups into all OSDs using this JSON file:

  root@nikhil-Lenovo-Legion-Y540-15IRH-PG0:/home/nikhil/HDD_MOUNT/Downloads/ceph_build_oct/ceph/build# cat bin/dups.json
  [
   {"reqid": "client.4177.0:0",
   "version": "3'0",
   "user_version": "0",
   "generate": "500000",
   "return_code": "0"}
  ]
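
  The injection file above can also be generated rather than written by
  hand. The following is a minimal sketch, assuming only the field names
  shown in dups.json; the helper names build_dups_spec and
  write_dups_spec are hypothetical and not part of any Ceph tooling.

```python
import json


def build_dups_spec(client_id=4177, epoch=3, count=500000):
    """Return a one-element list mirroring the dups.json shown above.

    All values are strings, as in the original file. The parameter
    names are illustrative assumptions, not Ceph option names.
    """
    return [{
        "reqid": f"client.{client_id}.0:0",
        "version": f"{epoch}'0",
        "user_version": "0",
        "generate": str(count),
        "return_code": "0",
    }]


def write_dups_spec(path, **kwargs):
    """Write the spec to `path` as JSON."""
    with open(path, "w") as fh:
        json.dump(build_dups_spec(**kwargs), fh, indent=1)
```

  For example, write_dups_spec("bin/dups.json") would write a file with
  the same values as the one used in this test plan.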

  Use the ceph-objectstore-tool with the --pg-log-inject-dups parameter,
  to inject dups for all OSDs.

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd0/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd2/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e

  Then set the OSD debug level to 20. The log line emitted when a dup is
  actually trimmed is here:
  https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138
  and seeing it requires debug_osd = 20.

  Set debug osd=20 in the global section of ceph.conf:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# cat ceph.conf | grep "debug osd"
          debug osd=20

  Then bring up the OSDs:

  /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf

  /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 1 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf

  /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf

  Run some IO on the OSDs. Wait at least a few hours.

  Then take the OSDs down (so that the command below can be run), and run:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1e --op log > op.log

  At the end of that output in op.log, you will see that the number of
  dups is still what it was when they were injected (no trimming has
  taken place):

              {
                  "reqid": "client.4177.0:0",
                  "version": "3'499999",
                  "user_version": "0",
                  "return_code": "0"
              },
              {
                  "reqid": "client.4177.0:0", <-- note the id (4177)
                  "version": "3'500000", <---
                  "user_version": "0",
                  "return_code": "0"
              }
          ]
      },
      "pg_missing_t": {
          "missing": [],
          "may_include_deletes": true
      }
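
  Scrolling to the end of op.log works, but the dups can also be counted
  directly. The sketch below assumes that op.log holds a single JSON
  document containing a "dups" list somewhere in its structure, as the
  excerpt above suggests. It searches for that key rather than assuming
  the name of the enclosing object. The function names are hypothetical,
  and the default limit of 3000 is the value given under [Impact].

```python
import json


def find_dups(node):
    """Depth-first search for the first "dups" list in parsed JSON."""
    if isinstance(node, dict):
        if isinstance(node.get("dups"), list):
            return node["dups"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_dups(child)
        if found is not None:
            return found
    return None


def summarize_dups(doc, limit=3000):
    """Report how many dups a parsed log dump holds and its version range."""
    dups = find_dups(doc) or []
    return {
        "count": len(dups),
        "first_version": dups[0]["version"] if dups else None,
        "last_version": dups[-1]["version"] if dups else None,
        "over_limit": len(dups) > limit,
    }


def summarize_file(path, limit=3000):
    """Load a dump such as op.log and summarize it."""
    with open(path) as fh:
        return summarize_dups(json.load(fh), limit)
```

  Calling summarize_file("op.log") on an unpatched OSD would be expected
  to report a count far above the limit, assuming the file parses as
  JSON.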

  To verify the patch:
  With the patch in place, once the dups are injected, the output of ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log will again show the dups (run this command with the OSDs down, as before).

  Then bring up the OSDs and start IO using rbd bench-write. Leave the
  IO running for a few hours, until log lines like the ones below
  (https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138)
  appear in the OSD logs, with the same client ID (4177 in my example)
  as the one used when injecting the dups:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build/out# cat osd.1.log | grep -i "trim dup "  | grep 4177 | more

  2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)
  ...
  ...
  2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)

  # grep -ri "trim dup " *.log | grep 4177 | wc -l
  390001 <-- total across all OSDs combined; with 3 OSDs, for example, this should be roughly 3x the per-OSD figure seen in the output below (dups trimmed up to 130001).
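
  The same count can be broken down per log file, which makes it easier
  to compare each OSD against the per-OSD figure. This is a minimal
  sketch, assuming the log line format shown above; it matches the
  client ID inside the reqid field, so it is slightly stricter than the
  grep pipeline, which matches 4177 anywhere on the line. The function
  names are hypothetical.

```python
import re

# Matches lines such as:
#   ... 1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)
TRIM_DUP = re.compile(r"trim dup log_dup\(reqid=client\.(\d+)\.")


def count_trimmed(lines, client_id):
    """Count "trim dup" lines whose reqid belongs to client_id."""
    total = 0
    for line in lines:
        match = TRIM_DUP.search(line)
        if match and int(match.group(1)) == client_id:
            total += 1
    return total


def count_trimmed_per_file(paths, client_id):
    """Return {path: count} for each OSD log file given."""
    counts = {}
    for path in paths:
        with open(path, errors="replace") as fh:
            counts[path] = count_trimmed(fh, client_id)
    return counts
```

  For example, count_trimmed_per_file(["osd.0.log", "osd.1.log",
  "osd.2.log"], 4177) would give one count per OSD, and their sum should
  be close to the grep total.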

  The output of ./bin/ceph-objectstore-tool --data-path dev/osd1/
  --no-mon-config --pgid 2.1f --op log (take that particular OSD down to
  verify this) will show that the first batch of dups (130k in this
  example) has already been trimmed. See the "version" field, which now
  starts at 3'130001 instead of 0:

   "dups": [
              {
                  "reqid": "client.4177.0:0",
                  "version": "3'130001", <----
                  "user_version": "0",
                  "return_code": "0"
              },
              {
                  "reqid": "client.4177.0:0",
                  "version": "3'130002",
                  "user_version": "0",
                  "return_code": "0"
              },

  This will verify that the dups are being trimmed by the patch, and it
  is working correctly. And of course, OSDs should not go OOM at boot
  time!

  [Where problems could occur]
  This is not a clean cherry-pick due to some differences between the octopus and master codebases related to RocksDBStore and Objectstore (see https://github.com/ceph/ceph/pull/47046#issuecomment-1243252126).

  Also, an earlier attempt to fix this issue upstream was reverted, as
  discussed at
  https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1978913/comments/1

  While this fix has been tested and validated after building it into
  the upstream 15.2.17 release (please see the [Test Plan] section), we
  still need to proceed with extreme caution: allow some time for
  problems (if any) to surface before going ahead with this SRU, and run
  our QA tests on the packages that build this fix into the 15.2.17
  release before releasing it to the customer who awaits this fix on
  octopus.

  [Other Info]
  The way this is fixed is that the PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.
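
  The difference between the two trimming strategies can be shown with a
  toy model. This is an illustration of the idea only, not the Ceph
  PGLog code: dups are modelled as a list of integer versions, and the
  function names and the cutoff parameter are hypothetical.

```python
from collections import deque


def trim_by_version(dups, cutoff):
    """Drop entries whose version is below cutoff.

    How many entries survive depends on the data, so the result has no
    size bound.
    """
    return [v for v in dups if v >= cutoff]


def trim_by_count(dups, max_tracked):
    """Drop the oldest entries until at most max_tracked remain."""
    kept = deque(dups)
    while len(kept) > max_tracked:
        kept.popleft()
    return list(kept)
```

  With ten entries, versions 0 to 9, a version cutoff of 2 leaves eight
  entries behind, while a count limit of 3 always leaves three. The
  count-based form bounds memory regardless of which versions the
  entries carry.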

  Reported upstream at https://tracker.ceph.com/issues/53729 and fixed
  on master through https://github.com/ceph/ceph/pull/47046
