[Bug 1844195] Re: beegfs-meta lockup with glibc 2.27 on bionic
Brian Koebbe
koebbe at wustl.edu
Tue Oct 8 12:59:12 UTC 2019
Really curious if this turns out to be a glibc problem or a beegfs-meta
problem, but we were able to hack together a workaround that forces
beegfs-meta to run against the older libc6 from xenial. beegfs-meta is
now stable and performing much better!
There's probably a simpler way to do this, but we did something along
these lines:
1. debootstrap xenial /srv/xenial-chroot
2. chroot into xenial-chroot, add beegfs to apt sources.list, apt install beegfs-meta, exit chroot
3. prepare a systemd ExecStartPre script "/usr/local/bin/setup-beegfs-meta-chroot.sh":
#!/bin/bash
set -e
# Keep the config inside the chroot in sync with the host copy.
cp /etc/beegfs/beegfs-meta.conf /srv/xenial-chroot/etc/beegfs/beegfs-meta.conf
# Bind-mount /proc, /sys and the metadata target into the chroot (idempotent).
mountpoint -q /srv/xenial-chroot/proc || mount --bind /proc /srv/xenial-chroot/proc
mountpoint -q /srv/xenial-chroot/sys || mount --bind /sys /srv/xenial-chroot/sys
mountpoint -q /srv/xenial-chroot/path/to/metadata || mount --bind /path/to/metadata /srv/xenial-chroot/path/to/metadata
4. copy /lib/systemd/system/beegfs-meta.service to /etc/systemd/system/beegfs-meta.service, adding the following to the [Service] section (a drop-in would also do the job; see the sketch after these steps):
RootDirectory=/srv/xenial-chroot
ExecStartPre=/usr/local/bin/setup-beegfs-meta-chroot.sh
RootDirectoryStartOnly=yes
5. systemctl daemon-reload and restart beegfs-meta
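For what it's worth, instead of copying the whole unit file, a systemd drop-in should give the same result. A rough, untested sketch using the same paths as above (the drop-in file name is just an example):
mkdir -p /etc/systemd/system/beegfs-meta.service.d
cat > /etc/systemd/system/beegfs-meta.service.d/xenial-chroot.conf <<'EOF'
[Service]
# Run the main daemon inside the xenial chroot; with RootDirectoryStartOnly=yes
# the ExecStartPre script still runs outside the chroot so it can set it up.
RootDirectory=/srv/xenial-chroot
ExecStartPre=/usr/local/bin/setup-beegfs-meta-chroot.sh
RootDirectoryStartOnly=yes
EOF
systemctl daemon-reload
systemctl restart beegfs-meta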
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1844195
Title:
beegfs-meta lockup with glibc 2.27 on bionic
Status in glibc package in Ubuntu:
Confirmed
Bug description:
Bug report: Lock up of beegfs-meta with glibc 2.27
Affected system:
Release: Ubuntu 18.04.3 bionic
Kernel: 4.15.0-62-generic
libc6: 2.27-3ubuntu1
beegfs: 7.1.3
We have discovered an issue we believe to be a bug in the version of glibc in
Ubuntu 18.04 that causes a beegfs-meta service to lock up and become
unresponsive. (https://www.beegfs.io/)
The issue has also been observed in three other installations, all running
Ubuntu 18.04, and does not occur on Ubuntu 16.04 or RHEL/CentOS 6 or 7.
The affected processes resume normal operation almost immediately after a tool
like strace or gdb is attached to them, and then continue to run normally for
some time until they get stuck again. In the short period between attaching
strace and the process resuming normal operation we see messages like
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
together with a CPU load of 100% on one core. After the process gets unstuck we see
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9cb0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, 0xffffffff <unfinished ...>
38231 futex(0x5597341d9cb0, FUTEX_WAKE_PRIVATE, 2147483647) = 2
38371 <... futex resumed> ) = 0
38371 futex(0x5597341d9cb0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, 0xffffffff <unfinished ...>
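The exact strace invocation is not given in the report; output like the above can be captured by attaching to the running daemon along these lines (the futex filter and output file are only illustrative):
# attach to all threads of the running daemon, timestamp each call,
# and record only futex syscalls
strace -f -tt -e trace=futex -p "$(pidof beegfs-meta)" -o /tmp/beegfs-meta.futex.log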
We found this patch [1] to glibc that might be related to the issue and built
our own version of the official glibc package with only the following diff
applied. All other changes in the upstream patch only touch tests, the
Makefile rules that build those tests, and the changelog, so we skipped them
for the sake of applying the patch cleanly to the Ubuntu glibc.
index 5dd5342..85fc1bc 100644
--- a/nptl/pthread_rwlock_common.c
+++ b/nptl/pthread_rwlock_common.c
@@ -314,7 +314,7 @@ __pthread_rwlock_rdlock_full (pthread_rwlock_t *rwlock,
harmless because the flag is just about the state of
__readers, and all threads set the flag under the same
conditions. */
- while ((atomic_load_relaxed (&rwlock->__data.__readers)
+ while (((r = atomic_load_relaxed (&rwlock->__data.__readers))
& PTHREAD_RWLOCK_RWAITING) != 0)
{
int private = __pthread_rwlock_get_private (rwlock);
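The report does not describe how the patched packages were built. Rebuilding the Ubuntu glibc with such a diff applied would look roughly like the following; the patch file name is made up here, and the glibc packaging keeps its own patch queue under debian/patches, so the details may differ:
apt-get source glibc                 # fetch and unpack the Ubuntu source package
sudo apt-get build-dep glibc         # install the build dependencies
cd glibc-2.27
# add the reduced upstream diff to the packaging patch queue
cp ../rwlock-rwaiting-fix.diff debian/patches/any/rwlock-rwaiting-fix.diff
echo 'any/rwlock-rwaiting-fix.diff' >> debian/patches/series
dch --local +rwlockfix 'Apply upstream pthread_rwlock RWAITING fix'
dpkg-buildpackage -us -uc -b         # rebuild the binary packages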
Unfortunately the lockups did not stop after we installed the patched package
versions and restarted our services. The only thing we noticed was that during
the lockups, we could not observe high CPU load any more.
We were able to record backtraces of all of the threads in our stuck processes
before and after applying the patch. The traces are attached to this report.
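The commands used to record the backtraces are not stated; per-thread backtraces of a running process can be captured non-interactively with gdb, for example:
# attach, dump a backtrace of every thread, then detach
gdb -batch -p "$(pidof beegfs-meta)" -ex 'thread apply all bt' > beegfs-meta-backtraces.txt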
Additionally, to rule out other causes, we examined the internal mutexes and
condition variables to check for deadlocks or livelocks produced at the
application level (BeeGFS routines). We could not find any.
If you need additional information or testing, we would be happy to provide you
with what we can to help solve this issue.
[1]
https://sourceware.org/git/?p=glibc.git;a=commit;h=f21e8f8ca466320fed38bdb71526c574dae98026