[Bug 1432786] Re: upgrade of ceph in vivid breaks qemu-system-arm

Bug Watch Updater 1432786 at bugs.launchpad.net
Fri Oct 27 17:26:28 UTC 2017


Launchpad has imported 27 comments from the remote bug at
https://bugzilla.redhat.com/show_bug.cgi?id=1222286.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2015-05-17T09:33:29+00:00 Richard wrote:

Description of problem:

My ARMv7 server is broken at the moment so I can't try this, but
it seems as if the simple command:

/usr/bin/qemu-system-arm -help

hangs (on ARMv7 hardware).

See this build log for an example:

https://kojipkgs.fedoraproject.org//work/tasks/3698/9763698/build.log

Version-Release number of selected component (if applicable):

qemu 2:2.3.0-3.fc23.armv7hl

How reproducible:

Unknown.

Steps to Reproduce:
1. Run `qemu-system-arm -help'

Actual results:

Hangs.

Expected results:

Should print help output.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/3

------------------------------------------------------------------------
On 2015-05-17T09:55:27+00:00 Richard wrote:

Another build and again it's hanging in the same place, but again
only on armv7hl:
http://koji.fedoraproject.org/koji/taskinfo?taskID=9767065
https://kojipkgs.fedoraproject.org//work/tasks/7067/9767067/build.log

This works fine on x86-64 & i686.

The configure.ac fragment is very simple so it's not likely to
be a problem with autoconf, although I guess anything is possible:

    AC_MSG_CHECKING([that $QEMU -help works])
    if $QEMU -help >&AS_MESSAGE_LOG_FD 2>&1; then
        AC_MSG_RESULT([yes])
    else
        AC_MSG_RESULT([no])
        AC_MSG_FAILURE(<error message omitted>)
    fi

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/4

------------------------------------------------------------------------
On 2015-05-17T13:52:22+00:00 Richard wrote:

strace output is here:
https://kojipkgs.fedoraproject.org//work/tasks/7470/9767470/build.log

It's frankly bizarre.  The last system calls are:

open("/dev/urandom", O_RDONLY|O_LARGEFILE) = 3
fcntl64(3, F_GETFD)                     = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
fstat64(3, {st_mode=S_IFCHR|0444, st_rdev=makedev(1, 9), ...}) = 0
clock_gettime(CLOCK_REALTIME, {1431870340, 192190777}) = 0
getrusage(0x1 /* RUSAGE_??? */, {ru_utime={0, 330000}, ru_stime={0, 55000}, ...}) = 0
read(3, "*\222:\253\26\211\313\245\6KF\256Bm\242\363s\202>`u\223\6*1\351\234\324E\362\347j", 32) = 32
read(3, "83\313y\206\10\316^\226\342&\266;R3\215\375\350\227\363+\16?\210LTy\345v.\251\243", 32) = 32
read(3, "\231\302\270,\311\271z\324", 8) = 8
gettimeofday({1431870340, 198569}, NULL) = 0
gettimeofday({1431870340, 198647}, NULL) = 0

... and gettimeofday repeats in a loop forever.  The context looks as
if it is a bug in libselinux.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/5

------------------------------------------------------------------------
On 2015-05-17T15:41:44+00:00 Richard wrote:

According to gdb, this is the backtrace:

warning: Probes-based dynamic linker interface failed.
Reverting to original interface.
Cannot parse expression `.L1046 4 at r4'.
#0  0xb6fd10b0 in dl_main () from /lib/ld-linux-armhf.so.3
#1  0xb6fe6670 in _dl_sysdep_start () from /lib/ld-linux-armhf.so.3
#2  0xb6fd23a8 in _dl_start () from /lib/ld-linux-armhf.so.3
#3  0xb6fcdb50 in _start () from /lib/ld-linux-armhf.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/6

------------------------------------------------------------------------
On 2015-05-17T16:01:56+00:00 Richard wrote:

OK I believe the code is from libnettle (gnutls):

lib/nettle/rnd-common.c:_rnd_system_entropy_init contains

        _gnutls_urandom_fd = open("/dev/urandom", O_RDONLY);
        if (_gnutls_urandom_fd < 0) {
                _gnutls_debug_log("Cannot open urandom!\n");
                goto fallback;
        }

        old = fcntl(_gnutls_urandom_fd, F_GETFD);
        if (old != -1)
                fcntl(_gnutls_urandom_fd, F_SETFD, old | FD_CLOEXEC);

        if (fstat(_gnutls_urandom_fd, &st) >= 0) {
                _gnutls_urandom_fd_mode = st.st_mode;
        }

lib/nettle/rnd-common.c:_rnd_get_event contains

        gettime(&e->now);

#ifdef HAVE_GETRUSAGE
        if (getrusage(ARG_RUSAGE, &e->rusage) < 0) {
                _gnutls_debug_log("getrusage failed: %s\n",
                          strerror(errno));
        }
#endif

which looks identical to the sequence of syscalls being made.  It's
still a mystery where the infinite loop happens.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/7

------------------------------------------------------------------------
On 2015-05-17T16:14:34+00:00 Richard wrote:

I'm still pretty sure this stack trace is wrong, but I got gdb
to print out the register contents:

warning: Probes-based dynamic linker interface failed.
Reverting to original interface.
Cannot parse expression `.L1046 4 at r4'.
r0             0xb6fff8f4	3070228724
r1             0x0	0
r2             0xb6fff908	3070228744
r3             0xb6fff8f4	3070228724
r4             0xb6fff8f4	3070228724
r5             0x0	0
r6             0xfffffffc	4294967292
r7             0xa3	163
r8             0xa3	163
r9             0x0	0
r10            0xb6fff908	3070228744
r11            0xbefff95c	3204446556
r12            0xb6ffefac	3070226348
sp             0xbefff8c0	0xbefff8c0
lr             0xb6fd10b0	-1224929104
pc             0xb6fd10b0	0xb6fd10b0 <dl_main+11624>
cpsr           0x60070010	1611071504
Thread 1 (process 804):
#0  0xb6fd10b0 in dl_main () from /lib/ld-linux-armhf.so.3
#1  0xb6fe6670 in _dl_sysdep_start () from /lib/ld-linux-armhf.so.3
#2  0xb6fd23a8 in _dl_start () from /lib/ld-linux-armhf.so.3
#3  0xb6fcdb50 in _start () from /lib/ld-linux-armhf.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/8

------------------------------------------------------------------------
On 2015-05-17T16:48:47+00:00 Richard wrote:

*If* the stack trace is to be believed, then it points to
the following code in glibc.  PC is marked with an arrow.

    40a0:       eb00376c        bl      11e58 <_dl_debug_initialize>
    40a4:       e580500c        str     r5, [r0, #12]
    40a8:       e1a04000        mov     r4, r0
    40ac:       eb003768        bl      11e54 <_dl_debug_state>
--> 40b0:       e320f000        nop     {0}
    40b4:       eb0052c7        bl      18bd8 <_dl_unload_cache>
    40b8:       e24bd020        sub     sp, fp, #32
    40bc:       e8bd8ff0        pop     {r4, r5, r6, r7, r8, r9, sl, fp, pc}

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/9

------------------------------------------------------------------------
On 2015-05-17T22:25:30+00:00 Nikos wrote:

The gnutls code whose syscalls are repeating runs within the library
constructor. Could it be that one of these calls triggers a dlopen on
that platform, which in turn causes the infinite loop?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/10

------------------------------------------------------------------------
On 2015-05-18T15:22:51+00:00 Richard wrote:

Finally I am able to reproduce this locally, so I'll attach a
debugger and find out what's really going on here.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/11

------------------------------------------------------------------------
On 2015-05-18T17:41:20+00:00 Richard wrote:

First observation is that the stack trace above *is* bogus.  qemu-system-arm
for some reason behaves differently under gdb, triggering some kind
of (gdb?) bug.  Please ignore the stack traces in comments 3, 5 & 6.
And my apologies for blaming gnutls.

The REAL stack trace points to ceph:

#0  0xb284846c in gettimeofday () from /lib/libc.so.6
#1  0xb405b868 in Cycles::init () at common/Cycles.cc:75
#2  0xb6b28500 in call_init.part () from /lib/ld-linux-armhf.so.3
#3  0xb6b28664 in _dl_init () from /lib/ld-linux-armhf.so.3
#4  0xb6b17b84 in _dl_start_user () from /lib/ld-linux-armhf.so.3
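
For context, Cycles::init here is Ceph's cycle-counter calibration,
and the backtrace shows it running from a shared-library constructor,
i.e. during _dl_init, before main() is reached.  A minimal sketch of
the suspected failure mode, assuming a calibration loop that times
gettimeofday() intervals against a cycle counter which never advances
on ARM (illustrative C only, not the actual common/Cycles.cc code):

    #include <stdint.h>
    #include <sys/time.h>

    /* Illustrative stand-in for the cycle-counter read: rdtsc on x86;
     * the hypothetical pre-fix ARM fallback sketched here never
     * advances, which is what makes the loop below unbreakable. */
    static uint64_t read_cycles(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    #else
        return 0;   /* counter stuck at zero */
    #endif
    }

    /* Calibration loop in the general shape of Cycles::init: time a
     * ~10ms burst against gettimeofday() and retry until two
     * consecutive estimates agree to within 0.1%. */
    static void cycles_init(void)
    {
        double old_estimate = 0;
        for (;;) {
            struct timeval start, stop;
            long micros;
            gettimeofday(&start, NULL);
            uint64_t c0 = read_cycles();
            do {
                gettimeofday(&stop, NULL);
                micros = (stop.tv_sec - start.tv_sec) * 1000000L
                         + (stop.tv_usec - start.tv_usec);
            } while (micros < 10000);
            double estimate = (double)(read_cycles() - c0) * 1e6 / micros;
            double delta = estimate / 1000.0;
            if (old_estimate > estimate - delta &&
                old_estimate < estimate + delta)
                return;   /* estimates converged */
            old_estimate = estimate;
        }
    }

With read_cycles() stuck at zero, every estimate is zero, the strict
inequalities never hold, and the loop repeats gettimeofday() forever,
matching the endless gettimeofday() calls in the strace above.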

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/12

------------------------------------------------------------------------
On 2015-05-18T18:01:46+00:00 Richard wrote:

A very trivial reproducer for this is:

$ echo 'main(){}' > test.c
$ gcc test.c -lrados -o test
$ ./test

The final command will hang forever on 32-bit ARM.
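
The reason even an empty main() hangs is that the loop runs from a
library constructor during dynamic linking, not from anything the
program itself calls.  A minimal sketch of the mechanism (hypothetical
example library, not Ceph code):

    /* hang.c: build with gcc -shared -fPIC hang.c -o libhang.so
     * Any program linked against libhang.so runs lib_init() during
     * _dl_init, before its own main(), so an infinite loop here
     * hangs even an empty program. */
    #include <stdio.h>

    __attribute__((constructor))
    static void lib_init(void)
    {
        fprintf(stderr, "constructor runs before main()\n");
        /* for (;;) ;   uncomment to reproduce a hang-before-main */
    }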

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/13

------------------------------------------------------------------------
On 2015-05-18T18:13:07+00:00 Richard wrote:

Created attachment 1026798
fix-cycles-arch.patch

BLEAH UPSTREAM FIRST *!#$

Turns out that Ubuntu fixed this and are carrying a downstream patch
for it, which I have attached.

Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/14

------------------------------------------------------------------------
On 2015-05-18T18:32:07+00:00 Boris wrote:

Actually, that patch got half-merged upstream (the first half) and that
is why you were seeing the hang-ups. I'll apply the second half of the
patch and rebuild the packages.
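
Inferring only from the patch name (fix-cycles-arch.patch) and the
discussion above, the shape of the fix is to skip calibration entirely
on architectures without a usable cycle counter, rather than spinning
until the estimates converge.  A hedged sketch of that shape, not the
actual patch:

    /* Sketch only; the real fix-cycles-arch.patch is attached above
     * and differs in detail. */
    static double cycles_per_sec = 0;

    static void cycles_init(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        /* ... run the gettimeofday()/rdtsc calibration loop ... */
    #else
        cycles_per_sec = 1e9;   /* assumed fallback; no loop, no hang */
    #endif
    }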

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/15

------------------------------------------------------------------------
On 2015-05-18T22:01:38+00:00 Fedora wrote:

ceph-0.94.1-1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/FEDORA-2015-8189/ceph-0.94.1-1.fc22

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/16

------------------------------------------------------------------------
On 2015-05-18T22:32:05+00:00 Boris wrote:

BTW, note that it is not fixed in ceph-0.94.1-1.fc22; it is (or
should be) fixed in ceph-0.94.1-2.fc22 and ceph-0.94.1-2.fc23, which
were just recently built in koji:

http://koji.fedoraproject.org/koji/taskinfo?taskID=9777655
http://koji.fedoraproject.org/koji/taskinfo?taskID=9777671

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/17

------------------------------------------------------------------------
On 2015-05-19T07:29:19+00:00 Richard wrote:

Can confirm that this is fixed in ceph-0.94.1-2.fc23 on armv7hl.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/18

------------------------------------------------------------------------
On 2015-05-20T02:54:10+00:00 Fedora wrote:

Package ceph-0.94.1-2.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.94.1-2.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-8189/ceph-0.94.1-2.fc22
then log in and leave karma (feedback).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/19

------------------------------------------------------------------------
On 2015-05-20T14:24:49+00:00 Cole wrote:

FWIW qemu built against this package still seems to hang:

http://koji.fedoraproject.org/koji/taskinfo?taskID=9780467

But I'm trying another build now.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/20

------------------------------------------------------------------------
On 2015-05-22T02:31:20+00:00 Fedora wrote:

Package ceph-0.94.1-3.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.94.1-3.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-8189/ceph-0.94.1-3.fc22
then log in and leave karma (feedback).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/21

------------------------------------------------------------------------
On 2015-05-29T20:32:28+00:00 Richard wrote:

This has regressed somehow, so it's started to happen again.

https://kojipkgs.fedoraproject.org//work/tasks/1496/9871496/build.log

The build log shows that it is using:

DEBUG util.py:388:   librados2                   armv7hl 1:0.94.1-3.fc23 build 1.5 M

https://kojipkgs.fedoraproject.org//work/tasks/1496/9871496/root.log

I don't understand exactly why, since the patch exists and
is still being applied.  Perhaps the patch is not a complete
fix for the problem.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/22

------------------------------------------------------------------------
On 2015-05-29T21:15:11+00:00 Richard wrote:

I can reproduce this on bare-metal armv7hl with
librados2-0.94.1-3.fc23.armv7hl
qemu-system-arm-2.3.0-5.fc23.armv7hl
and (as before) simply running 'qemu-system-arm -help'.

Unfortunately gdb doesn't let me get a stack trace.  It simply
ignores ^C.

I was able to use strace, and the process is doing this over and
over again:

gettimeofday({1432934064, 725311}, NULL) = 0
nanosleep({0, 2000001}, NULL)           = 0
gettimeofday({1432934064, 728002}, NULL) = 0
nanosleep({0, 2000001}, NULL)           = 0

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/23

------------------------------------------------------------------------
On 2015-05-29T21:35:13+00:00 Richard wrote:

Created attachment 1032285
strace qemu-system-arm -help

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/24

------------------------------------------------------------------------
On 2015-06-01T02:45:40+00:00 Boris wrote:

Just a few thoughts:

Ceph itself should not be running the loop now. Is there any chance
that qemu is doing that instead?

Can you still reproduce on bare metal with the simple reproducer that
you mentioned in Comment 10?

What about the older librados2-0.94.1-2.fc23.armv7hl? The only change
between -2 and -3 should be that lttng was disabled but one never knows.

I'm trying to get my hands on an armhfp machine to debug further myself
but that may take some time.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/25

------------------------------------------------------------------------
On 2015-06-01T07:25:04+00:00 Boris wrote:

I'll answer the questions myself as I've finally got my hands on an
armhfp machine where I could reproduce.

The steps in Comment 10 do not reproduce the issue, so that bug is
indeed fixed. I've attached to the running process
(gdb -p $(pidof qemu-system-arm)) and didn't see anything ceph
(rbd/rados) related in the backtrace.

The backtrace leading to nanosleep was

#0  0xffffffff in nanosleep () at /lib/libpthread.so.0
#1  0xffffffff in base::internal::SpinLockDelay(int volatile*, int, int) () at /lib/libtcmalloc.so.4
#2  0xffffffff in SpinLock::SlowLock() () at /lib/libtcmalloc.so.4
#3  0xffffffff in tcmalloc::ThreadCache::InitModule() () at /lib/libtcmalloc.so.4
#4  0xffffffff in tc_malloc () at /lib/libtcmalloc.so.4
#5  0xffffffff in __fopen_internal () at /lib/libc.so.6
#6  0xffffffff in load_debug_frame () at /lib/libunwind.so.8
#7  0xffffffff in locate_debug_info () at /lib/libunwind.so.8
#8  0xffffffff in _ULarm_dwarf_find_debug_frame () at /lib/libunwind.so.8
#9  0xffffffff in _ULarm_dwarf_callback () at /lib/libunwind.so.8
#10 0xffffffff in dl_iterate_phdr () at /lib/libc.so.6
#11 0xffffffff in _ULarm_find_proc_info () at /lib/libunwind.so.8
#12 0xffffffff in fetch_proc_info () at /lib/libunwind.so.8
#13 0xffffffff in _ULarm_dwarf_find_save_locs () at /lib/libunwind.so.8
#14 0xffffffff in _ULarm_dwarf_step () at /lib/libunwind.so.8
#15 0xffffffff in _ULarm_step () at /lib/libunwind.so.8
#16 0xffffffff in GetStackTrace_libunwind(void**, int, int) () at /lib/libtcmalloc.so.4
#17 0xffffffff in GetStackTrace(void**, int, int) () at /lib/libtcmalloc.so.4
#18 0xffffffff in tcmalloc::PageHeap::GrowHeap(unsigned int) () at /lib/libtcmalloc.so.4
#19 0xffffffff in tcmalloc::PageHeap::New(unsigned int) () at /lib/libtcmalloc.so.4
#20 0xffffffff in tcmalloc::CentralFreeList::Populate() () at /lib/libtcmalloc.so.4
#21 0xffffffff in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () at /lib/libtcmalloc.so.4
#22 0xffffffff in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () at /lib/libtcmalloc.so.4
#23 0xffffffff in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, unsigned int) () at /lib/libtcmalloc.so.4
#24 0xffffffff in tc_malloc () at /lib/libtcmalloc.so.4
#25 0xffffffff in strdup () at /lib/libc.so.6
#26 0xffffffff in verify_selinuxmnt () at /lib/libselinux.so.1
#27 0xffffffff in init_lib () at /lib/libselinux.so.1
#28 0xffffffff in call_init.part () at /lib/ld-linux-armhf.so.3
#29 0xffffffff in _dl_init_internal () at /lib/ld-linux-armhf.so.3
#30 0xffffffff in _dl_start_user () at /lib/ld-linux-armhf.so.3

The suspects seemed to be tcmalloc (gperftools) or libunwind. While both
of these are used by ceph, neither of them is used by the client
libraries (rbd/rados) that qemu links to.
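
Reading that backtrace bottom-up: the outer tc_malloc (frame 24) takes
tcmalloc's heap spinlock, GrowHeap asks libunwind for a stack trace,
libunwind opens a debug-info file via fopen(), and fopen's internal
allocation re-enters tc_malloc (frame 4), which then spins in
SpinLockDelay on the lock already held further up the stack.  That is
the repeating nanosleep()/gettimeofday() pattern in the strace.  A
minimal sketch of this self-deadlock on a non-reentrant spinlock
(illustrative, not tcmalloc's actual code):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <time.h>

    static atomic_flag heap_lock = ATOMIC_FLAG_INIT;

    /* Non-reentrant spinlock: a thread that already holds the lock
     * and takes it again sleeps ~2ms per iteration forever, like the
     * repeating nanosleep({0, 2000001}) seen in the strace. */
    static void spin_lock(void)
    {
        while (atomic_flag_test_and_set(&heap_lock)) {
            struct timespec backoff = { 0, 2000000 };
            nanosleep(&backoff, NULL);
        }
    }

    static void record_stack_trace(void);

    void *toy_malloc(size_t n)
    {
        (void)n;
        spin_lock();
        record_stack_trace();            /* allocates under the lock */
        atomic_flag_clear(&heap_lock);   /* never reached */
        return NULL;
    }

    static void record_stack_trace(void)
    {
        /* stands in for libunwind's fopen() of a debug-info file,
         * whose internal malloc() re-enters the allocator while the
         * lock is still held */
        toy_malloc(64);
    }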

I've done some more tests and it seems that tcmalloc (gperftools) is
actually the one causing the hang-ups. Similarly to Comment 10, I did

$ echo 'main(){}' > test.c
$ gcc test.c -ltcmalloc -o test
$ ./test

and the program hung indefinitely.

I'm moving this bz back to ON_QA (as there actually was a bug in ceph
code that was fixed in the -2 release) and I'll clone this bz for the
tcmalloc issue.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/26

------------------------------------------------------------------------
On 2015-06-01T08:40:12+00:00 Richard wrote:

Thanks - yes I believe you are correct.  Quite annoying that two
difficult bugs look coincidentally so similar :-(

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/27

------------------------------------------------------------------------
On 2015-06-04T18:14:26+00:00 Ken wrote:

The patch was merged to Ceph master upstream in
35c5fd0091fc4d63b21207fb94e46b343519fd56
(https://github.com/ceph/ceph/pull/4839)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/28

------------------------------------------------------------------------
On 2015-06-04T20:19:45+00:00 Fedora wrote:

ceph-0.94.1-3.fc22 has been pushed to the Fedora 22 stable repository.
If problems still persist, please make note of it in this bug report.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/comments/30


** Changed in: ceph (Fedora)
       Status: Unknown => Fix Released

** Changed in: ceph (Fedora)
   Importance: Unknown => Undecided

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1432786

Title:
  upgrade of ceph in vivid breaks qemu-system-arm

Status in ceph package in Ubuntu:
  Fix Released
Status in libguestfs package in Ubuntu:
  Invalid
Status in ceph source package in Vivid:
  Fix Released
Status in libguestfs source package in Vivid:
  Invalid
Status in ceph package in Fedora:
  Fix Released

Bug description:
  Seen in the libguestfs build logs: qemu-system-arm -help hangs
  indefinitely and the build has to be killed.

  <doko> Program received signal SIGINT, Interrupt.
  <doko> 0xb4121d46 in gettimeofday () at ../sysdeps/unix/syscall-template.S:81
  <doko> 81      ../sysdeps/unix/syscall-template.S: No such file or directory.
  <doko> (gdb) bt
  <doko> #0  0xb4121d46 in gettimeofday () at ../sysdeps/unix/syscall-template.S:81
  <doko> #1  0xb4a40772 in ?? () from /usr/lib/arm-linux-gnueabihf/librados.so.2
  <doko> #2  0xb6fe2216 in call_init (l=<optimized out>, argc=2, argv=0xbefff734, env=0xbefff740) at dl-init.c:78
  <doko> #3  0xb6fe22a8 in _dl_init (main_map=0xb6fff958, argc=2, argv=0xbefff734, env=0xbefff740) at dl-init.c:126
  <doko> #4  0xb6fd7bf2 in _dl_start_user () from /lib/ld-linux-armhf.so.3
  <doko> Backtrace stopped: previous frame identical to this frame (corrupt stack?)

  Checked older toolchain versions and qemu builds, all with the same
  result.

  Downgrading librados2 and librbd1 to 0.87-0ubuntu5 lets the qemu
  call succeed.

  Setting the priority to, assuming we want a working qemu in vivid.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1432786/+subscriptions


