[Bug 1641241] [NEW] TensorFlow application crashes after glibc upgrade from 14.04 to 16.04 version

bugproxy bugproxy at us.ibm.com
Fri Nov 11 22:50:14 UTC 2016


Public bug reported:

== Comment: #0 - Brian Hart - 2016-10-25 11:38:18 ==
---Problem Description---
TensorFlow application crashes after glibc upgrade from 14.04 to 16.04 version

My team is building and running Google's TensorFlow deep learning
framework application.  We've observed that TensorFlow (v 0.9.0, 0.10.0,
and latest master) works on Ubuntu 14.04, but crashes on Ubuntu 16.04.

We've done various isolation experiments and have found that the same
application binary that runs on 14.04 can be made to fail by updating
just the libc packages on the 14.04 system.  So we're suspecting either
some regression in glibc, or perhaps some incorrect application code
that worked by accident under the older glibc and now fails.

The failure takes various forms, usually one of:

a) crash due to gcc's "stack smashing" detection
b) segfault due to dereferencing a pointer that was damaged while it was on the stack
c) python type exception due to mismatched list sizes
 
---uname output---
Linux p10a102 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:45:41 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
 
---Additional Hardware Info---
Application uses NVIDIA GPU. Problem is seen with any of: PCI-attached K80 and M40, NVLINK-attached P100 
 
Machine Type = Minsky 
  
---Steps to Reproduce---
 Train an inception network on the ILSVRC2012 dataset using TensorFlow.

Inception network model:

$ git clone https://github.com/tensorflow/models.git
$ cd models/inception
$ bazel build inception/imagenet_train
$ bazel-bin/inception/imagenet_train --max_steps=1000 --num_gpus=1 --train_dir=<path_to_output_dir> --data_dir=<path_to_ILSVRC_dataset>

 
Userspace tool common name: TensorFlow 
 
The userspace tool has the following bit modes: 64-bit 

Userspace tool obtained from project website:  TensorFlow v0.9.0, built for example from: https://github.com/ibmsoe/tensorflow/tree/v0.9.0-ppc 
 

== Comment: #4 - Brian Hart  - 2016-10-25 16:05:03 ==
Tulio,

Thank you; to respond to your points...

[W]ere you able to restrict to a smaller code sample?

=> Not yet.

Are you able to run the same training without GPU? i.e. using just the POWER
processor.

=> This is on our list of tests to run, but we haven't tried it yet.

Have you run this code on valgrind's memcheck tool?

=> Running this now. So far at startup, I get several complaints about
read-after-free on the part of python 2.7--mostly in things like
python's realloc and GC code paths. I'm not going to worry about those
at the moment.
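
(For reference, a minimal sketch of the kind of memcheck invocation in use
here; the suppressions file is the valgrind-python.supp shipped in the
CPython source tree, its path here is illustrative, and a smaller
--max_steps keeps the valgrind run tractable:)

$ valgrind --tool=memcheck --track-origins=yes --num-callers=20 \
      --suppressions=valgrind-python.supp --log-file=memcheck.%p.log \
      bazel-bin/inception/imagenet_train --max_steps=100 --num_gpus=1 \
      --train_dir=<path_to_output_dir> --data_dir=<path_to_ILSVRC_dataset>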

Could you generate a coredump? e.g. running 'ulimit -c unlimited' before
training the network.

=> We'll try.
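
(A sketch of the plan for capturing the core, assuming the training process
is the Python interpreter; the core_pattern step and the file names are
illustrative:)

$ ulimit -c unlimited
$ echo 'core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
$ bazel-bin/inception/imagenet_train --max_steps=1000 --num_gpus=1 \
      --train_dir=<path_to_output_dir> --data_dir=<path_to_ILSVRC_dataset>
$ gdb `which python` core.<exe>.<pid>
(gdb) bt
(gdb) info registers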

I wouldn't expect glibc from Ubuntu 16.04 to run on Ubuntu 14.04.
Is this issue appearing on an Ubuntu 16.04 install?

=> Yes, the crash happens with a straight 16.04 install. But we have also
been able to reproduce it on 14.04 by installing just the newer libc
packages (a constellation of about 6 packages, covering several
libraries--libc, libm, pthreads, etc.; a sketch of the swap follows the
table). We find:

14.04 w/ glibc 2.19 - no crash
14.04 w/ glibc 2.21 (0ubuntu4) - no crash
14.04 w/ glibc 2.21 (0ubuntu4.3) - no crash
14.04 w/ glibc 2.23 - crashes
14.04 w/ glibc 2.24 - crashes
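
(The swap itself is roughly the following; the package list and version
strings are placeholders here, the real set being the ~6 libc packages
mentioned above:)

$ apt-cache policy libc6        # which glibc build is currently installed
$ sudo dpkg -i libc6_<ver>_ppc64le.deb libc-bin_<ver>_ppc64le.deb \
      libc6-dev_<ver>_ppc64le.deb libc-dev-bin_<ver>_ppc64le.deb
$ ldd --version                 # confirm the runtime glibc before rerunning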

Were there any problematic changes between 2.21 and 2.23?

== Comment: #5 - Brian Hart  - 2016-10-25 16:47:13 ==
Just got a core file from a crash. Moving it (~11GB) to a system where I can set up access.

The crash scenario was kind of interesting...

We crashed dereferencing the stack pointer:

(gdb) x/4i $pc-8
   0x3fff7f4420a8 <_ZN10tensorflow15OpKernelContext15allocate_outputEiRKNS_11TensorShapeEPPNS_6TensorENS_19AllocatorAttributesE+312>:   add     r30,r30,r28
   0x3fff7f4420ac <_ZN10tensorflow15OpKernelContext15allocate_outputEiRKNS_11TensorShapeEPPNS_6TensorENS_19AllocatorAttributesE+316>:   mr      r3,r29
=> 0x3fff7f4420b0 <_ZN10tensorflow15OpKernelContext15allocate_outputEiRKNS_11TensorShapeEPPNS_6TensorENS_19AllocatorAttributesE+320>:   ld      r0,16(r1)
   0x3fff7f4420b4 <_ZN10tensorflow15OpKernelContext15allocate_outputEiRKNS_11TensorShapeEPPNS_6TensorENS_19AllocatorAttributesE+324>:   ld      r9,8(r30)


r1 currently holds a bad value. But it looks like the stack pointer value
would be valid except that the high 16 bits have been changed from 0x0000 to
0x0001:

(gdb) info registers r1
r1             0x13bffa77fd4d0


If we drop the 0x0001 we're left with a pointer to a sane-looking stack
frame, with a saved LR that would put us in at least a nearby routine:

(gdb) x/8g 0x3bffa77fd4d0
0x3bffa77fd4d0: 0x00003bffa77fd500      0x00003bffa77fd5a8
0x3bffa77fd4e0: 0x00003fff7f442148      0x00003fff8931b428
0x3bffa77fd4f0: 0x00003bffa77fd928      0x00003bffa77fd500
0x3bffa77fd500: 0x00003bffa77fd650      0x00003fff00002200

(gdb) x/i 0x00003fff7f442148-4
   0x3fff7f442144 <_ZN10tensorflow15OpKernelContext15allocate_outputEiRKNS_11TensorShapeEPPNS_6TensorE+52>:     bl      0x3fff7de07940
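
(For reference, the masking and frame inspection can be done directly with
gdb expressions; this assumes only the high 16 bits of r1 are corrupted,
and that 16(r1) is the LR save slot per the ELFv2 ABI:)

(gdb) set $sp_fixed = (unsigned long)$r1 & 0x0000ffffffffffff
(gdb) x/8gx $sp_fixed
(gdb) x/i *(unsigned long *)($sp_fixed + 16) - 4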


In another case we analyzed, we segfaulted because we dereferenced via r31,
which had been damaged in the same way (high 16 bits changed from 0x0000 to
0x0001). In that case, r31 was an alias for the stack pointer (because we built
with "-fstack-protector-all") and the r31 value had apparently been damaged
_while sitting on the stack_ during a call to glibc's free(). (The caller had
dereferenced r31 just before the call to free(), so it was fine then, and we
could see that the damaged value was still present in the now-defunct free()
stack frame when free()'s caller ultimately dereferenced r31.)
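
(If we can reproduce this under a live debugger, one way to catch the stray
write in the act would be a hardware watchpoint on the save slot that gets
damaged; a sketch, with the slot address left as a placeholder since it
changes from run to run and hardware watchpoints are scarce:)

(gdb) watch -location *(unsigned long *) <addr_of_r31_save_slot>
(gdb) continue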


The current case is probably more interesting because the windows in which r1
would be lying around in memory to be damaged should be narrower than for
other registers: basically only while the thread is switched out.


A stray write by any other thread in the process might be the cause of the
problem here. But then why do we only see it at recent glibc versions?  Were
there any changes to, say, the pthreads context switching code lately (does
pthreads even _do_ any context switching, or does it leave that all to the
kernel)?

== Comment: #6 - Brian Hart - 2016-10-25 20:13:33 ==
A couple of further comments...

Re: Valgrind - After the startup batch of complaints about python
itself, valgrind is silent while the app is running.  (And starting up
python under valgrind without the app generates a similar set of
complaints.)

We made the core file available to Tulio; sent access info out of band.

== Comment: #7 - Tulio Magno Quites Machado Filho - 2016-10-26 09:17:28 ==
(In reply to comment #5)
> A stray write by any other thread in the process might be the cause of the
> problem here. But then why do we only see it at recent glibc versions?

glibc 2.23 enabled -fstack-protector-strong by default.
So, there is a chance the problem was already there, but glibc 2.23 started to catch and report it.

For the record, there has been an issue reported to the tensorflow community: https://github.com/tensorflow/tensorflow/issues/3174
It was closed due to the lack of information.

> Were there any changes to, say, the pthreads context switching code lately (does
> pthreads even _do_ any context switching, or does it leave that all to the
> kernel)?

That's all kernel code.

== Comment: #8 - Brian Hart - 2016-10-26 12:29:00 ==
I'm not sure the stack protection setting change in glibc 2.23 can be the explanation here.

My understanding of the stack protection is that the compiler emits some
additional code to stash a sentinel value on the stack, and to verify
the sentinel value later (either just at the end of the routine, or
after every child function call in the case of stack-protector-all).
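
(That mechanism is easy to see with a throwaway compile; a sketch, with the
toy function purely illustrative, where the grep picks out the
__stack_chk_fail check the compiler inserts:)

$ echo 'void f(char *s) { char buf[16]; __builtin_strcpy(buf, s); }' | \
      gcc -xc - -fstack-protector-all -S -o - | grep stack_chk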

But we're seeing crashes that are sensitive to glibc levels, in a common
binary that was built with the older toolchain. So the application
binary doesn't have any additional stack protection checks in the 2.23
case compared to the 2.21 case.  The newer glibc itself might have been
built with the newer toolchain, but the stack protection hits we do see
are occurring in the app object rather than the glibc objects.
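
(A quick way to see whether a given shared object carries stack-protector
checks at all is to look for a reference to __stack_chk_fail among its
dynamic symbols; the path to the TensorFlow shared object below is a
placeholder:)

$ nm -D --undefined-only <path_to_tensorflow_shared_object> | grep __stack_chk_fail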

And several of the crashes we see aren't stack protector catches.


Thank you for the pointer to the TensorFlow bug report; we're looking at that to see if it contains anything that might help with problem isolation.

** Affects: glibc (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-147893 severity-high targetmilestone-inin---

** Tags added: architecture-ppc64le bugnameltc-147893 severity-high
targetmilestone-inin---

** Changed in: ubuntu
     Assignee: (unassigned) => Taco Screen team (taco-screen-team)

** Package changed: ubuntu => glibc (Ubuntu)

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1641241

Title:
  TensorFlow application crashes after glibc upgrade from 14.04 to 16.04
  version

Status in glibc package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1641241/+subscriptions



More information about the foundations-bugs mailing list