[Bug 1899800] Re: Runtime deadlock: pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing

Balint Reczey 1899800 at bugs.launchpad.net
Wed Dec 16 11:37:38 UTC 2020


** Description changed:

+ [Impact]
+ 
+ * Various multi-threaded applications using pthread_cond hang.
+ 
+ [Test Case]
+ 
+ * Run the reproducer attached to the upstream bug report (I used a
+ QEMU-emulated 8-core machine on a 4-core host):
+ 
+   wget https://sourceware.org/bugzilla/attachment.cgi?id=12480 -O repro-lp1899800.c
+   gcc -pthread repro-lp1899800.c
+   ./a.out 16
+ 
+ Total Threads Count; 16
+ RefereeThread - (null) started
+ LoopCriticalSectionThread - 1 started
+ ...
+ LoopCriticalSectionThread - 16 started
+ Monitor - g_counter 411380000, loop_round 3024, threads_finished 13
+ ...
+ Monitor - g_counter 1920301632, loop_round 266764, threads_finished 0
+ Monitor - g_counter -1851097664, loop_round 270614, threads_finished 13
+ Monitor - g_counter -1227241664, loop_round 275201, threads_finished 14
+ Monitor - g_counter -337385664, loop_round 281744, threads_finished 0
+ Monitor - g_counter 519822336, loop_round 288047, threads_finished 16
+ Monitor - g_counter 1401918336, loop_round 294533, threads_finished 0
+ Monitor - g_counter -1993136960, loop_round 301150, threads_finished 16
+ Monitor - g_counter -1140185466, loop_round 307422, threads_finished 12
+ Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
+ Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
+ Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
+ Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
+ Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
+ ...
+ 
+    The lockup shows up as the same Monitor line repeating, as in the
+    last lines above.
+ 
+ * Observe the threads hanging within a few minutes with the unfixed
+ libc6 and running for hours without hanging with the fixed one.
+ 
+ [Where problems could occur]
+ 
+ * The fix, which in its one-line form is really a workaround, wakes up
+ all threads whenever there is a chance of hitting the deadlock. This
+ adds a small, infrequent overhead whose exact size is unknown.
+ 
+ [Original Bug Text]
+ 
  This bug was submitted by Qin Li to glibc bugzilla earlier this year,
  with a one-line patch, though it hasn't been merged into glibc yet:
  
  https://sourceware.org/bugzilla/show_bug.cgi?id=25847
  
  This bug in pthread condition variables can deadlock the OCaml runtime,
  as well as Python's runtime and .NET.
  
  The bug was introduced in glibc 2.27, so affects Ubuntu 18.04 onwards.
  I confirm my OCaml app, as well as the repro from the bugzilla,
  deadlocks on Ubuntu 20.04 and Ubuntu 18.04.  To further strengthen the
  case that this is because of a bug in glibc, my app and the repro do not
  deadlock on Ubuntu 16.04.
  
  To rule out kernel issues, I further confirm that no deadlock happens
  when I copy Ubuntu 16.04's libc to 18.04 and redirect the dynamic linker
  so my app loads the earlier libc.
  
  I confirm that the one-line patch (available at the above bugzilla)
  applies cleanly on top of:
  
  * glibc-2.31-0ubuntu9.1 (Ubuntu 20.04 latest)
  * glibc-2.28-10 (Debian Buster/10 latest)
  * glibc-2.27-3ubuntu1.2 (Ubuntu 18.04 latest)
  
  I confirm that the one-line patch to glibc cures the deadlock issue in
  my OCaml apps.
  
  With the patch applied on Ubuntu 20.04, I have not been able to get the
  repro to deadlock in 5 days.  My OCaml apps have not deadlocked in 5
  days either.
  
  On Debian Buster/10 with the patch, the repro has not deadlocked in
  about 5 days.  This is my desktop box, and normal applications such as
  the GNOME environment otherwise work as usual.
  
  On Ubuntu 18.04 with the patch, the repro still deadlocks, but only
  after about 24-48 hours.  Prior to patching glibc, it would take only a
  few hours.  However, I have not seen my OCaml apps deadlock since
  applying this patch.
  
  On Ubuntu 16.04, I have never been able to get the repro to deadlock,
  and my OCaml apps never deadlocked on this platform.  This is expected,
  since this platform runs glibc 2.23, which predates the bug (the
  bugzilla report says it was introduced in 2.27).
  
  As for why 18.04 still deadlocks, I suspect another, unrelated pthread
  bug was introduced in glibc 2.27 and fixed by 2.28.  When applied to
  glibc 2.27, the one-line patch appears to reduce the deadlock frequency
  by about an order of magnitude.
  
  Please kindly consider merging the patch into Ubuntu glibc.
  
  More background about this bug, for the sake of future internet searchers:
  * https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1899800

Title:
  Runtime deadlock: pthread_cond_signal failed to wake up
  pthread_cond_wait due to a bug in undoing stealing

Status in glibc package in Ubuntu:
  Fix Released
Status in glibc source package in Bionic:
  New
Status in glibc source package in Focal:
  New
Status in glibc source package in Groovy:
  New

Bug description:
  [Impact]

  * Various multi-threaded applications using pthread_cond hang.

  [Test Case]

  * Run the reproducer attached to the upstream bug report (I used a
  QEMU-emulated 8-core machine on a 4-core host):

    wget https://sourceware.org/bugzilla/attachment.cgi?id=12480 -O repro-lp1899800.c
    gcc -pthread repro-lp1899800.c
    ./a.out 16

  Total Threads Count; 16
  RefereeThread - (null) started
  LoopCriticalSectionThread - 1 started
  ...
  LoopCriticalSectionThread - 16 started
  Monitor - g_counter 411380000, loop_round 3024, threads_finished 13
  ...
  Monitor - g_counter 1920301632, loop_round 266764, threads_finished 0
  Monitor - g_counter -1851097664, loop_round 270614, threads_finished 13
  Monitor - g_counter -1227241664, loop_round 275201, threads_finished 14
  Monitor - g_counter -337385664, loop_round 281744, threads_finished 0
  Monitor - g_counter 519822336, loop_round 288047, threads_finished 16
  Monitor - g_counter 1401918336, loop_round 294533, threads_finished 0
  Monitor - g_counter -1993136960, loop_round 301150, threads_finished 16
  Monitor - g_counter -1140185466, loop_round 307422, threads_finished 12
  Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
  Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
  Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
  Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
  Monitor - g_counter -1063307960, loop_round 307987, threads_finished 15
  ...

     The lockup shows up as the same Monitor line repeating, as in the
     last lines above.

  * Observe the threads hanging within a few minutes with the unfixed
  libc6 and running for hours without hanging with the fixed one.

  [Where problems could occur]

  * The fix, which in its one-line form is really a workaround, wakes up
  all threads whenever there is a chance of hitting the deadlock. This
  adds a small, infrequent overhead whose exact size is unknown.

  [Original Bug Text]

  This bug was submitted by Qin Li to glibc bugzilla earlier this year,
  with a one-line patch, though it hasn't been merged into glibc yet:

  https://sourceware.org/bugzilla/show_bug.cgi?id=25847

  This bug in pthread condition variables can deadlock the OCaml
  runtime, as well as Python's runtime and .NET.

  The bug was introduced in glibc 2.27, so affects Ubuntu 18.04 onwards.
  I confirm my OCaml app, as well as the repro from the bugzilla,
  deadlocks on Ubuntu 20.04 and Ubuntu 18.04.  To further strengthen the
  case that this is because of a bug in glibc, my app and the repro do
  not deadlock on Ubuntu 16.04.

  To rule out kernel issues, I further confirm that no deadlock happens
  when I copy Ubuntu 16.04's libc to 18.04 and redirect the dynamic
  linker so my app loads the earlier libc.

  I confirm that the one-line patch (available at the above bugzilla)
  applies cleanly on top of:

  * glibc-2.31-0ubuntu9.1 (Ubuntu 20.04 latest)
  * glibc-2.28-10 (Debian Buster/10 latest)
  * glibc-2.27-3ubuntu1.2 (Ubuntu 18.04 latest)

  I confirm that the one-line patch to glibc cures the deadlock issue in
  my OCaml apps.

  With the patch applied on Ubuntu 20.04, I have not been able to get
  the repro to deadlock in 5 days.  My OCaml apps have not deadlocked in
  5 days either.

  On Debian Buster/10 with the patch, the repro has not deadlocked in
  about 5 days.  This is my desktop box, and normal applications such as
  the GNOME environment otherwise work as usual.

  On Ubuntu 18.04 with the patch, the repro still deadlocks, but only
  after about 24-48 hours.  Prior to patching glibc, it would take only
  a few hours.  However, I have not seen my OCaml apps deadlock since
  applying this patch.

  On Ubuntu 16.04, I have never been able to get the repro to deadlock,
  and my OCaml apps never deadlocked on this platform.  This is
  expected, since this platform runs glibc 2.23, which predates the bug
  (the bugzilla report says it was introduced in 2.27).

  As for why 18.04 still deadlocks, I suspect another, unrelated pthread
  bug was introduced in glibc 2.27 and fixed by 2.28.  When applied to
  glibc 2.27, the one-line patch appears to reduce the deadlock
  frequency by about an order of magnitude.

  Please kindly consider merging the patch into Ubuntu glibc.

  More background about this bug, for the sake of future internet searchers:
  * https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime



