GNOME Bugzilla – Bug 657891
spawn-multithreaded test hangs occasionally on recent Linux kernels/glibc
Last modified: 2011-09-16 21:23:09 UTC
I'm on Fedora 15. Nothing special here.

Occasionally spawn-multithreaded hangs in a rather fantastic way. Try this:

  ~/code/glib/gthread/tests$ while true; do ./spawn-multithreaded ; done
  /gthread/spawn-sync: OK
  /gthread/spawn-async: OK
  /gthread/spawn-sync: OK
  /gthread/spawn-async: OK
  /gthread/spawn-sync: OK

Eventually you'll get a

  /gthread/spawn-sync:

with no OK after it, or sometimes the same with async.

Looking at 'ps', you see:

  desrt 10546 0.1 0.0 3905000 2188 pts/8 Sl+ 23:05 0:00 /home/desrt/code/glib/gthread/tests/.libs/lt-spawn-multithreaded
  desrt 10722 99.3 0.1 4437740 11760 pts/8 R+ 23:05 2:03 /home/desrt/code/glib/gthread/tests/.libs/lt-spawn-multithreaded

with the running process eating 100% of CPU. The 4 gigs of virtual memory is pretty impressive too.

Attempting to attach gdb results in gdb growing to about a gig in size, and then starting to consume 100% CPU itself. No help there.

I thought that maybe strace would help, but when I run the test under strace, the hang doesn't seem to happen.
When it happens in the async case, you often also see a lot of this:

  24189 pts/8 Sl+ 0:00 /home/desrt/code/glib/gthread/tests/.libs/lt-spawn-multithreaded
  24562 pts/8 R+ 0:27 /home/desrt/code/glib/gthread/tests/.libs/lt-spawn-multithreaded
  24565 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24566 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24567 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24568 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24572 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24573 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24575 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24576 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24578 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24579 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24582 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
  24589 pts/8 Z+ 0:00 [test-spawn-echo] <defunct>
See also https://bugzilla.gnome.org/show_bug.cgi?id=652072#c17
Created attachment 195399
proof of libc/kernel bug

Here's a program written against pure pthreads that demonstrates the problem. It takes a lot longer to hang than the GLib version, so I'm guessing GLib makes some timing issues more favourable... but the bug is clearly here.

Compile with -pthread.
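The attachment itself is not reproduced in this thread. Purely as an illustration of the kind of program described (several threads that each fork, exec and reap a child in a loop), a minimal sketch might look like the following. The thread and iteration counts, the function names and the use of /bin/sh as the child are all assumptions made for this sketch, not details taken from the attachment.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sizes, chosen only to keep the sketch quick to run. */
#define N_THREADS 4
#define N_SPAWNS 50

/* Fork a child, exec a trivial command in it, and reap it.
 * Returns 0 if the child exited with status 0, -1 otherwise. */
static int spawn_once(void)
{
    /* Assumption: /bin/sh exists; the real test spawns its own helper. */
    char *const argv[] = { "sh", "-c", "exit 0", NULL };
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv("/bin/sh", argv);
        _exit(127);             /* only reached if execv() failed */
    }
    /* Reap the child, retrying if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Thread body: spawn N_SPAWNS children and count the failures. */
static void *worker(void *unused)
{
    intptr_t failures = 0;

    (void) unused;
    for (int i = 0; i < N_SPAWNS; i++)
        if (spawn_once() != 0)
            failures++;
    return (void *) failures;
}

/* Run N_THREADS workers concurrently; returns the total failure count. */
static long run_stress(void)
{
    pthread_t threads[N_THREADS];
    long failures = 0;
    int started = 0;

    while (started < N_THREADS &&
           pthread_create(&threads[started], NULL, worker, NULL) == 0)
        started++;

    for (int i = 0; i < started; i++) {
        void *result = NULL;
        pthread_join(threads[i], &result);
        failures += (long) (intptr_t) result;
    }
    /* Threads that could not be created are counted as failures too. */
    return failures + (N_THREADS - started);
}
```

Built with -pthread and called from a main() that is itself run in a shell loop, run_stress() should return 0 every time on a healthy system. On an affected glibc the expectation would be that some run eventually never returns, though this sketch has not been checked against one.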
This bug is present on 32-bit F15 as well as 64-bit.

The bug is present on F16 as well, with kernel 3.0.0-1.fc16.x86_64 and glibc-2.14.90-4.x86_64.

The bug is present on Ubuntu Oneiric alphas with kernel 3.0.0-9-generic and (e)glibc 2.13-17ubuntu2.
More:

 - using -static appears to have the side effect of solving the problem

 - replacing fork() with syscall (SYS_fork) appears to solve the problem

 - replacing the execv() with a direct exit(0) does not solve the problem but seems to change the frequency of the occurrence
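For anyone wanting to repeat the fork() versus syscall (SYS_fork) comparison, here is a minimal sketch of how the two can be swapped behind a function pointer. The helper names raw_fork and spawn_with are made up for this example, and /bin/sh as the child is an assumption. SYS_fork is not defined on every architecture, so the sketch falls back to plain fork() where it is missing or where the platform is not Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for the syscall() declaration */
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: issue the fork system call directly, bypassing
 * the libc fork() wrapper and whatever extra work it does. */
static pid_t raw_fork(void)
{
#if defined(__linux__) && defined(SYS_fork)
    return (pid_t) syscall(SYS_fork);
#else
    return fork();              /* no raw fork syscall available here */
#endif
}

/* Create a child with the given fork implementation, exec a trivial
 * command in it, and reap it. Returns 0 if the child exited with 0. */
static int spawn_with(pid_t (*do_fork)(void))
{
    /* Assumption: /bin/sh exists on the machine running this. */
    char *const argv[] = { "sh", "-c", "exit 0", NULL };
    int status = 0;
    pid_t pid = do_fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: only exec or _exit, nothing that relies on libc state. */
        execv("/bin/sh", argv);
        _exit(127);
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling spawn_with (fork) or spawn_with (raw_fork) from the test threads is then a one-word change. The raw variant skips the bookkeeping the libc wrapper does around the system call, so the child here goes straight to execv() or _exit().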
As per the glibc website, I filed bugs against the distributions:

 - https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/838975

After more than 2 weeks and a few IRC pokes, no love from the Ubuntu guys.

After getting bored of waiting, I filed a bug against Fedora too:

 - https://bugzilla.redhat.com/show_bug.cgi?id=737387

After a few days, it looks like there's a fixed package in F16.
The test is 100% fine with the updated glibc installed.