Bug 731554 – Handle EAGAIN from pthread_create() gracefully


Summary:	Handle EAGAIN from pthread_create() gracefully


Status:	RESOLVED WONTFIX

Product:	glib
Classification:	Platform
Component:	gthread
Version:	2.40.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-06-12 08:19 UTC by Milan Crha
Modified:	2014-07-09 09:19 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Milan Crha 2014-06-12 08:19:36 UTC

There are many bug reports filed by various projects, some here but more in bugzilla.redhat.com, about crashes in g_thread_new() with the error "Resource temporarily unavailable". This happens when pthread_create() returns EAGAIN. Since the application aborts even though the "error" is only temporary, it would be good to handle this in GLib in a more graceful way.

Downstream examples, the first with quite a few duplicates:
https://bugzilla.redhat.com/show_bug.cgi?id=995177
https://bugzilla.redhat.com/show_bug.cgi?id=1072154
https://bugzilla.redhat.com/show_bug.cgi?id=1108443
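
For illustration only, the failure mode described above can be sketched with plain POSIX threads rather than GLib. The helper name `try_spawn` and the `worker` function are hypothetical and not part of any library. The sketch only shows that pthread_create() reports the condition through its return value, so the caller can decide what to do instead of aborting.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Trivial thread body used only for this sketch. */
static void *worker(void *arg)
{
    return arg;
}

/* Hypothetical helper: returns 0 on success, or the error number
 * reported by pthread_create() (for example EAGAIN), without aborting. */
static int try_spawn(pthread_t *thread)
{
    /* pthread_create() returns the error code directly; it does not set errno. */
    int err = pthread_create(thread, NULL, worker, NULL);
    if (err != 0)
        fprintf(stderr, "thread creation failed: %s\n", strerror(err));
    return err;
}
```

A caller would check the return value of `try_spawn(&thread)` and, on success, eventually call pthread_join() on the thread.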

Comment 1 Emmanuele Bassi (:ebassi) 2014-06-12 08:43:24 UTC

I actually noticed that the Linux kernel/glibc on Fedora, starting from when the 3.14 kernel hit Fedora 20, has become way more EAGAIN-happy when it comes to thread creation and limits.

we do have g_thread_try_new(), but this does not protect existing code from changes in the layer underneath GLib; now that sessions have become fairly thread-intensive, as soon as a component aborts the whole session gets flaky. for instance, for me the first service to go away is gnome-keyring, which means that my SSH authentication agent dies and does not get replaced in any existing terminal session.

Comment 2 Dan Winship 2014-06-12 13:23:18 UTC

(In reply to comment #0)
> Since the application aborts even though the "error" is only temporary, it
> would be good to handle this in GLib in a more graceful way.

As I understand it, it's "temporary" only under the assumption that another thread will exit soon.

The pthread_create man page seems to imply that there is a single thread cap covering all of a user's processes, not a per-process limit like with EMFILE. So, if *any* process leaks threads, then all processes are doomed. (It's possible the man page is written poorly and isn't supposed to imply that, or else that the text predates NPTL and was never updated.) But if that's true, then part of the fix might be getting the default value for that increased, since user sessions are going to have a lot more threads in them these days than they used to.
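
As a rough way to inspect one of the limits under discussion, the sketch below reads the soft RLIMIT_NPROC value with getrlimit(). On Linux this limit is documented as counting threads for the real user ID of the calling process, but whether it is the limit actually being hit in these reports is an assumption, since other system-wide limits can also cause EAGAIN. The helper names are hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft RLIMIT_NPROC value for the
 * calling process. Returns 0 on success, -1 on failure. */
static int get_nproc_soft_limit(rlim_t *out)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NPROC, &lim) != 0)
        return -1;
    *out = lim.rlim_cur;
    return 0;
}

/* Print the limit in a human-readable form. */
static void print_nproc_limit(void)
{
    rlim_t soft;
    if (get_nproc_soft_limit(&soft) != 0)
        perror("getrlimit");
    else if (soft == RLIM_INFINITY)
        puts("RLIMIT_NPROC: unlimited");
    else
        printf("RLIMIT_NPROC soft limit: %llu\n", (unsigned long long)soft);
}
```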

Comment 3 Miloslav Trmac 2014-06-12 13:49:32 UTC

(In reply to comment #0)
> This happens when pthread_create() returns
> EAGAIN. Since the application aborts even though the "error" is only temporary

This may not be temporary; the limits may be explicitly (though mistakenly) set too low (https://bugzilla.redhat.com/show_bug.cgi?id=1072154#c11 ), making the failure consistent and deterministic.  So, for example, just looping and retrying might actually make the situation worse in some scenarios.

All I’d personally ask for is a clean propagation of the error condition to callers via GError (all the way up the caller stack; again, see https://bugzilla.redhat.com/show_bug.cgi?id=1072154 for a scenario where there are willing recipients of error information but it doesn’t get to them).
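
To make the retry concern concrete, here is a minimal sketch in plain POSIX C, not GLib code. It retries only a bounded number of times and then hands the error code back to the caller, rather than looping indefinitely or aborting. The names `spawn_bounded` and `noop`, and the 10 ms delay, are assumptions chosen for illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Trivial thread body used only for this sketch. */
static void *noop(void *arg)
{
    return arg;
}

/* Hypothetical helper: call pthread_create() at most max_attempts times,
 * sleeping briefly between attempts only when the error is EAGAIN.
 * Returns 0 on success, otherwise the last error code, which the caller
 * is expected to pass further up instead of aborting.
 * If max_attempts is zero or negative, no attempt is made and EAGAIN
 * is returned. */
static int spawn_bounded(pthread_t *thread, void *(*fn)(void *), void *arg,
                         int max_attempts)
{
    int err = EAGAIN;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        err = pthread_create(thread, NULL, fn, arg);
        if (err != EAGAIN)
            break;  /* success (0) or an error that retrying will not fix */
        if (attempt + 1 < max_attempts) {
            struct timespec delay = { 0, 10 * 1000 * 1000 };  /* 10 ms */
            nanosleep(&delay, NULL);
        }
    }
    return err;
}
```

If the limit is configured too low, every attempt fails the same way, so the bound only caps the time wasted before the error reaches the caller.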

Comment 4 Dan Winship 2014-06-12 16:11:32 UTC

(In reply to comment #3)
> All I’d personally ask for is a clean propagation of the error condition to
> callers via GError (all the way up the caller stack; again, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1072154 for a scenario where there
> are willing recipients of error information but it doesn’t get to them).

No, if the thread-creation failure got propagated up to giomodule, the effect would be that the gvfs backend would just fail to load, so if the program did continue running, it wouldn't have access to any remote filesystems. (And given that the failure occurred while creating a GDBus-internal singleton, nothing else using D-Bus would work either.) And we wouldn't have done anything to address the thread exhaustion, so the next attempt to create a new thread would fail too.

Comment 5 Allison Karlitskaya (desrt) 2014-06-13 17:22:48 UTC

This is ridiculous.  Unless this is an EINTR-style "please try again immediately" type of failure, we have no business getting in the middle of it.  We already have g_thread_try_new() for those who want to attempt to deal with these failures, but the fact is that most people don't want to use it, and with good reason.

As for GLib's internal uses, such as D-Bus, I really don't want to get into a situation where some weird non-deterministic temporary condition within the kernel leads to degraded application performance (as would be the case of reporting failure to connect to the bus).

This is a kernel problem, and it really needs to be fixed there.

Comment 6 Milan Crha 2014-06-16 07:07:27 UTC

(In reply to comment #5)
> This is a kernel problem, and it really needs to be fixed there.

You know "the problem" was introduced by GLib, not by the kernel, right? Evolution has always been a heavy user of threads. There can be many running at the same time, but not constantly. Well, until GLib began to use threads for whatever reason, such as for GTask, and basically anything async needs its own thread. What worries me most are the persistent threads. Evolution currently has these "foreign" persistent threads after start:
   dconf_gdbus_worker_thread ()
   gdbus_shared_thread_func ()
   glib_worker_main ()
(it has many more foreign threads, many from WebKitGTK3, but those are not directly related to GLib; if you are wondering, then my Evolution after start has 21 threads, including the main). I also tried gtk3-demo and it has only two threads after start, the main and gdbus_shared_thread_func().

Anyway, if I count these 3 threads for each GLib-based application (which also uses GSettings), then with 10 applications (or even D-Bus services, and quite a few D-Bus services are running these days) I have lost 30 threads that could be put to better use than GLib internals.

If I recall correctly, this situation began only around the time of 3.10.x, not before, so whatever GLib changed only raised resource requirements and worsened the user experience (to be fair, I don't recall ever seeing such a crash myself; I do not run that many applications either).

Comment 7 Emmanuele Bassi (:ebassi) 2014-06-24 13:11:43 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > This is a kernel problem, and it really needs to be fixed there.
> 
> You know "the problem" was introduced by GLib, not by the kernel, right?

"the problem" has been introduced by the fact that this is not 1997, and people use threads, and a modern OS should actually deal with that.

> Anyway, if I count these 3 threads for each GLib-based application (which also
> uses GSettings), then with 10 applications (or even D-Bus services, and quite
> a few D-Bus services are running these days) I have lost 30 threads that could
> be put to better use than GLib internals.

I sincerely doubt that, since it's not like GLib is creating threads for s**t and giggles. even if we reduced the number of long-running threads, applications are using them, and some applications do run for the whole length of the session, so we gain nothing by overcomplicating GLib (if it's at all possible).
 
> If I recall correctly, this situation began only around the time of 3.10.x,
> not before, so whatever GLib changed only raised resource requirements and
> worsened the user experience (to be fair, I don't recall ever seeing such a
> crash myself; I do not run that many applications either).

just run any web browser that separates each tab into its own process; once you get to around 60 tabs you'll start hitting EAGAIN from creating threads and processes pretty much everywhere.

Comment 8 Mikhail 2014-06-24 17:40:03 UTC

Agree with Milan and Emmanuele. Why not?

Comment 9 Mikhail 2014-07-09 09:19:47 UTC

(In reply to comment #5)
> This is ridiculous.  Unless this is an EINTR-style "please try again
> immediately" type of failure, we have no business getting in the middle of it. 
> We already have g_thread_try_new() for those who want to attempt to deal with
> these failures, but the fact is that most people don't want to use it, and with
> good reason.
> 
> As for GLib's internal uses, such as D-Bus, I really don't want to get into a
> situation where some weird non-deterministic temporary condition within the
> kernel leads to degraded application performance (as would be the case of
> reporting failure to connect to the bus).
> 
> This is a kernel problem, and it really needs to be fixed there.

Ryan, if you propose increasing the limits in the kernel, then why do we need any limits at all?
So the problem is not solved, and I think you should either reopen this bug report or write to the kernel developers asking them to remove this limit.