After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 660306 - gdkeventloop-quartz: don't die with an assertion if poll() fails
gdkeventloop-quartz: don't die with an assertion if poll() fails
Status: RESOLVED OBSOLETE
Product: gtk+
Classification: Platform
Component: Backend: Quartz
unspecified
Other Mac OS
: Normal normal
: ---
Assigned To: gtk-quartz maintainers
gtk-bugs
Depends on:
Blocks:
 
 
Reported: 2011-09-27 23:18 UTC by Owen Taylor
Modified: 2018-04-14 23:58 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
gdkeventloop-quartz: don't die with an assertion if poll() fails (1.87 KB, patch)
2011-09-27 23:18 UTC, Owen Taylor
none Details | Review

Description Owen Taylor 2011-09-27 23:18:12 UTC
This, combined with file description management fixed in bug 599664
generated the assertion failure described in:

https://mail.gnome.org/archives/gtk-devel-list/2010-February/msg00012.html
Comment 1 Owen Taylor 2011-09-27 23:18:14 UTC
Created attachment 197617 [details] [review]
gdkeventloop-quartz: don't die with an assertion if poll() fails

If the GLib poll function returns -1, that means that an error
occurred; confusing that with the -1 return from
select_thread_start_poll() was causing us to die with an assertion
failure when we tried to collect the results of the poll.
Comment 2 John Ralls 2011-11-11 19:10:34 UTC
You've created a "magic number" return code. You should use a #define to name it.
You also didn't change the documentation of the return values in the comment before select_thread_start_poll().

But ISTM that it would be better to document that select_thread_start_poll() can return -1 either because async polling was started or there was an error, and that callers should check errno to decide what to do -- and then do that in poll_func and run_loop_before_waiting (just change the condition to if (errno) from if (nready == -2)).

My other concern is that you're assuming poll() will set only EAGAIN or EINTR. There's no problem with ignoring EINTR, but we should count the times we get EAGAIN and crash if it doesn't self-heal after n (10?) tries so that the thread doesn't hang. We need to react drastically if poll() returns EFAULT or EINVAL: They indicate a serious problem with the program. What's more, an application might set its own polling function that sets other errno values that we can't anticipate. 

I suggest ignoring EINTR (as you do already), counting EAGAINS and bailing if it doesn't heal in some reasonable number of retries, and crashing (with a suitable g_critical() for anything else). (I suppose one could instead step through the fds and find the one that's causing the error and disconnect it, but that might be beyond the scope of this bug.)
Comment 3 Owen Taylor 2011-11-14 16:45:07 UTC
(In reply to comment #2)
> You've created a "magic number" return code. You should use a #define to name
> it.
> You also didn't change the documentation of the return values in the comment
> before select_thread_start_poll().
> 
> But ISTM that it would be better to document that select_thread_start_poll()
> can return -1 either because async polling was started or there was an error,
> and that callers should check errno to decide what to do -- and then do that in
> poll_func and run_loop_before_waiting (just change the condition to if (errno)
> from if (nready == -2)).

errno is only set when an error occurs. So if we wanted to use errno to distinguish two meanings of -1, then somewhere we'd have to clear errno, strtol() style. Not something I like at all.

> My other concern is that you're assuming poll() will set only EAGAIN or EINTR.
> There's no problem with ignoring EINTR, but we should count the times we get
> EAGAIN and crash if it doesn't self-heal after n (10?) tries so that the thread
> doesn't hang. We need to react drastically if poll() returns EFAULT or EINVAL:
> They indicate a serious problem with the program. What's more, an application
> might set its own polling function that sets other errno values that we can't
> anticipate. 
>

The code isn't assuming that poll() will only set EAGAIN or EINTR - if we don't have EINTR, then it warns - this is basically rather assuming that EAGAIN doesn't need to be handled; if EAGAIN was a realistic thing to retry on, we would want to do that silently, and not g_warning() producing new memory allocations, etc.

What the code is actually doing more than anything else is exactly matching what GLib does in g_poll() - if poll() fails for anything other than EINTR it warns, and the programmer will hopefully use that to debug their buggy program. We've been using the same logic in GLib since 2000, and EAGAIN has never been an issue, I'm not sure it's worth adding code for here.

> I suggest ignoring EINTR (as you do already), counting EAGAINS and bailing if
> it doesn't heal in some reasonable number of retries, and crashing (with a
> suitable g_critical() for anything else). (I suppose one could instead step
> through the fds and find the one that's causing the error and disconnect it,
> but that might be beyond the scope of this bug.)

I don't think there is any way we could do the back-mapping and figure out the responsible GSource - so by the next time that the poll function got called, we'd have the same offending FD, and get the problem again.

In terms of g_warning() vs. g_critical() - I don't really care ... I just copied what we're doing in GLib for level and text. g_critical() is arguably slightly more right for a programmer bug, but we tend to use g_warning() for everything but g_return_if_fail() and friends. (GTK+ has 8 uses of g_critical() and 671 uses of g_warning())
Comment 4 John Ralls 2011-11-14 23:14:33 UTC
You'd clear errno before calling select_thread_call_poll(). Yes, just like strtol(). If you don't like that you could add a GError argument to select_thread_call_poll() and test that, or an int* nready argument and leave the return value to signal the error. Any of those are better than mixing an error return in with an already overloaded return value.

The point of EAGAIN is that it's the only thing that might not hang the program if that's what the error is, because it might get better. If it doesn't, or if the error is EFAULT or EINVAL, the program will hang because nothing tells it to exit the runloop. That seems undesirable to me (crashes are much easier to diagnose than hangs) so since you're letting the runloop keep running I guessed that you were assuming the problem would fix itself. I guess that criticism is equally true of the way g_main_context_poll() works.
Comment 5 Michael Natterer 2012-04-14 15:41:49 UTC
Owen, I run into that assertion anyway, also with your patch applied,
but it happens unreliably. The scenarios are (all in GIMP):

- Open recent, invoked from the menu (so triggered by quartz)
- DND of a file onto the toolbox or the empty image window

The weird thing here is, if it worked once, it keeps working,
so only the very first open recent or drop would cause a crash
(if at all), if it didn't, things just work and I don't manage
to run into the assertion.

Note that the DND thing has been there forever, but the assertion
on open recent only happens since today, after I found a fix for
the hard crash in bug #674108.
Comment 6 Michael Natterer 2012-04-14 17:17:23 UTC
Replacing the g_assert()s by the following:

  if (ufds[i].fd != current_pollfds[i].fd ||
      ufds[i].events != current_pollfds[i].events)
    {
      g_warning ("ufds[] don't match current_pollfds[], bailing");
      SELECT_THREAD_UNLOCK ();
      return n_ready;
    }

Make things just work in the two cases described in the last
comment. I know this is evil and not a fix at all, but I didn't
see any side effects whatsoever.

I should probably have filed a separate bug about this, it's
not related to failed poll()...
Comment 7 Alex Markley 2015-02-19 02:48:00 UTC
Hey all! Is this bug at all related to the following error message?

	select_thread_collect_poll: assertion failed: (ufds[i].fd == current_pollfds[i].fd)

I have been getting this error message a lot recently running Pidgin and it is pretty annoying because it brings my whole chat client down.

In the past I have not had any issues running Pidgin in this configuration, but recently it has begun terminating with the above error message several times per day. However the issue seems to be erratic and I don't know how to trigger it.

My software stack:

	Mac OS X Yosemite 10.10.2
	MacPorts 2.3.3
	gtk2 @2.24.25_0+quartz
	pidgin @2.10.11_0+finch+quartz+spellcheck

Some searching has revealed some others complaining about this issue, but not very many and not very recently:

https://github.com/conformal/xombrero/issues/49#issuecomment-37491660
https://mail.gnome.org/archives/gtk-osx-users-list/2010-February/msg00021.html

Is there any way I can assist in troubleshooting this issue?
Comment 8 John Ralls 2015-02-19 04:08:12 UTC
(In reply to Alex Markley from comment #7)
> Hey all! Is this bug at all related to the following error message?
> 
> 	select_thread_collect_poll: assertion failed: (ufds[i].fd ==
> current_pollfds[i].fd)
> 

More likely it's bug 674108.
Comment 9 Matthias Clasen 2018-02-10 04:55:00 UTC
We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.
Comment 10 Matthias Clasen 2018-04-14 23:58:19 UTC
As announced a while ago, we are migrating to gitlab, and bugs that haven't seen activity in the last year or so will be not be migrated, but closed out in bugzilla.

If this bug is still relevant to you, you can open a new issue describing the symptoms and how to reproduce it with gtk 3.22.x or master in gitlab:

https://gitlab.gnome.org/GNOME/gtk/issues/new