After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 674108 - Hard crash due to wrong NSAutoreleasePool stacking
Hard crash due to wrong NSAutoreleasePool stacking
Status: RESOLVED OBSOLETE
Product: gtk+
Classification: Platform
Component: Backend: Quartz
2.24.x
Other Mac OS
: Normal critical
: ---
Assigned To: gtk-quartz maintainers
gtk-bugs
: 683207 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2012-04-14 15:28 UTC by Michael Natterer
Modified: 2018-04-15 00:13 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch which simplifies and fixes autorelease pool handling (3.41 KB, patch)
2012-04-14 15:28 UTC, Michael Natterer
none Details | Review
Crash report after using "Open Recent" in gimp. (36.85 KB, text/plain)
2012-05-07 11:33 UTC, Daniel Sabo
  Details
Leaks caused by moving the autorelease pool into poll_func (24.76 KB, text/plain)
2012-05-10 12:48 UTC, Daniel Sabo
  Details
Hack for poll_func recursion errors (1.38 KB, patch)
2012-05-11 03:47 UTC, Daniel Sabo
none Details | Review
Patch fixing the crash for me (2.80 KB, patch)
2012-06-23 21:36 UTC, Kristian Rietveld
committed Details | Review

Description Michael Natterer 2012-04-14 15:28:18 UTC
Created attachment 212050 [details] [review]
Patch which simplifies and fixes autorelease pool handling

Apparently, some cases of recursive loops are not handled
properly in gdkeventloop-quartz.c, I can reliably reproduce
a hard crash with totally useless stack trace of two frames,
ending in an objc memory function, when using "open recent"
in GIMP. Paul Davis reported the same in Ardour.

The attached patch is much less smart about refreshing the
autorelease pool and doesn't try to figure at which loop
levels to refresh the pool, it simply stacks the pools
by having one pool per poll_func(), which completely seems
to fix the crash.
Comment 1 Kristian Rietveld 2012-04-22 20:09:33 UTC
I am not sure yet.  The old code was refreshing an autorelease pool that stayed active during the event handling (and event handlers can do Cocoa calls).  This pool is setup when the Glib main loop is in control. This serves the purpose of supplying an autorelease pool that is usually supplied by the Cocoa main loop if I am not mistaken.

What is weird is that the code comment says this should happen in _prepare(), while it actually happens in _check().

To me, the patch appears to set up an auto release pool just for the Cocoa calls in poll_func and it does not supply a pool for when e.g. event handlers are run.


I need to think about this some more (the fact that I am not very familiar with this code does not help much ;).
Comment 2 Daniel Sabo 2012-05-07 11:33:21 UTC
Created attachment 213579 [details]
Crash report after using "Open Recent" in gimp.

I have (I assume) the same crash in gimp, but for me it produces a full stack trace in the crash report. Michael's patch does prevent the crash, but instead produces a lot of "just leaking" warnings.

(Using GTK 2.24.10 plus the patches from the gtk-osx-stable jhbuild module)
Comment 3 Michael Natterer 2012-05-09 20:09:38 UTC
What "just leaking" warnings do you mean? I get no warnings. And what
patches exactly?
Comment 4 Daniel Sabo 2012-05-10 12:48:49 UTC
Created attachment 213802 [details]
Leaks caused by moving the autorelease pool into poll_func

The patches are:
http://git.gnome.org/browse/gtk-osx/plain/patches/0004-Bug-571582-GtkSelection-implementation-for-quartz.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0006-Bug-658722-Drag-and-Drop-sometimes-stops-working.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0008-Implement-GtkDragSourceOwner-pasteboardChangedOwner.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/gtk+-Bug-655065-Better-Fix.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0001-Bug-670373-gtk2-modules-printing-cups-gtkprintbackendcups.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0002-gtk2-Extract-reasons-and-reasons_desc-arrays-to-file-leve.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0003-gtk2-Create-enum-PrinterStateLevel.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0004-gtk2-Extract-printer-setup-variables-into-a-struct.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0005-gtk2-Extract-function-cups_printer_handle_attribute.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0006-gtk2-Extract-Function-cups_create_printer.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0007-gtk2-Move-some-variable-declarations-into-the-scopes-in-w.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0008-Bug-670373-gtk2-modules-printing-cups-gtkprintbackendcups.patch
http://git.gnome.org/browse/gtk-osx/plain/patches/0009-Bug-670373-gtk2-modules-printing-cups-gtkprintbackendcups.patch
0001-Add-extended-input-support-to-GTK-Quartz.patch -> http://bugzilla-attachments.gnome.org/attachment.cgi?id=213140
0001-quartz-move-the-autorelease-pool-into-poll_func.patch -> http://bugzilla-attachments.gnome.org/attachment.cgi?id=212050

Note the Gdk:ERROR at the end, I hadn't noticed it before but it seems fairly persistent.
Comment 5 Daniel Sabo 2012-05-10 12:51:51 UTC
After hacking at this some more, while the autorelease pool is part of the issue it's not the entire solution. If I use a counter to prevent it from draining when poll_func recurses (or just never drain the pool) it will stop the send message crash. However, it replaces it with this error:

Gdk:ERROR:gdkeventloop-quartz.c:564:select_thread_collect_poll: assertion failed: (ufds[i].fd == current_pollfds[i].fd)
gimp: terminated: Abort trap

I suspect this is because the Gimp menu action does some file operations and modifies ufds. It may even cause it to be deallocated, but I've only managed to get that error once in valgrind (and running gimp in valgrind is excruciatingly slow).

My conclusion would be that [NSApp nextEventMatchingMask:] can't be called safely between select_thread_start_poll and select_thread_collect_poll, but I'm not sure what the correct fix is.
Comment 6 John Ralls 2012-05-10 16:30:23 UTC
Several of those patches are a bit out of date and have been incorporated into master already. Perhaps you could do a build of gtk-2-24 from git (use modulesets-unstable if you're using gtk-osx) without any patches except the one Mitch proposed on this bug and work with that, just to make sure that we're all working from the same code.

Also, Apple's malloc has some built-in debugging features that I've found work better than valgrind on OSX. Take a look at http://developer.apple.com/library/mac/technotes/tn2124/_index.html#//apple_ref/doc/uid/DTS10003391-CH1-SECMALLOCDEBUG for a summary.
Comment 7 Michael Natterer 2012-05-10 17:16:21 UTC
Daniel, can you attach the patch that determines by counter when to
drain the pool? If it fixed the crash *and* the leaks, then it's
clearly preferrable over my patch.
Comment 8 Daniel Sabo 2012-05-11 03:47:53 UTC
Created attachment 213845 [details] [review]
Hack for poll_func recursion errors

Re: John
I re-checked things with a clean git master with no other patches applied, it produces the same leaks and (sometimes) the GTK ERROR crash.

Re: Michael
I didn't make a patch because it just moved the crash to the more intermittent GTK ERROR assertion, poll_func still couldn't safely recurse.

Changing this line:
if (current_loop_level == 0 && g_main_depth() == 0)
to this
if (current_loop_level == 0 && g_main_depth() == 0 && getting_events == 0)

is sufficient to fix the autorelease pool error but not the other crash. It also bothers me because g_main_depth() == 0 should already cover this condition.

For reference sake I've attached the nasty hack I have going to avoid both these errors, but it's not something I'd want to commit.
Comment 9 Michael Natterer 2012-05-11 06:23:27 UTC
Thanks. There is not only this case when the crash happens, e.g.
bug #673264 gets the crash but not the assertion. I think we should
commit your patch with a big fat comment that this entire file
is making some false assumption we have not figured yet. Will
take care of that.
Comment 10 Kristian Rietveld 2012-06-23 21:34:03 UTC
I've debugged this with gimp with commit 668891e7455ff9107901b004e7ce9d422041ef71 reverted. First, I've had to move the auto release pool "refresh" from the check to the dispatch function. This avoids the crash with the smashed stack. Also, for some reason, I think it is better suited to do this in dispatch.

With that change I get the above mentioned assert:

Gdk:ERROR:gdkeventloop-quartz.c:564:select_thread_collect_poll: assertion
failed: (ufds[i].fd == current_pollfds[i].fd)


It turns out this happens when in a recursive main loop a fd is added/removed. In that case, g_main_context_iterate() re-allocates the cached fds (cached_poll_array). Note that the ufds parameter to poll_func() points at this. So, by the time we return in the base level main loop and want to call the collect function, the ufds array might no longer be there.  Thus, recursive main loops spawned from nextEventMatchingMask: might invalidate ufds.

Will attach a patch.
Comment 11 Kristian Rietveld 2012-06-23 21:36:50 UTC
Created attachment 217096 [details] [review]
Patch fixing the crash for me

This patch moves the autorelease pool drain and introduces protection against the invalidated ufds. Basically, when we suspect ufds has been invalidated by a recursive main loop instance, we refrain from calling the collect function.

The patch is not fully correct yet:

 1) We still don't really have a good explanation why the autorelease pool drain needs to be moved.

 2) I get warnings like "(gimp-2.9:20008): GLib-WARNING **: poll(2) failed due to: Undefined error: 0." probably caused by a wrong return value when we refrain from running the collect function. Perhaps we should just immediately bail out of the poll function with a sensible return value instead of just skipping collect.

I have not tried this patch yet with the other GIMP segfault around select -> rounded rectangle.
Comment 12 Kristian Rietveld 2012-07-17 20:07:39 UTC
The patch unfortunately does not fix the problems around select -> rounded rectangle.
Comment 13 Kristian Rietveld 2012-07-17 20:37:15 UTC
To clarify, select -> rounded rectangle does not crash, but hangs blocking on a read (gimp wire read).  Perhaps poll missed the incoming data on the fd?  Will investigate.
Comment 14 Michael Natterer 2012-09-14 13:28:03 UTC
(In reply to comment #11)
> Created an attachment (id=217096) [details] [review]
> Patch fixing the crash for me

Applied to master and gtk-2-24. Leaving open because we still get
this warning from poll. All crashes seem fixed though.

commit 1ad25dfb812f53d1b288d0816b1d0d85c1d24d9f
Author: Michael Natterer <mitch@gimp.org>
Date:   Fri Sep 14 15:18:33 2012 +0200

    quartz: Bug 674108 - Hard crash due to wrong NSAutoreleasePool stacking
    
    Apply patch from Kristian Rietveld which addresses two issues
    in gdkeventloop-quartz.c:
    
    This patch moves the autorelease pool drain and introduces protection against
    the invalidated ufds. Basically, when we suspect ufds has been invalidated by a
    recursive main loop instance, we refrain from calling the collect function.
    (cherry picked from commit 79b3326eaab18b942bd7e03ae8d24544182cb3dd)

 gdk/quartz/gdkeventloop-quartz.c |   45 ++++++++++++++++++++++++-------------
 1 files changed, 29 insertions(+), 16 deletions(-)
Comment 15 Michael Natterer 2012-09-14 13:30:13 UTC
*** Bug 683207 has been marked as a duplicate of this bug. ***
Comment 16 Daniel Sabo 2012-09-15 16:38:24 UTC
This will still cause events for the outer loop to get lost when a recursion happens. There should at least be a visible g_warning so we know the corruption is happening, otherwise we're going to end up with very hard to track bugs whenever these calls get made (e.g. bug 683207 is going to become "one out of 100 times drag and drop doesn't open the file").
Comment 17 John Ralls 2013-06-11 15:20:21 UTC
(In reply to comment #16)
> e.g. bug 683207 is going to become "one out of
> 100 times drag and drop doesn't open the file").

That would be bug 658722.

Note also bug 701571, which reports anomalous behavior in gtkosx_application menus after this patch is applied.
Comment 18 John Ralls 2013-06-26 02:50:41 UTC
I've made a gtk2 version of the  patch on bug 701571: https://bugzilla.gnome.org/attachment.cgi?id=247787

It moves the autoreleasepool recycling to gdk_edit_prepare(), so the same pool is in effect for a whole cycle of the main loop. It also creates an outer pool so that the objects initialized before starting the main loop don't all get released in the first iteration.

Mitch, could you test this for your crash in the Gimp?
Comment 19 Kristian Rietveld 2013-07-03 20:29:32 UTC
The patch does release all objects properly contrary to the current code in master, but it does not solve the main problem we were seeing in gimp (this is only visible when reverting gimp commit 668891e7455ff9107901b004e7ce9d422041ef71).

The patch we committed did fix this problem, but that happened because the body of the if-statement was never executed -- the autorelease pool was never drained. This can be easily fixed by comparing g_main_depth() to 1 instead of 0, so that the pool is drained correctly. However, also with this change, it does no longer solve the problem we saw in gimp.

It turns out that the real problem is that the pool is currently drained when poll_func is active (and with NSApplication nextEventMatchingMask in the current stack trace). This indicates that something must be wrong with how recursive main loops are handled currently.


I am thinking about a fix for both of these issues. Hopefully these problems are then gone forever.
Comment 20 John Ralls 2013-07-03 23:20:48 UTC
Kris, I'm having trouble understanding how poll_func() can be active at the same time as the base gdk_event_prepare(). Can you post a stack trace showing that?
Comment 21 Alex Markley 2015-02-19 04:12:23 UTC
Not sure this is related, but I am also getting lots of assertion failed: (ufds[i].fd == current_pollfds[i].fd) error messages and abnormal program exits using Pidgin with the GTK Quartz backend. (I posted more details of my situation at https://bugzilla.gnome.org/show_bug.cgi?id=660306#c7 )

I'd be happy to help with troubleshooting / testing if it would be helpful.
Comment 22 Alex Markley 2015-02-22 06:20:54 UTC
By the way, this is the full error message that I'm seeing: 

Gdk:ERROR:gdkeventloop-quartz.c:562:select_thread_collect_poll: assertion failed: (ufds[i].fd == current_pollfds[i].fd)
Abort trap: 6
Comment 23 Matthias Clasen 2018-02-10 05:17:55 UTC
We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.
Comment 24 Matthias Clasen 2018-04-15 00:13:33 UTC
As announced a while ago, we are migrating to gitlab, and bugs that haven't seen activity in the last year or so will be not be migrated, but closed out in bugzilla.

If this bug is still relevant to you, you can open a new issue describing the symptoms and how to reproduce it with gtk 3.22.x or master in gitlab:

https://gitlab.gnome.org/GNOME/gtk/issues/new