GNOME Bugzilla – Bug 674108
Hard crash due to wrong NSAutoreleasePool stacking
Last modified: 2018-04-15 00:13:33 UTC
Created attachment 212050 [details] [review] Patch which simplifies and fixes autorelease pool handling Apparently, some cases of recursive loops are not handled properly in gdkeventloop-quartz.c, I can reliably reproduce a hard crash with totally useless stack trace of two frames, ending in an objc memory function, when using "open recent" in GIMP. Paul Davis reported the same in Ardour. The attached patch is much less smart about refreshing the autorelease pool and doesn't try to figure at which loop levels to refresh the pool, it simply stacks the pools by having one pool per poll_func(), which completely seems to fix the crash.
I am not sure yet. The old code was refreshing an autorelease pool that stayed active during the event handling (and event handlers can do Cocoa calls). This pool is setup when the Glib main loop is in control. This serves the purpose of supplying an autorelease pool that is usually supplied by the Cocoa main loop if I am not mistaken. What is weird is that the code comment says this should happen in _prepare(), while it actually happens in _check(). To me, the patch appears to set up an auto release pool just for the Cocoa calls in poll_func and it does not supply a pool for when e.g. event handlers are run. I need to think about this some more (the fact that I am not very familiar with this code does not help much ;).
Created attachment 213579 [details] Crash report after using "Open Recent" in gimp. I have (I assume) the same crash in gimp, but for me it produces a full stack trace in the crash report. Michael's patch does prevent the crash, but instead produces a lot of "just leaking" warnings. (Using GTK 2.24.10 plus the patches from the gtk-osx-stable jhbuild module)
What "just leaking" warnings do you mean? I get no warnings. And what patches exactly?
Created attachment 213802 [details] Leaks caused by moving the autorelease pool into poll_func The patches are: http://git.gnome.org/browse/gtk-osx/plain/patches/0004-Bug-571582-GtkSelection-implementation-for-quartz.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0006-Bug-658722-Drag-and-Drop-sometimes-stops-working.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0008-Implement-GtkDragSourceOwner-pasteboardChangedOwner.patch http://git.gnome.org/browse/gtk-osx/plain/patches/gtk+-Bug-655065-Better-Fix.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0001-Bug-670373-gtk2-modules-printing-cups-gtkprintbackendcups.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0002-gtk2-Extract-reasons-and-reasons_desc-arrays-to-file-leve.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0003-gtk2-Create-enum-PrinterStateLevel.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0004-gtk2-Extract-printer-setup-variables-into-a-struct.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0005-gtk2-Extract-function-cups_printer_handle_attribute.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0006-gtk2-Extract-Function-cups_create_printer.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0007-gtk2-Move-some-variable-declarations-into-the-scopes-in-w.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0008-Bug-670373-gtk2-modules-printing-cups-gtkprintbackendcups.patch http://git.gnome.org/browse/gtk-osx/plain/patches/0009-Bug-670373-gtk2-modules-printing-cups-gtkprintbackendcups.patch 0001-Add-extended-input-support-to-GTK-Quartz.patch -> http://bugzilla-attachments.gnome.org/attachment.cgi?id=213140 0001-quartz-move-the-autorelease-pool-into-poll_func.patch -> http://bugzilla-attachments.gnome.org/attachment.cgi?id=212050 Note the Gdk:ERROR at the end, I hadn't noticed it before but it seems fairly persistent.
After hacking at this some more, while the autorelease pool is part of the issue it's not the entire solution. If I use a counter to prevent it from draining when poll_func recurses (or just never drain the pool) it will stop the send message crash. However, it replaces it with this error: Gdk:ERROR:gdkeventloop-quartz.c:564:select_thread_collect_poll: assertion failed: (ufds[i].fd == current_pollfds[i].fd) gimp: terminated: Abort trap I suspect this is because the Gimp menu action does some file operations and modifies ufds. It may even cause it to be deallocated, but I've only managed to get that error once in valgrind (and running gimp in valgrind is excruciatingly slow). My conclusion would be that [NSApp nextEventMatchingMask:] can't be called safely between select_thread_start_poll and select_thread_collect_poll, but I'm not sure what the correct fix is.
Several of those patches are a bit out of date and have been incorporated into master already. Perhaps you could do a build of gtk-2-24 from git (use modulesets-unstable if you're using gtk-osx) without any patches except the one Mitch proposed on this bug and work with that, just to make sure that we're all working from the same code. Also, Apple's malloc has some built-in debugging features that I've found work better than valgrind on OSX. Take a look at http://developer.apple.com/library/mac/technotes/tn2124/_index.html#//apple_ref/doc/uid/DTS10003391-CH1-SECMALLOCDEBUG for a summary.
Daniel, can you attach the patch that determines by counter when to drain the pool? If it fixed the crash *and* the leaks, then it's clearly preferrable over my patch.
Created attachment 213845 [details] [review] Hack for poll_func recursion errors Re: John I re-checked things with a clean git master with no other patches applied, it produces the same leaks and (sometimes) the GTK ERROR crash. Re: Michael I didn't make a patch because it just moved the crash to the more intermittent GTK ERROR assertion, poll_func still couldn't safely recurse. Changing this line: if (current_loop_level == 0 && g_main_depth() == 0) to this if (current_loop_level == 0 && g_main_depth() == 0 && getting_events == 0) is sufficient to fix the autorelease pool error but not the other crash. It also bothers me because g_main_depth() == 0 should already cover this condition. For reference sake I've attached the nasty hack I have going to avoid both these errors, but it's not something I'd want to commit.
Thanks. There is not only this case when the crash happens, e.g. bug #673264 gets the crash but not the assertion. I think we should commit your patch with a big fat comment that this entire file is making some false assumption we have not figured yet. Will take care of that.
I've debugged this with gimp with commit 668891e7455ff9107901b004e7ce9d422041ef71 reverted. First, I've had to move the auto release pool "refresh" from the check to the dispatch function. This avoids the crash with the smashed stack. Also, for some reason, I think it is better suited to do this in dispatch. With that change I get the above mentioned assert: Gdk:ERROR:gdkeventloop-quartz.c:564:select_thread_collect_poll: assertion failed: (ufds[i].fd == current_pollfds[i].fd) It turns out this happens when in a recursive main loop a fd is added/removed. In that case, g_main_context_iterate() re-allocates the cached fds (cached_poll_array). Note that the ufds parameter to poll_func() points at this. So, by the time we return in the base level main loop and want to call the collect function, the ufds array might no longer be there. Thus, recursive main loops spawned from nextEventMatchingMask: might invalidate ufds. Will attach a patch.
Created attachment 217096 [details] [review] Patch fixing the crash for me This patch moves the autorelease pool drain and introduces protection against the invalidated ufds. Basically, when we suspect ufds has been invalidated by a recursive main loop instance, we refrain from calling the collect function. The patch is not fully correct yet: 1) We still don't really have a good explanation why the autorelease pool drain needs to be moved. 2) I get warnings like "(gimp-2.9:20008): GLib-WARNING **: poll(2) failed due to: Undefined error: 0." probably caused by a wrong return value when we refrain from running the collect function. Perhaps we should just immediately bail out of the poll function with a sensible return value instead of just skipping collect. I have not tried this patch yet with the other GIMP segfault around select -> rounded rectangle.
The patch unfortunately does not fix the problems around select -> rounded rectangle.
To clarify, select -> rounded rectangle does not crash, but hangs blocking on a read (gimp wire read). Perhaps poll missed the incoming data on the fd? Will investigate.
(In reply to comment #11) > Created an attachment (id=217096) [details] [review] > Patch fixing the crash for me Applied to master and gtk-2-24. Leaving open because we still get this warning from poll. All crashes seem fixed though. commit 1ad25dfb812f53d1b288d0816b1d0d85c1d24d9f Author: Michael Natterer <mitch@gimp.org> Date: Fri Sep 14 15:18:33 2012 +0200 quartz: Bug 674108 - Hard crash due to wrong NSAutoreleasePool stacking Apply patch from Kristian Rietveld which addresses two issues in gdkeventloop-quartz.c: This patch moves the autorelease pool drain and introduces protection against the invalidated ufds. Basically, when we suspect ufds has been invalidated by a recursive main loop instance, we refrain from calling the collect function. (cherry picked from commit 79b3326eaab18b942bd7e03ae8d24544182cb3dd) gdk/quartz/gdkeventloop-quartz.c | 45 ++++++++++++++++++++++++------------- 1 files changed, 29 insertions(+), 16 deletions(-)
*** Bug 683207 has been marked as a duplicate of this bug. ***
This will still cause events for the outer loop to get lost when a recursion happens. There should at least be a visible g_warning so we know the corruption is happening, otherwise we're going to end up with very hard to track bugs whenever these calls get made (e.g. bug 683207 is going to become "one out of 100 times drag and drop doesn't open the file").
(In reply to comment #16) > e.g. bug 683207 is going to become "one out of > 100 times drag and drop doesn't open the file"). That would be bug 658722. Note also bug 701571, which reports anomalous behavior in gtkosx_application menus after this patch is applied.
I've made a gtk2 version of the patch on bug 701571: https://bugzilla.gnome.org/attachment.cgi?id=247787 It moves the autoreleasepool recycling to gdk_edit_prepare(), so the same pool is in effect for a whole cycle of the main loop. It also creates an outer pool so that the objects initialized before starting the main loop don't all get released in the first iteration. Mitch, could you test this for your crash in the Gimp?
The patch does release all objects properly contrary to the current code in master, but it does not solve the main problem we were seeing in gimp (this is only visible when reverting gimp commit 668891e7455ff9107901b004e7ce9d422041ef71). The patch we committed did fix this problem, but that happened because the body of the if-statement was never executed -- the autorelease pool was never drained. This can be easily fixed by comparing g_main_depth() to 1 instead of 0, so that the pool is drained correctly. However, also with this change, it does no longer solve the problem we saw in gimp. It turns out that the real problem is that the pool is currently drained when poll_func is active (and with NSApplication nextEventMatchingMask in the current stack trace). This indicates that something must be wrong with how recursive main loops are handled currently. I am thinking about a fix for both of these issues. Hopefully these problems are then gone forever.
Kris, I'm having trouble understanding how poll_func() can be active at the same time as the base gdk_event_prepare(). Can you post a stack trace showing that?
Not sure this is related, but I am also getting lots of assertion failed: (ufds[i].fd == current_pollfds[i].fd) error messages and abnormal program exits using Pidgin with the GTK Quartz backend. (I posted more details of my situation at https://bugzilla.gnome.org/show_bug.cgi?id=660306#c7 ) I'd be happy to help with troubleshooting / testing if it would be helpful.
By the way, this is the full error message that I'm seeing: Gdk:ERROR:gdkeventloop-quartz.c:562:select_thread_collect_poll: assertion failed: (ufds[i].fd == current_pollfds[i].fd) Abort trap: 6
We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.
As announced a while ago, we are migrating to gitlab, and bugs that haven't seen activity in the last year or so will be not be migrated, but closed out in bugzilla. If this bug is still relevant to you, you can open a new issue describing the symptoms and how to reproduce it with gtk 3.22.x or master in gitlab: https://gitlab.gnome.org/GNOME/gtk/issues/new