GNOME Bugzilla – Bug 666551
Crash in g_thread_xp_SleepConditionVariableSRW
Last modified: 2018-05-24 13:38:05 UTC
Created attachment 203905 [details] [review] Patch to fix crash in gthreads-win32.c I've been getting some random crashes with the latest version of glib/gtk+ in Windows, particularly in the GtkFileChooserDialog and similar. After some hard debugging days, I think I've found it: the problem is in function g_thread_xp_SleepConditionVariableSRW() (gthread-win32.c). This function first inserts the current thread 'waiter' in the cv list of waiters; then it calls WaitForSingleObject(), but it never removes the 'waiter' from the list. This looks correct, because both g_thread_xp_WakeConditionVariable() and g_thread_xp_WakeAllConditionVariable() remove the 'waiter' of the awaken thread. But what happens if nobody awakes the thread, and WaitForSingleObject() returns with a timeout? The 'waiter' is kept in the cv list! After that, the thread finishes, the waiter is destroyed, and the next time the cv is used... crash! My proposed patch removes (I think it does, that list is tricky!) the 'waiter' before returning if present in the list. That should only be necessary if 'status == WAIT_TIMEOUT', but who knows... Please, feel free to change it if you see fit.
I agree with your analysis. I'm not totally crazy about the solution because it runs even in the non-timeout case and because it's so awkward. That's not your fault -- the data structure was optimised for exactly three operations: append, in-order traverse and popping. It's not really suitable for dealing with remove-from-middle. This could probably be fixed by switching to a full linked list, or by giving each node a 'my_owner' field, such that this invariant is held: waiter->my_owner == NULL || *waiter->my_owner == waiter such that removing any waiter from a list is just a matter of doing this: *waiter->my_owner = waiter->next; We could then do that on the waiter (that we already have hold of) after the WaitForSingleObject() call returns in the timeout case. We just need to be careful in the other places the linked list is updated (mostly in the wake-single case) that we update the my_owner field of the 2nd element in the list when we pop the first to point back at the CV rather than the next pointer on the 1st. I'll give this a try.
On reading this code I also notice that the fact that I drop the critical section early and assume that the waiter is still alive is possibly problematic in the timeout case. If I call signal() just as the thread is timing out, and the thread exits shortly after the timeout, the signalling thread may try to SetEvent() on a no-longer-existent event variable.
Created attachment 203921 [details] [review] winxp threads: fix some condition variable races There are some races in the condition variable emulation code for Windows XP with respect to timeouts while waiting. First, in the event of a timeout, we never remove the waiter from the condition variable. This can cause crashes later. That problem was found by Rodrigo Rivas Costa. Second, if the waiting thread times out and exits just as we were about to call SetEvent() on its waiter event, we could end up trying to access the waiter after it was closed/freed. We need to hold on to the lock a little bit longer to ensure that that's not possible.
Please review this patch and let me know if you think it will fix the issue.
I've noticed some races, but frankly, after about a week hunting this bug I just wanted to avoid the crash. Your patch looks fine, I'll test it as soon as I can. But anyway, implementing condition variables with windows primitives is very tricky. And all those 'volatile' variables make me cringe... Have you considered checking/copying code from pthreads-win32 (http://sourceware.org/pthreads-win32/), LGPL. They did just that, and they took quite some versions until they got it right. And as far as I've heard, the latest version, from 2006 works flawlessly.
Created attachment 203941 [details] [review] Correction of Ryan's patch
Ryan, I'm afraid that it crashes even before than the original version! I added a few 'g_assert' with the invariant you comment and it fails soon enough. The problem seems to be that each time you remove an item from the list, you have to check check the next one's my_owner, and fix it accordingly. And take into account the case when you delete the last of the list. Also, you forgot to assign NULL to 'my_owner' in g_thread_xp_WakeConditionVariable. Maybe it would be good to add a g_assert or two, just to be sure. g_assert(waiter->my_owner == NULL || *waiter->my_owner == waiter); Please, see my updated patch.
Created attachment 204030 [details] [review] Fix a few dangling pointers I've seen that Ryan's patch has been commited on Dec. 19th. Nice, but it still crashes because of the reasons I explained in my previous message. The attached patch corrects these problems against the already updated version.
Review of attachment 204030 [details] [review]: Of course. Thanks for the catch.
Review of attachment 204030 [details] [review]: Hi rodrigorivascosta, I have committed your patch as 1b033774 into master on your behalf after changing your diff into git patch format, retaining the items you have mentioned in comment 7. I have not closed this bug as we may want to check further until all things are well on Windows XP-I checked this patch on my side in Windows 7 XPMode with binaries compiled with Visual C++ 2008 and ran the tests related to GLib threading for 5 times each, which turned out to be fine on my side (previous versions of the XP threading code had intermittent problems on my side when running the rwlock tests, to let you know). Is there any opportunity that you could post your test case here that triggered the problem on your side, out of curiosity? Thanks, and God bless.
Hi, Chun-wei. I have to admit that I don't have a formal test case. I started having crashes with the GTK+-2 demo (gtk-demo.exe) compiled with gcc-4.6.3 (cross compiled under linux), and run in a WinXP. It crashed only with the "Pickers" applet (open, close, open, close, open, close, crash!). Then, I reduced the problem to the following program: #include <gtk/gtk.h> int main(int argc, char **argv) { gtk_init(&argc, &argv); GtkWidget *w = gtk_file_chooser_dialog_new("Choose", NULL, GTK_FILE_CHOOSER_ACTION_SELECT_FOLDER, "OK", GTK_RESPONSE_OK, NULL); gtk_widget_show(w); gtk_main(); return 0; } It failed when changing quickly the folder and clicking the mouse like crazy ;)
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/491.