GNOME Bugzilla – Bug 779060
likely race in kqueue-helper.c
Last modified: 2017-05-08 09:31:11 UTC
It seems that kqueue code maintains subscriptions in subs_hash_table hash table. I think that there can be a race process_kqueue_notifications() and _kh_cancel_sub() as there is no synchronization between those methods beyond hash_lock that's held only while the table is being accessed. Imagine the following scenario: - process_kqueue_notifications() is called to process a notification - it looks up the hash table and finds a subscription - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread - it calls _kh_cancel_sub() to remove the subscription - g_kqueue_file_monitor_cancel() proceeds to free the subscription object - after that process_kqueue_notifications() may access freed memory I think that either the subscription objects should be reference counted. Or there should be a more coarse lock that would protect the subscription object while it's being used. I believe that the following crash was caused by the described race condition: (gdb) bt
+ Trace 237165
(gdb) i loc n = {fd = 35, flags = 2} sub = 0x81be7b090 source = 0x81ce9da80 mask = G_FILE_MONITOR_EVENT_CHANGED (gdb) p *sub $8 = {filename = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, user_data = 0x5a5a5a5a5a5a5a5a, pair_moves = 1515870810, fd = 1515870810, deps = 0x819778b60, is_dir = 1515870810} (gdb) p subs_hash_table $9 = (GHashTable *) 0x81c16a4c0 (gdb) p *subs_hash_table $10 = {size = 32, mod = 31, mask = 31, nnodes = 10, noccupied = 11, keys = 0x81be87700, hashes = 0x81be69800, values = 0x81be87900, hash_func = 0x80cc79c80 <g_direct_hash>, key_equal_func = 0x80cc7ba70 <g_direct_equal>, ref_count = 1, version = 12, key_destroy_func = 0x0, value_destroy_func = 0x0} (gdb) p *subs_hash_table->keys@32 $14 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x0 <repeats 17 times>} Please note that 0x5a is a marker of freed memory (I enabled that feature in jemalloc) and that fd=35 is missing from the hash table.
> - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread Why would that happen? process_kqueue_notifications() is called from an I/O watch on an FD in the same thread where g_kqueue_file_monitor_start() was called. GFileMonitor is not thread-safe, so g_kqueue_file_monitor_cancel() is only allowed to be called in that same thread. So there should be no need for locking here, as the subs_hash_table should only ever be accessed from one thread. Can you get a full backtrace of the crash using `thread apply all backtrace`?
(In reply to Philip Withnall from comment #1) > > - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread > > Why would that happen? Sorry, I don't know. I don't know much about the concurrency model, I was just speculating based on what I saw in the dump. > Can you get a full backtrace of the crash using `thread apply all backtrace`? Attaching.
Created attachment 346557 [details] all stacks
Could you possibly reproduce the bug under Valgrind’s memcheck and drd tools? Either there’s some memory corruption happening which is causing non-kqueue code to free the kqueue_sub; or it is a threading problem (as you suggest) and we need to work out which thread is destroying or cancelling the GFileMonitor. Thanks.
Unfortunately, that's hard for me to do, because I can not reproduce the problem at will. It happens randomly and not very frequently. I noticed that only Mozilla products (firefox and thunderbird) seem to be affected. Not sure if that helps.
OK. If you do manage to reproduce it under those tools it would be very helpful. Otherwise I will try and find some time in the next few weeks to step through the code more closely and work out where a problem might exist.
Did you try: https://bugzilla.gnome.org/show_bug.cgi?id=779777 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214338
This might be a duplicate of bug #739424. I’m going to mark it as such for now; please re-open it if you can show it’s a separate threading problem in the same code. Thanks. *** This bug has been marked as a duplicate of bug 739424 ***