After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 779060 - likely race in kqueue-helper.c
likely race in kqueue-helper.c
Status: RESOLVED DUPLICATE of bug 739424
Product: glib
Classification: Platform
Component: gio
2.46.x
Other FreeBSD
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2017-02-22 10:59 UTC by Andriy Gapon
Modified: 2017-05-08 09:31 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
all stacks (71.93 KB, text/plain)
2017-02-23 10:36 UTC, Andriy Gapon
Details

Description Andriy Gapon 2017-02-22 10:59:57 UTC
It seems that kqueue code maintains subscriptions in subs_hash_table hash table.
I think that there can be a race process_kqueue_notifications() and _kh_cancel_sub() as there is no synchronization between those methods beyond hash_lock that's held only while the table is being accessed.
Imagine the following scenario:
- process_kqueue_notifications() is called to process a notification
- it looks up the hash table and finds a subscription
- concurrently with that g_kqueue_file_monitor_cancel() is called in another thread
- it calls _kh_cancel_sub() to remove the subscription
- g_kqueue_file_monitor_cancel() proceeds to free the subscription object
- after that process_kqueue_notifications() may access freed memory

I think that either the subscription objects should be reference counted.
Or there should be a more coarse lock that would protect the subscription object while it's being used.

I believe that the following crash was caused by the described race condition:

(gdb) bt
  • #0 dl_shallow_copy
    at dep-list.c line 101
  • #1 dl_calculate
    at dep-list.c line 495
  • #2 _kh_dir_diff
    at kqueue-helper.c line 254
  • #3 process_kqueue_notifications
    at kqueue-helper.c line 327
  • #4 g_io_unix_dispatch
    at giounix.c line 165
  • #5 g_main_dispatch
    at gmain.c line 3154
  • #6 g_main_context_dispatch
    at gmain.c line 3769
  • #7 g_main_context_iterate
    at gmain.c line 3840
  • #8 g_main_context_iteration
    at gmain.c line 3901
  • #9 nsAppShell::ProcessNextNativeEvent(bool)
    from /usr/local/lib/firefox/libxul.so
  • #10 nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool)
    from /usr/local/lib/firefox/libxul.so
  • #11 non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool)
    from /usr/local/lib/firefox/libxul.so
  • #12 nsThread::ProcessNextEvent(bool, bool*)
    from /usr/local/lib/firefox/libxul.so
  • #13 NS_ProcessNextEvent(nsIThread*, bool)
    from /usr/local/lib/firefox/libxul.so
  • #14 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*)
    from /usr/local/lib/firefox/libxul.so
  • #15 MessageLoop::Run()
    from /usr/local/lib/firefox/libxul.so
  • #16 nsBaseAppShell::Run()
    from /usr/local/lib/firefox/libxul.so
  • #17 nsAppStartup::Run()
    from /usr/local/lib/firefox/libxul.so
  • #18 XREMain::XRE_mainRun()
    from /usr/local/lib/firefox/libxul.so
  • #19 XREMain::XRE_main(int, char**, nsXREAppData const*)
    from /usr/local/lib/firefox/libxul.so
  • #20 XRE_main
    from /usr/local/lib/firefox/libxul.so
  • #21 main
  • #3 process_kqueue_notifications
    at kqueue-helper.c line 327

(gdb) i loc
n = {fd = 35, flags = 2}
sub = 0x81be7b090
source = 0x81ce9da80
mask = G_FILE_MONITOR_EVENT_CHANGED
(gdb) p *sub
$8 = {filename = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, user_data = 0x5a5a5a5a5a5a5a5a, pair_moves = 1515870810, fd = 1515870810, deps = 0x819778b60, is_dir = 1515870810}

(gdb) p subs_hash_table
$9 = (GHashTable *) 0x81c16a4c0
(gdb) p *subs_hash_table
$10 = {size = 32, mod = 31, mask = 31, nnodes = 10, noccupied = 11, keys = 0x81be87700, hashes = 0x81be69800, values = 0x81be87900, hash_func = 0x80cc79c80 <g_direct_hash>, key_equal_func = 0x80cc7ba70 <g_direct_equal>, 
  ref_count = 1, version = 12, key_destroy_func = 0x0, value_destroy_func = 0x0}

(gdb) p *subs_hash_table->keys@32
$14 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x0 <repeats 17 times>}

Please note that 0x5a is a marker of freed memory (I enabled that feature in jemalloc) and that fd=35 is missing from the hash table.
Comment 1 Philip Withnall 2017-02-23 10:20:25 UTC
> - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread

Why would that happen? process_kqueue_notifications() is called from an I/O watch on an FD in the same thread where g_kqueue_file_monitor_start() was called. GFileMonitor is not thread-safe, so g_kqueue_file_monitor_cancel() is only allowed to be called in that same thread. So there should be no need for locking here, as the subs_hash_table should only ever be accessed from one thread.

Can you get a full backtrace of the crash using `thread apply all backtrace`?
Comment 2 Andriy Gapon 2017-02-23 10:35:53 UTC
(In reply to Philip Withnall from comment #1)
> > - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread
> 
> Why would that happen?

Sorry, I don't know.  I don't know much about the concurrency model, I was just speculating based on what I saw in the dump.

> Can you get a full backtrace of the crash using `thread apply all backtrace`?

Attaching.
Comment 3 Andriy Gapon 2017-02-23 10:36:23 UTC
Created attachment 346557 [details]
all stacks
Comment 4 Philip Withnall 2017-02-24 10:37:47 UTC
Could you possibly reproduce the bug under Valgrind’s memcheck and drd tools? Either there’s some memory corruption happening which is causing non-kqueue code to free the kqueue_sub; or it is a threading problem (as you suggest) and we need to work out which thread is destroying or cancelling the GFileMonitor. Thanks.
Comment 5 Andriy Gapon 2017-02-28 09:25:45 UTC
Unfortunately, that's hard for me to do, because I can not reproduce the problem at will.  It happens randomly and not very frequently.  I noticed that only Mozilla products (firefox and thunderbird) seem to be affected.  Not sure if that helps.
Comment 6 Philip Withnall 2017-03-02 09:41:20 UTC
OK. If you do manage to reproduce it under those tools it would be very helpful. Otherwise I will try and find some time in the next few weeks to step through the code more closely and work out where a problem might exist.
Comment 8 Philip Withnall 2017-05-08 09:31:11 UTC
This might be a duplicate of bug #739424. I’m going to mark it as such for now; please re-open it if you can show it’s a separate threading problem in the same code. Thanks.

*** This bug has been marked as a duplicate of bug 739424 ***