Bug 779060 – likely race in kqueue-helper.c

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 779060 - likely race in kqueue-helper.c


Summary:	likely race in kqueue-helper.c


Status:	RESOLVED DUPLICATE of bug 739424

Product:	glib
Classification:	Platform
Component:	gio
Version:	2.46.x
Hardware:	Other FreeBSD

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-02-22 10:59 UTC by Andriy Gapon
Modified:	2017-05-08 09:31 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
all stacks (71.93 KB, text/plain) 2017-02-23 10:36 UTC, Andriy Gapon	Details

Description Andriy Gapon 2017-02-22 10:59:57 UTC

It seems that kqueue code maintains subscriptions in subs_hash_table hash table.
I think that there can be a race process_kqueue_notifications() and _kh_cancel_sub() as there is no synchronization between those methods beyond hash_lock that's held only while the table is being accessed.
Imagine the following scenario:
- process_kqueue_notifications() is called to process a notification
- it looks up the hash table and finds a subscription
- concurrently with that g_kqueue_file_monitor_cancel() is called in another thread
- it calls _kh_cancel_sub() to remove the subscription
- g_kqueue_file_monitor_cancel() proceeds to free the subscription object
- after that process_kqueue_notifications() may access freed memory

I think that either the subscription objects should be reference counted.
Or there should be a more coarse lock that would protect the subscription object while it's being used.

I believe that the following crash was caused by the described race condition:

(gdb) bt

+ Trace 237165

#0 dl_shallow_copy
at dep-list.c line 101
#1 dl_calculate
at dep-list.c line 495
#2 _kh_dir_diff
at kqueue-helper.c line 254
#3 process_kqueue_notifications
at kqueue-helper.c line 327
#4 g_io_unix_dispatch
at giounix.c line 165
#5 g_main_dispatch
at gmain.c line 3154
#6 g_main_context_dispatch
at gmain.c line 3769
#7 g_main_context_iterate
at gmain.c line 3840
#8 g_main_context_iteration
at gmain.c line 3901
#9 nsAppShell::ProcessNextNativeEvent(bool)
from /usr/local/lib/firefox/libxul.so
#10 nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool)
from /usr/local/lib/firefox/libxul.so
#11 non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool)
from /usr/local/lib/firefox/libxul.so
#12 nsThread::ProcessNextEvent(bool, bool*)
from /usr/local/lib/firefox/libxul.so
#13 NS_ProcessNextEvent(nsIThread*, bool)
from /usr/local/lib/firefox/libxul.so
#14 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*)
from /usr/local/lib/firefox/libxul.so
#15 MessageLoop::Run()
from /usr/local/lib/firefox/libxul.so
#16 nsBaseAppShell::Run()
from /usr/local/lib/firefox/libxul.so
#17 nsAppStartup::Run()
from /usr/local/lib/firefox/libxul.so
#18 XREMain::XRE_mainRun()
from /usr/local/lib/firefox/libxul.so
#19 XREMain::XRE_main(int, char**, nsXREAppData const*)
from /usr/local/lib/firefox/libxul.so
#20 XRE_main
from /usr/local/lib/firefox/libxul.so
#21 main
#3 process_kqueue_notifications
at kqueue-helper.c line 327


(gdb) i loc
n = {fd = 35, flags = 2}
sub = 0x81be7b090
source = 0x81ce9da80
mask = G_FILE_MONITOR_EVENT_CHANGED
(gdb) p *sub
$8 = {filename = 0x5a5a5a5a5a5a5a5a <error: Cannot access memory at address 0x5a5a5a5a5a5a5a5a>, user_data = 0x5a5a5a5a5a5a5a5a, pair_moves = 1515870810, fd = 1515870810, deps = 0x819778b60, is_dir = 1515870810}

(gdb) p subs_hash_table
$9 = (GHashTable *) 0x81c16a4c0
(gdb) p *subs_hash_table
$10 = {size = 32, mod = 31, mask = 31, nnodes = 10, noccupied = 11, keys = 0x81be87700, hashes = 0x81be69800, values = 0x81be87900, hash_func = 0x80cc79c80 <g_direct_hash>, key_equal_func = 0x80cc7ba70 <g_direct_equal>, 
  ref_count = 1, version = 12, key_destroy_func = 0x0, value_destroy_func = 0x0}

(gdb) p *subs_hash_table->keys@32
$14 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x0 <repeats 17 times>}

Please note that 0x5a is a marker of freed memory (I enabled that feature in jemalloc) and that fd=35 is missing from the hash table.

Comment 1 Philip Withnall 2017-02-23 10:20:25 UTC

> - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread

Why would that happen? process_kqueue_notifications() is called from an I/O watch on an FD in the same thread where g_kqueue_file_monitor_start() was called. GFileMonitor is not thread-safe, so g_kqueue_file_monitor_cancel() is only allowed to be called in that same thread. So there should be no need for locking here, as the subs_hash_table should only ever be accessed from one thread.

Can you get a full backtrace of the crash using `thread apply all backtrace`?

Comment 2 Andriy Gapon 2017-02-23 10:35:53 UTC

(In reply to Philip Withnall from comment #1)
> > - concurrently with that g_kqueue_file_monitor_cancel() is called in another thread
> 
> Why would that happen?

Sorry, I don't know.  I don't know much about the concurrency model, I was just speculating based on what I saw in the dump.

> Can you get a full backtrace of the crash using `thread apply all backtrace`?

Attaching.

Comment 3 Andriy Gapon 2017-02-23 10:36:23 UTC

Created attachment 346557 [details]
all stacks

Comment 4 Philip Withnall 2017-02-24 10:37:47 UTC

Could you possibly reproduce the bug under Valgrind’s memcheck and drd tools? Either there’s some memory corruption happening which is causing non-kqueue code to free the kqueue_sub; or it is a threading problem (as you suggest) and we need to work out which thread is destroying or cancelling the GFileMonitor. Thanks.

Comment 5 Andriy Gapon 2017-02-28 09:25:45 UTC

Unfortunately, that's hard for me to do, because I can not reproduce the problem at will.  It happens randomly and not very frequently.  I noticed that only Mozilla products (firefox and thunderbird) seem to be affected.  Not sure if that helps.

Comment 6 Philip Withnall 2017-03-02 09:41:20 UTC

OK. If you do manage to reproduce it under those tools it would be very helpful. Otherwise I will try and find some time in the next few weeks to step through the code more closely and work out where a problem might exist.

Comment 7 rozhuk.im 2017-03-08 22:52:22 UTC

Did you try:
https://bugzilla.gnome.org/show_bug.cgi?id=779777
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214338

Comment 8 Philip Withnall 2017-05-08 09:31:11 UTC

This might be a duplicate of bug #739424. I’m going to mark it as such for now; please re-open it if you can show it’s a separate threading problem in the same code. Thanks.

*** This bug has been marked as a duplicate of bug 739424 ***