GNOME Bugzilla – Bug 739424
Rewrite kqueue GFileMonitor backend to drop threading
Last modified: 2018-03-13 12:32:39 UTC
gnome-shell and other GNOME apps, such as evolution, crashes when any files are added, deleted, or modified in $(prefix)/share/applications. This is very annoying because I often get a gnome-session-failed when the package manager or JHBuild is running. GLib version: 2.43.0-2-gb3e3ed7 (git describe --long) I find this bug on FreeBSD 10.1, but I think it is not specific to FreeBSD because I also get a lot of crashes when running yum in Fedora 20. An OpenBSD GNOME package maintainer also told me he always upgrade packages in the text console because of this problem. After trying to debug gnome-shell, I guess this problem may come from GAppInfoMonitor. To ease the debugging of this problem, I wrote a very simple problem to reproduce the bug: #include <glib.h> #include <gio/gio.h> int main (int argc, char *argv[]) { GMainLoop *loop = g_main_loop_new (NULL, TRUE); GList *apps = g_app_info_get_all (); GAppInfoMonitor *monitor = g_app_info_monitor_get (); g_main_loop_run (loop); return 0; } This program also crashes when anything under $(prefix)/share/applications is modified. It may crashes with Bus error or Abort trap. I guess it is caused by a GFileMonitor which is used after it has been freed. When it crashes with Bus error in g_file_monitor_emit_event, the value of monitor is: {parent_instance = {g_type_instance = {g_class = 0xaaaaaaaaaaaaaaaa}, ref_count = 2863311530, qdata = 0xaaaaaaaaaaaaaaaa}, priv = 0xaaaaaaaaaaaaaaaa} When it crashes with Abort trap in g_mutex_lock, I see this message on my terminal: GLib (gthread-posix.c): Unexpected error from C library during 'pthread_mutex_lock': Invalid argument. The invalid argument error comes from pthread library because the mutex passed in pthread_mutex_lock equals THR_MUTEX_DESTROYED. Source code is available here: http://svn.freebsd.org/base/releng/10.1/lib/libthr/thread/thr_mutex.c (line 305, macro CHECK_AND_INIT_MUTEX) Therefore, I set a watchpoint on (*((GObject*)monitor)->ref_count) to get notified when g_object_ref or g_object_unref is called. The following backtraces are obtained when the ref_count of the GFileMonitor is changed. ref_count: 0 -> 1 (http://fpaste.org/146597/46933421/) The GFileMonitor object is created and kept inside gdesktopappinfo.c with global static variable. To test the problem, I create and remove an empty file in $(prefix)/share/applications: $ touch /home/lantw44/gnome/devinstall/share/applications/abc $ rm /home/lantw44/gnome/devinstall/share/applications/abc ref_count: 1 -> 2 (http://fpaste.org/146602/14146937/) The reference is explicitly increased by emit_cb. ref_count: 2 -> 3 (http://fpaste.org/146603/14693828/) The reference is increased by g_signal_emit. ref_count: 3 -> 2 (http://fpaste.org/146604/41469398/) The reference is decreased by desktop_file_dir_reset, so gdesktopappinfo.c no longer holds a reference now. ref_count: 2 -> 1 (http://fpaste.org/146607/94087141/) The reference is decreased by g_signal_emit. ref_count: 1 -> 2 (http://fpaste.org/146613/46946421/) ref_count: 2 -> 1 (http://fpaste.org/146614/46946711/) ref_count: 1 -> 0 (http://fpaste.org/146616/41469472/) The last reference is removed by emit_cb. The object is freed now. ref_count: 0 -> 2863311530
+ Trace 234279
Program terminated with signal SIGBUS, Bus error. It seems something still hold a reference to the GFileMonitor ...
> because I also get a lot of crashes when running yum in Fedora 20. An OpenBSD > GNOME package maintainer also told me he always upgrade packages in the text > console because of this problem. That would be me and yes, I can confirm this issue with GNOME 3.14.1 on OpenBSD.
I've also seen this on Fedora 20 with GNOME 3.12 but can't reproduce it with the above program. Maybe it's more reproducible when using the underlying file monitor used on FreeBSD and OpenBSD?
(In reply to comment #2) > I've also seen this on Fedora 20 with GNOME 3.12 but can't reproduce it with > the above program. Maybe it's more reproducible when using the underlying file > monitor used on FreeBSD and OpenBSD? Yes, I know it is not always reproducible. It is always reproducible on my desktop computer, but it is less reproducible on my virtual machine although the version of GNOME and FreeBSD are the same.
It seems g_app_info_monitor_get() is not needed. Calling g_app_info_get_all() can cause it to crash.
Can anyone get a backtrace?
I try to find where emit_cb and the monitor are added into a GSource. Here is the backtrace I find out by searching the source code and using GDB:
+ Trace 234287
One thing to note: GDesktopAppInfo has the monitor fire its events in the GLib worker thread (whereas most users of file monitors have their events go to the main thread). Perhaps the kqueue file monitoring backend is not safe with respect to this.
I dug into this a bit more and the situation is pretty bad. The file notification backends have more or less never been threadsafe (despite the API giving the appearance that it was intended to be used from multiple threads). In essence, if someone destroys a monitor just as an event is arriving, you can end up trying to emit a signal on an object that has been finalized (which is what we are seeing here). In the inotify case we get lucky most of the time because inotify does all of its work from the worker thread and that's where GDesktopAppInfo is also freeing things, so there is no race. It could race with other threads, though. In the FreeBSD case we are not so lucky: it notifies from the main thread and GDesktopAppInfo destroys from the worker thread. Trouble comes easy in this case. The only good solution here is to move the backends to make use of GWeakRef with respect to signal dispatches.
Seems that this broke back with bug 579984 when the concept of a thread-default main context was first introduced. GFileMonitor was updated to dispatch to different threads but the backend code was never cleaned up to actually be threadsafe...
*** Bug 720489 has been marked as a duplicate of this bug. ***
Let's reuse this for some more adjustments after the new file monitors landed (since things are still quite broken on FreeBSD).
Created attachment 300393 [details] [review] gpollfilemonitor: don't emit after cancellation GPollFileMonitor emits CHANGES_DONE_HINT after CHANGED signals, but it doesn't check to ensure that the file monitor wasn't cancelled before it does that. If the original signal caused the monitor to be unreffed, cancelled and destroyed, we would still end up emitting an extra signal on it. Avoid that by checking first for cancellation.
Created attachment 300394 [details] [review] gpollfilemonitor: send 'changes done' for creates The new rules of GFileMonitor says that users should expect to see a CHANGES_DONE_HINT following a CREATED as well as CHANGED.
Created attachment 300395 [details] [review] kqueue: add a bit of extra paranoia on cancel Cancellation of GPollFileMonitor is now handled correctly (in the sense that no further signals will follow) but let's be extra paranoid and disconnect our handler anyway, for good measure.
There is still some trouble here: the polling file monitor will be created in the default main context in effect at whatever point the monitor was created. This will not be right for monitors started in other contexts (as is the case with the appinfo monitor in the worker thread, for example). There is also a race there: we check for the initial state asynchronously. By the time that returns, a change could already have occurred, and we would miss it. I think it might be reasonable to use g_main_context_invoke() here, but we actually don't have access to the main context from the backend because we don't attach the source until after the start() vfunc returns. Another possibility may be to always use the worker and dispatch events via the source like usual (so that they end up in the correct thread). A final solution is to entirely avoid chaining to the poll file monitor from kqueue and implement the functionality in-place. kqueue already has to scan in order to figure out what changes occurred, so this would not be a substantial growth of that functionality...
> landed (since things are still quite broken on FreeBSD). Not just FreeBSD. OpenBSD and NetBSD suffers from the same issue.
Review of attachment 300393 [details] [review]: This looks right, as far as cancellation is concerned. But I'm not sure I fully grok the commit message: We seem to be holding our own ref on the poll_monitor all the way from poll_file_timeout to the end of got_new_info, so how could the poll monitor be destroyed before we're done ?
Review of attachment 300394 [details] [review]: ok
Review of attachment 300395 [details] [review]: looks good to me.
(In reply to Antoine Jacoutot from comment #16) > > landed (since things are still quite broken on FreeBSD). > > Not just FreeBSD. > OpenBSD and NetBSD suffers from the same issue. I have also seen the same (or very similar) issue on OpenIndiana and the latest Solaris. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199872
Attachment 300393 [details] pushed as 62e5ee5 - gpollfilemonitor: don't emit after cancellation Attachment 300394 [details] pushed as a367921 - gpollfilemonitor: send 'changes done' for creates Attachment 300395 [details] pushed as 9e8f4d4 - kqueue: add a bit of extra paranoia on cancel
> Status: Resolved Fixed Good show! Thank you so much! :D
There are still issues not solved by attached patches. I have used the patches since it was posted, but gnome-shell, gnome-software, gnome-terminal, evolution, epiphany, firefox still crash with similar backtraces.
$ gnome-shell --version GNOME Shell 3.18.3 $ uname -a OpenBSD bofh.internal 5.9 GENERIC.MP#1870 amd64 This bug still affects us here on OpenBSD, doing a pkg_add that touches $(prefix)/share/applications still causes gnome-shell to have an anxiety attack and restart itself. gnome-terminal thankfully does /not/ close from this action, my last gnome3+openbsd test run a year or so ago had me quit using gnome-terminal entirely in favor of urxvt, as losing every open terminal at once was a bit painful. We don't have gnome-software here, evolution is not something I like and epiphany is strangely always otherworldly magnitudes of slower compared to both chrome or firefox so I haven't run it long enough to observe that kind of issue. Cheers!
This can help: https://bugzilla.gnome.org/show_bug.cgi?id=779777
Created attachment 350642 [details] [review] Fix kqueue_sub use-after-free
There's at least one use-after-free due to the use of a thread and not reference counting for ``kqueue_sub''. I took the simplest approach of serializing the threads by holding the hash_lock when a ``kqeueu_sub'' is manipulated. However a better fix and improvement would be to get rid of the custom thread. This fix the reported issues on OpenBSD at least.
> This fix the reported issues on OpenBSD at least. Indeed. The patch is on its way to our ports tree. Thanks Martin.
The "kqueue_sub use-after-free" patch by Martin also fixes the "restarting" problem with gnome-shell on FreeBSD. It made it's way to our ports tree as well.
Created attachment 351191 [details] [review] Better fix for the use-after-free This new version fix a mutex recursion introduced in the previous diff and protect a forgotten code path.
Review of attachment 351191 [details] [review]: Why not add reference counting to kqueue_sub? The growth of those locked sections in kqueue-helper.c doesn’t fill me with confidence, especially since GIO code (g_file_monitor_source_handle_event()) is now called from within the locked section.
If you want to add reference counting, sure, go for it. But IMHO adding reference counting is into wrong direction. It will add complexity to a backend which is already too complex. That's what I said in: https://bugzilla.gnome.org/show_bug.cgi?id=779777 I looked for the simplest non-intrusive change that fix the problem. I don't think there's any problem with calling g_file_monitor_source_handle_event() with the hash lock held because I don't see any other code path where a thread would grab the GFileMonitorSource's lock first. So I'm not concerned about lock ordering.
*** Bug 779060 has been marked as a duplicate of this bug. ***
(In reply to Martin Pieuchot from comment #32) > If you want to add reference counting, sure, go for it. But IMHO adding > reference > counting is into wrong direction. It will add complexity to a backend which > is already too complex. I’d argue that reference counting is simpler to reason about than locking is.
I disagree with you but I don't want to discourage you. If you can come with a better fix I'll be happy to review it.
After applying attachment 351191 [details] [review], I still see gnome-shell, gnome-terminal, epiphany, evolution crash when upgrading packages or running jhbuild. gnome-shell and epiphany crash much less often than before, so I think it is still a useful patch.
I've haven't seen any crash since I applied the latest patch from Martin. I'd really appreciate if I could get an OK to push it. Thanks.
(In reply to Antoine Jacoutot from comment #37) > I've haven't seen any crash since I applied the latest patch from Martin. > I'd really appreciate if I could get an OK to push it. > Thanks. No, I am not happy about the growth of the locked sections to include calls into large parts of GLib. Once you call into GFileMonitorSource, it can end up calling into a lot of the rest of GIO. A more appropriate fix would be to add reference counting or get rid of the thread, as Martin said in comment #27. I am not going to come up with that fix, as I do not have enough time nor a way to test it.
Created attachment 368210 [details] [review] Multiple fixes and simplifications for the kqueue(2) backend (In reply to Philip Withnall from comment #38) > A more appropriate fix would be to add reference counting or get rid of the > thread, as Martin said in comment #27. So I took a shot at removing the thread. This simplifies *a lot* the whole backend. There's no need for a global hash table anymore, so the lock protecting it goes away. The newly proposed diff is now enabled in OpenBSD's package and has been in use for a month now without issue. Here's the explanatory commit message: - Stop using a custom thread for listening to kqueue(2) events. Instead call kevent(2) in non blocking mode in a monitor callback. Under the hood poll(2) is used to figure out if new events are available. - Do not use a socketpair with a custom protocol requiring 2 supplementary context switches per event to commicate between multiple threads. Calling kevent(2), in non blocking mode, to add/remove events is fine from any context. - Add kqueue(2) events without the EV_ONESHOT flag. This removes a race were some notifications were lost because events had to be re-added for every new notification. - Get rid of the global hash table and its associated lock and races. Use the 'cookie' argument of kevent(2) to pass the associated descriptor when registering an event. - Fix _kh_file_appeared_cb() by properly passing a monitor instead of a source to g_file_monitor_emit_event(). - Porperly refcount sources. - Remove a lot of abstraction making it harder to fix the remaining issues.
It is still legacy code. - Separate thread is good, because too many work for main thread is evil. IMHO FAM may block main thread on open() and you GUI will freeze. That is why I create separate thread and async add/remove mon tasks via pipe(). - Socket pair/pipe useful to queue. - EV_ONESHOT was ugly design. - Hash table other ugly design example. - Ref count unneeded for this king of solution. I am working on rate limit for events in mine FAM backend. It is very useful in cases where file manager open /tmp and some port is compiling in same time and create/deletes many files with high rate. I also disable files changing monitoring for all non local fs, to prevent lags while FAM open() many files to get handle to mon. Can you try https://bugzilla.gnome.org/show_bug.cgi?id=779777 on OpenBSD?
(In reply to Martin Pieuchot from comment #39) > Created attachment 368210 [details] [review] [review] > Multiple fixes and simplifications for the kqueue(2) backend I applied attachment 368210 [details] [review], but gnome-shell and evolution still crash when a file is added to or deleted from the applications folder. It looks like a null pointer dereference problem. Program terminated with signal SIGSEGV, Segmentation fault.
+ Trace 238386
Review of attachment 368210 [details] [review]: ::: gio/kqueue/gkqueuefilemonitor.c @@ +403,3 @@ + if (kevent (kq_queue, &ev, 1, NULL, 0, NULL) == -1) + { + g_warning ("Unable to remove event for: %s", sub->filename, g_strerror (errno)); You need two '%s' in this format string.
(In reply to Ting-Wei Lan from comment #41) > (In reply to Martin Pieuchot from comment #39) > > Created attachment 368210 [details] [review] [review] [review] > > Multiple fixes and simplifications for the kqueue(2) backend > > I applied attachment 368210 [details] [review] [review], but gnome-shell and > evolution still crash when a file is added to or deleted from the > applications folder. It looks like a null pointer dereference problem. It seems that the problem is in the fallback code. I did the test in a JHBuild GNOME session installed in /home/lantw44/gnome/devinstall. The entire installation is on a file system mounted on /home/lantw44/gnome, which is marked as can_unmount=1 so _ke_is_excluded returns true. g_kqueue_file_monitor_start sets 'sub' only when _ke_is_excluded returns false, so accessing 'sub' in _fallback_callback is a null pointer dereference. No crashes were observed when I modified _ke_is_excluded to always return false for files under /home, so it seems to me that the kqueue code works without problem now.
Created attachment 368338 [details] [review] Multiple fixes and simplifications for the kqueue(2) backend, v2 Thanks for your tests, I just added a new version of the diff. (In reply to Ting-Wei Lan from comment #43) > It seems that the problem is in the fallback code. I did the test in a > JHBuild GNOME session installed in /home/lantw44/gnome/devinstall. The > entire installation is on a file system mounted on /home/lantw44/gnome, > which is marked as can_unmount=1 so _ke_is_excluded returns true. > g_kqueue_file_monitor_start sets 'sub' only when _ke_is_excluded returns > false, so accessing 'sub' in _fallback_callback is a null pointer > dereference. It makes no sense to not use the kqueue backend in such directories, so I just removed the fallback code. I also fixed the printf you pointed out.
(In reply to rozhuk.im from comment #40) > - Separate thread is good, because too many work for main thread is evil. > IMHO FAM may block main thread on open() and you GUI will freeze. That is > why I create separate thread and async add/remove mon tasks via pipe(). No it's not good. Using a secondary thread instead of making the syscall not blocking is adding complexity. With my diff kevent(2) is not blocking and only called when poll(2) is reporting there's some events to read. > - Socket pair/pipe useful to queue. No, it is wrong to use an inter process communication mechanism requiring multiple context changes and the kernel doing multiple copyin/copyout to communicate between threads using the same address space. > - EV_ONESHOT was ugly design. I agree, that's why I removed it ;) > - Hash table other ugly design example. Well the problem was not the hash table per se but the need to have a global data structure. > - Ref count unneeded for this king of solution. Please read my diff, it *is* needed. But I like the "king of solution" ;)
(In reply to Martin Pieuchot from comment #44) > (In reply to Ting-Wei Lan from comment #43) > > It seems that the problem is in the fallback code. I did the test in a > > JHBuild GNOME session installed in /home/lantw44/gnome/devinstall. The > > entire installation is on a file system mounted on /home/lantw44/gnome, > > which is marked as can_unmount=1 so _ke_is_excluded returns true. > > g_kqueue_file_monitor_start sets 'sub' only when _ke_is_excluded returns > > false, so accessing 'sub' in _fallback_callback is a null pointer > > dereference. > > It makes no sense to not use the kqueue backend in such directories, so I > just removed the fallback code. The fallback code is meant to be used on removable devices, such as USB flash drives and SD cards. It is a workaround for lacking O_EVTONLY because having a file descriptor opened with O_RDONLY can prevent the file system from being unmounted. If the fallback code is removed, what should we do when users want to remove their drives? By the way, using the kqueue backend instead of the polling fallback seems to reduce the CPU usage from about 40% to less than 10% at idle. GNOME desktop works much more smoothly, and I saw zero application crash when updating packages or running JHBuild. Thanks for your great work!
(In reply to Martin Pieuchot from comment #45) > (In reply to rozhuk.im from comment #40) > > - Separate thread is good, because too many work for main thread is evil. > > IMHO FAM may block main thread on open() and you GUI will freeze. That is > > why I create separate thread and async add/remove mon tasks via pipe(). > > No it's not good. Using a secondary thread instead of making the syscall > not blocking is adding complexity. > > With my diff kevent(2) is not blocking and only called when poll(2) is > reporting there's some events to read. Are you sure that all open(), stat(), readdir() and etc.. syscalls is non blocking? I am - not. Also, comparing current and prev dir listing multiple time requires some CPU time. For example Thunar create multiple instances, but all gio-fam work in one process and some times eat many CPU, so if no separate thread then thunar gui will freeze. > > - Socket pair/pipe useful to queue. > No, it is wrong to use an inter process communication mechanism requiring > multiple context changes and the kernel doing multiple copyin/copyout > to communicate between threads using the same address space. Generally yes, but in my implementation it used only to add/delete monitoring object, messages are small -> pipe() give very small overhead, because call rate is very low. Same time call rate kqueue() + getdents() + fstatat() may be very high if we monitoring for changes /tmp and /proc (Thunar monitoring it, and it multiple x2 create/delete events rate) and same time some big port (like llvm/ff) is building. That is why I implement adaptive rate limit control. Take first event and process it. Take second event and if time less then "initial time" then do not process, but schedule timer to "initial time". If more events occur then count them. On timer event if "event counter" > 0 then compare prev dir listing with current and fire events to glib. Increment timer time by "multipe factor" time and schedule it. If event counter = 0, off timer. > > - Hash table other ugly design example. > Well the problem was not the hash table per se but the need to have a global > data structure. My solution have no rw data that requires multiple threads access. > > - Ref count unneeded for this king of solution. > > Please read my diff, it *is* needed. But I like the "king of solution" ;) My bad english can cause misunderstanding. That is because you try to improve legacy code. If you totally redesign it then no ref count will be used, probably :) Maybe we will join forces and finish mine version to the state when it will be accepted? I can put on a githab or a local pfabricator for teamwork. > The fallback code is meant to be used on removable devices, such as USB flash > drives and SD cards. It is a workaround for lacking O_EVTONLY because having a > file descriptor opened with O_RDONLY can prevent the file system from being > unmounted. If the fallback code is removed, what should we do when users want > to remove their drives? umount -f I use xfce4-mount plugin and set custom unmount command. > By the way, using the kqueue backend instead of the polling fallback seems to > reduce the CPU usage from about 40% to less than 10% at idle. GNOME desktop > works much more smoothly, and I saw zero application crash when updating > packages or running JHBuild. Thanks for your great work! Probably you have new good CPU, then I start mine CPU was coreduo E8400 and CPU usage > 50% most time, even no user activity. Polling is terrible and unusable solution!!!
(In reply to rozhuk.im from comment #47) > > The fallback code is meant to be used on removable devices, such as USB flash > > drives and SD cards. It is a workaround for lacking O_EVTONLY because having a > > file descriptor opened with O_RDONLY can prevent the file system from being > > unmounted. If the fallback code is removed, what should we do when users want > > to remove their drives? > > umount -f > I use xfce4-mount plugin and set custom unmount command. 'umount -f' itself is a workaround. How can a user know whether a drive is safe to remove if 'umount -f' is always used? > > By the way, using the kqueue backend instead of the polling fallback seems to > > reduce the CPU usage from about 40% to less than 10% at idle. GNOME desktop > > works much more smoothly, and I saw zero application crash when updating > > packages or running JHBuild. Thanks for your great work! > > Probably you have new good CPU, then I start mine CPU was coreduo E8400 and > CPU usage > 50% most time, even no user activity. > Polling is terrible and unusable solution!!! Polling is bad and slow. The problem is that we still don't have an alternative which doesn't block unmounting. I don't think we can tell users to use 'umount -f' blindly.
(In reply to Ting-Wei Lan from comment #48) > Polling is bad and slow. The problem is that we still don't have an > alternative which doesn't block unmounting. I don't think we can tell users > to use 'umount -f' blindly. This is BSD problem, not glib or other apps. IMHO bad idea to make workarounds and work slow/bad/terrible.
It may be better to have an environment variable which users can set to disable the fallback code. I think we can document the variable in GLib manual and FreeBSD handbook, so users who are willing to resolve the unmount problem themselves can find it.
(In reply to Martin Pieuchot from comment #45) > > - Hash table other ugly design example. > > Well the problem was not the hash table per se but the need to have a global > data structure. > > > - Ref count unneeded for this king of solution. > > Please read my diff, it *is* needed. But I like the "king of solution" ;) Probably this used for deduplicate monitored objects - thunar add multiple times same dirs. IMHO this functional should be moved to one level up in glib, to prevent code/func duplication in kqueue()/inotify/etc backends.
(In reply to Ting-Wei Lan from comment #50) > It may be better to have an environment variable which users can set to > disable the fallback code. I think we can document the variable in GLib > manual and FreeBSD handbook, so users who are willing to resolve the unmount > problem themselves can find it. Well, during last 4 years that I use FreeBSD as desktop, fam in glib: - works ok (as I remember) first 8 month - was disabled 1 year - freeze and crash many apps ~2 years I waiting long time to some glib or FreeBSD devs fix it. Then I do it by myself to get max user experience/comfort. Many users will not wait, will not google to some strange env tuning, they just drop *BSD and switch to linux/windows/MAC. And told friends that *BSD is bad and ugly, because most apps freezes and crashes on simple operations. Default poll time - 800ms, now try imagine that user open few thunars/etc..., this will get to monitor few dozen dirs. Then user navigate and get n*100 monitored folders. And all that n*100 monitored dirs updates every 800ms. Even on modern top CPU this will consume at least half of core, on mine coreduo I get 100% after 5 minutes after active thunar use. Is it usable? Is it good user experience? Switching poll on by default so ugly and terrible behavior - this is a BIG mistake for all BSD systems. Missing support of O_EVTONLY should be escalated to kernel developers.
(In reply to rozhuk.im from comment #52) > (In reply to Ting-Wei Lan from comment #50) > > It may be better to have an environment variable which users can set to > > disable the fallback code. I think we can document the variable in GLib > > manual and FreeBSD handbook, so users who are willing to resolve the unmount > > problem themselves can find it. > > Well, during last 4 years that I use FreeBSD as desktop, fam in glib: I assume you mean GFileMonitor here. Calling it 'fam' is very confusing because fam is the name of a different but related package. GFileMonitor also has a fam backend. > - works ok (as I remember) first 8 month > - was disabled 1 year > - freeze and crash many apps ~2 years Crashing is a known problem, as you can see in this bug report, and yes, unfortunately, we were unable to resolve it before importing GNOME 3.14 to FreeBSD ports. GNOME 3 import had been delayed for many times, and we really didn't want to delay it again. We just applied a workaround patch for gnome-session and didn't consider it as a blocker bug. > I waiting long time to some glib or FreeBSD devs fix it. > Then I do it by myself to get max user experience/comfort. It seems to me that nobody from FreeBSD worked for this problem. It is annoying and I have a lot of core dumps in my home directory because of this problem. I even disabled core dumps with 'sysctl kern.coredump=0' because I didn't want to wait a few minutes for gnome-shell to restart. > Many users will not wait, will not google to some strange env tuning, they > just drop *BSD and switch to linux/windows/MAC. And told friends that *BSD > is bad and ugly, because most apps freezes and crashes on simple operations. This is what patches attached here are for. I never see a crash caused by GFileMonitor kqueue backend after applying the latest patch. > Default poll time - 800ms, now try imagine that user open few > thunars/etc..., this will get to monitor few dozen dirs. Then user navigate > and get n*100 monitored folders. And all that n*100 monitored dirs updates > every 800ms. > Even on modern top CPU this will consume at least half of core, on mine > coreduo I get 100% after 5 minutes after active thunar use. > > Is it usable? > Is it good user experience? > > Switching poll on by default so ugly and terrible behavior - this is a BIG > mistake for all BSD systems. > Missing support of O_EVTONLY should be escalated to kernel developers. I agree that the current way of only using kqueue backend on whitelisted mount points can be improved. I think we can switch from the whitelist approach to a blacklist one. Instead of marking all unknown mount points as kqueue-unsafe, we can assume everything which is not mounted under a well-known directory used for mounting removable media, such as /media and /run/media, is safe to use kqueue. It should reduce the chance of switching to the polling fallback, and we can also provide an environment variable for users who want to disable the polling fallback completely.
> Many users will not wait, will not google to some strange env tuning, they > just drop *BSD and switch to linux/windows/MAC. And told friends that *BSD > is bad and ugly, because most apps freezes and crashes on simple > operations. This is a very important statement and very true. I abandoned FreeBSD for Linux some time ago because of the effects of this bug. Although not a problem with the core of FreeBSD, it is nearly impossible to avoid this issue as a desktop user. I hope to see this fixed so I can go back someday. > This is what patches attached here are for. I never see a crash caused > by GFileMonitor kqueue backend after applying the latest patch. This makes a big assumption about the user. I am a software developer and I tinker with my system frequently but would not apply some random patch to my system. A non-technical FreeBSD user most likely would not find this bug report, nor would they apply any patches here. The fix needs to come from FreeBSD package maintainers directly to be successful.
(Just so people don’t think I’ve abandoned this: it is on my review queue, and I will get to it and review the latest patches soon. Thank you, Martin, for the work you’ve done on this.) If people could keep the well-meaning but unnecessary comments to a minimum, that would be useful. There is a large CC list on this bug. Thank you.
(In reply to Philip Withnall from comment #55) > If people could keep the well-meaning but unnecessary comments to a minimum, > that would be useful. There is a large CC list on this bug. Thank you. Unsubscribed. Sorry to have troubled you.
(In reply to Ting-Wei Lan from comment #53) > > Well, during last 4 years that I use FreeBSD as desktop, fam in glib: > > I assume you mean GFileMonitor here. Calling it 'fam' is very confusing > because fam is the name of a different but related package. GFileMonitor > also has a fam backend. Yes, GFileMonitor too hard to remember and to long to write many times. :) > > - works ok (as I remember) first 8 month > > - was disabled 1 year > > - freeze and crash many apps ~2 years > > Crashing is a known problem, as you can see in this bug report, and yes, > unfortunately, we were unable to resolve it before importing GNOME 3.14 to > FreeBSD ports. GNOME 3 import had been delayed for many times, and we really > didn't want to delay it again. We just applied a workaround patch for > gnome-session and didn't consider it as a blocker bug. But at least FreeBSD can switch to libinotify GFileMonitor backend. Vladimir K report that it works stable and submit patch to bugzilla. > > Many users will not wait, will not google to some strange env tuning, they > > just drop *BSD and switch to linux/windows/MAC. And told friends that *BSD > > is bad and ugly, because most apps freezes and crashes on simple operations. > > This is what patches attached here are for. I never see a crash caused by > GFileMonitor kqueue backend after applying the latest patch. And I see small CPU usage in my backend, and looks fixed last potentialy crashes (senk to valngrind) few days ago. > > Switching poll on by default so ugly and terrible behavior - this is a BIG > > mistake for all BSD systems. > > Missing support of O_EVTONLY should be escalated to kernel developers. > > I agree that the current way of only using kqueue backend on whitelisted > mount points can be improved. I think we can switch from the whitelist > approach to a blacklist one. Instead of marking all unknown mount points as > kqueue-unsafe, we can assume everything which is not mounted under a > well-known directory used for mounting removable media, such as /media and > /run/media, is safe to use kqueue. It should reduce the chance of switching > to the polling fallback, and we can also provide an environment variable for > users who want to disable the polling fallback completely. I can try to add to my backend option to disable any open() on all mount points that not present in fstab. But I do not want add some timer polling, at least with interval less than 30sec. I already done "no watch for subfiles/subdirs" for all non local filesystems.
(In reply to Philip Withnall from comment #55) > If people could keep the well-meaning but unnecessary comments to a minimum, > that would be useful. There is a large CC list on this bug. Thank you. Sorry, but it is political, because it is critical for *BSD users and it is not fixed during 3+ years. No patches from glib devs, no work around (like switching to libinotify), no activity. Then I come here (because FreeBSD ports maintainers do no want get so big patch because it hard to maintenance - it is another sad story) with: https://bugzilla.gnome.org/show_bug.cgi?id=779777 - still no activity. Is glib/gnome alive? Is glib/gnome cares about user experience on *BSD? My point/target: I want to use FreeBSD on desktop and do not want freezes, crashes and other bad/annoying behavior. I use ms windows while many years, since 98se to win7 and never see crashing file explorer or file explorer that consume 100% CPU. Is I m wants too many or impossible things? I spent some time to investigate WTF on my workstation and found that current glib kqueue() backend has very ugly design. (I think all GFileMonitor sybsys need more love :) ). Ugly - mean that it can be fixed by series of patches. Many peoples scares to do many changes or totally remove old, but in this case all code unusable. Better to remove it and switch to libinotify than spent time trying to fix it. I do not understand why it is not done yet. Probably no one who have "commit bit" here and in FreeBSD does not use any *BSD for workstation. I write my own kqueue() backend and ask community to help with testing and integration to mainstream. Current codestyle/formatting and using OS staff instead of glib - is not problem. I prefer to change it after implement all futures and done with debuging. I want fix this annoying thing and forget about it for long time. Lets talk about tech things. 1. IMHO polling is bad. Current Polling is evil! Current polling implemented outside kqueue() backend and does not discussed here. Same time, as I point in prev messages, poll time = 800ms, if we have more than 50-100 monitored dirs than we get freezes and CPU -> 100% usage. If by some reasons polling need then at least poll time should be increased, IMHO 30 seconds is ok. And in this case I prefer to disable watching for files arrts/perm/sizes changes, only watch new/del/rename. Probably some one need this until O_EVTONLY implemented. 2. Some apps like thunar add multiple times same dirs to monitoring. This should be handled by some code that shared between all GFileMonitor backends. 3. GFileMonitor needs rate limit settings. I see some code from redhat in 2007 but it is undone. I am implement adaptive rate limit in my backend. I prefer to start with 1 event per second and increase x2 up to 8 seconds. Sets via options. 4. There is no limits for subfiles/subdirs in monitored dir. If some dir contain million dirs/files then we reach sys limit (and app will fail/unusable). IMHO reasonable limit is 128-1024, if more - then disable watching for files arrts/perm/sizes changes, only watch new/del/rename. 5. IMHO, for all non local file systems should be disabled watching for files arrts/perm/sizes changes, only watch new/del/rename. Multiple open()+stat()++ - takes many time on dir open=freezes, even then separate thread do this, because it done via one sys queue with high round-trip-time. Or have their own limits from p4. My backend do this. 6. Good to have tunable to disable watching for subdirs arrts/perm/sizes changes. 7. All fs syscals should be done in separate thread, to prevent any block and prevent any GUI freezes in cases where need many CPU time (for example: dir with 1m objects). Tell me that I am wrong or help me and all other *BSD users!
Review of attachment 368338 [details] [review]: This looks pretty good to me. I have a few comments, mostly minor. I think we should aim to get this in to the 2.58 release early in the cycle. It’s a bit late to get in for 2.56 with enough time for testing (https://wiki.gnome.org/Schedule). In the time between 2.56 and 2.58, it would be helpful if this was carried as a vendor patch by FreeBSD/OpenBSD (if they want to), and then we can get some real-world testing on it. ::: gio/kqueue/gkqueuefilemonitor.c @@ +160,3 @@ + struct timespec ts; + + memset(&ts, 0, sizeof(ts)); Nitpick: Missing space before `(` in two places here. @@ +189,3 @@ + * if possible: + * - G_FILE_MONITOR_EVENT_PRE_UNMOUNT + */ How hard would that be to do? @@ +216,3 @@ + if (mask) + g_file_monitor_source_handle_event (source, mask, NULL, NULL, NULL, now); + Nitpick: Spurious empty line. @@ +223,3 @@ + +static gboolean +g_kqueue_file_monitor_is_supported (void) This function could be called from any thread, so the kq_queue and kq_source variables need to be protected with a lock. See what _ih_startup() does in inotify-helper.c. @@ +241,3 @@ + kq_source = g_unix_fd_source_new (kq_queue, G_IO_IN); + g_source_set_callback (kq_source, (GSourceFunc) g_kqueue_file_monitor_callback, NULL, NULL); + g_source_attach (kq_source, GLIB_PRIVATE_CALL(g_get_worker_context) ()); Nitpick: Missing space before `(` in `GLIB_PRIVATE_CALL (`. @@ +266,3 @@ + +kqueue_sub* +_kqsub_new (const gchar *filename, GLocalFileMonitor* mon, GFileMonitorSource *source) This function can be static. Nitpick: Need a space before the `*` and not after in the arguments and return type. @@ +271,3 @@ + + sub = g_slice_new (kqueue_sub); + if (sub == NULL) g_slice_new() will never fail or return NULL, so this branch will never be taken. You might want to use g_slice_new0() just to make sure that everything is initialised. @@ +275,3 @@ + + sub->filename = g_strdup (filename); + if (sub->filename == NULL) g_strdup() will never fail (if GLib runs out of memory, it abort()s), so this branch will never be taken. @@ +285,3 @@ + sub->fd = -1; + sub->deps = NULL; + /* I think that having such flag in the subscription is not good */ Is this comment something which is going to be addressed later? Perhaps make it a ‘TODO’ comment? @@ +292,3 @@ + +void +_kqsub_free (kqueue_sub *sub) This function can be static. @@ +300,3 @@ + sub->source = NULL; + g_free (sub->filename); + sub->filename = NULL; No need to clear those two members to NULL before freeing the containing `sub`. @@ +305,3 @@ + +gboolean +_kqsub_cancel (kqueue_sub *sub) This function can be statuc. @@ +328,3 @@ + } + + close(sub->fd); Use g_close() so we use a consistent abstraction. Nitpick: Missing space before `(`. @@ +357,3 @@ + /* I know, it is very bad to make such decisions in this way and here. + * We already do have an user_data at the #kqueue_sub, and it may point to + * GKqueueFileMonitor or GKqueueDirectoryMonitor. For a directory case, There’s no such thing as GKqueueDirectoryMonitor. ::: gio/kqueue/kqueue-helper.h @@ +34,3 @@ + * @user_data: the pointer to user data + * @pair_moves: unused (currently not implemented) + * @fd: the associated file descriptor (used by kqueue) This documentation comment doesn’t match what’s in the struct. ::: gio/kqueue/kqueue-missing.c @@ +81,3 @@ +_kh_file_appeared_cb (kqueue_sub *sub) +{ + GFile* child; Nitpick: Space goes before the `*`, not after.
(In reply to Philip Withnall from comment #59) > (https://wiki.gnome.org/Schedule). In the time between 2.56 and 2.58, it > would be helpful if this was carried as a vendor patch by FreeBSD/OpenBSD > (if they want to), and then we can get some real-world testing on it. Hi Philip. That's already the case on OpenBSD. We've always run with Martin's latest patch iteration since day one :-) All of our official GLib packages have included it for a while now, including the most recent patch. Thanks and cheers!
Created attachment 368660 [details] [review] Multiple fixes and simplifications for the kqueue(2) backend, v3 New version of the diff that restore the fallback code without NULL dereference and addresses all previous comments. (In reply to Philip Withnall from comment #59) > Review of attachment 368338 [details] [review] [review]: > @@ +189,3 @@ > + * if possible: > + * - G_FILE_MONITOR_EVENT_PRE_UNMOUNT > + */ > > How hard would that be to do? This is not my comment. I removed it because it is confusing. The whole backend must be improved to generate the events glib wants. Currently gio/tests/testfilemonitor do not pass. This is a different issue though. My diff only improve the design of the backend and the kqueue(2) missuses. > @@ +223,3 @@ > + > +static gboolean > +g_kqueue_file_monitor_is_supported (void) > > This function could be called from any thread, so the kq_queue and kq_source > variables need to be protected with a lock. See what _ih_startup() does in > inotify-helper.c. Done. > @@ +285,3 @@ > + sub->fd = -1; > + sub->deps = NULL; > + /* I think that having such flag in the subscription is not good */ > > Is this comment something which is going to be addressed later? Perhaps make > it a ‘TODO’ comment? This comment is also not mine and it is not helping. I removed it. > @@ +357,3 @@ > + /* I know, it is very bad to make such decisions in this way and here. > + * We already do have an user_data at the #kqueue_sub, and it may > point to > + * GKqueueFileMonitor or GKqueueDirectoryMonitor. For a directory > case, > > There’s no such thing as GKqueueDirectoryMonitor. Same here.
(In reply to Martin Pieuchot from comment #61) > Created attachment 368660 [details] [review] [review] > Multiple fixes and simplifications for the kqueue(2) backend, v3 > > New version of the diff that restore the fallback code without NULL > dereference > and addresses all previous comments. Thanks for the updates. > (In reply to Philip Withnall from comment #59) > > Review of attachment 368338 [details] [review] [review] [review]: > > @@ +189,3 @@ > > + * if possible: > > + * - G_FILE_MONITOR_EVENT_PRE_UNMOUNT > > + */ > > > > How hard would that be to do? > > This is not my comment. I removed it because it is confusing. The whole > backend must be improved to generate the events glib wants. Currently > gio/tests/testfilemonitor do not pass. Are you saying that gio/tests/testfilemonitor does not pass *before* applying your patch, *after* applying your patch, or both? How does it fail?
Review of attachment 368660 [details] [review]: This looks good to me. As I said in comment #59, I’d like to get this into 2.58 early in the cycle. The fact that it’s getting tested (and has been for a while) on OpenBSD is great; but it would also be good to get some testing on FreeBSD and OS X. Could people running on those platforms try the latest patch out for a while and report back please? I’m assuming the kqueue() behaviour is the same on all three platforms, but it would be good to check.
(In reply to Philip Withnall from comment #62) > Are you saying that gio/tests/testfilemonitor does not pass *before* > applying your patch, *after* applying your patch, or both? Both, my patch doesn't improve the situation nor introduce regression.
Now that GLib 2.56 has been released and we have branched, I have pushed this to master (which will eventually become 2.58). I am going to close this bug. If any follow-up work or fixes are needed, please file new bug reports. I realise the patch has been carried by OpenBSD for a while (great!), but if distros could prioritise testing this early, rather than later, that would maximise the chance for any latent bugs being found. Along those lines, would anybody from the FreeBSD or OpenBSD communities be interested in running a CI server for GNOME? Having CI on BSD platforms would help prevent regressions in GLib's support for those platforms, and improve the testing all of our patches get. Get in touch with me (bugzilla@tecnocode.co.uk) if that is something anyone is interested in. It would be much appreciated. (We can now start setting up CI servers as part of our move to gitlab.gnome.org.)