GNOME Bugzilla – Bug 761102
Increase performance for main loop
Last modified: 2018-05-24 18:30:38 UTC
Created attachment 319699 [details] [review] First patch Trying to understand why glib loop was doing too much iterations came out that code is waking up the loop too much. Basically many poll record change cause wake_up event to be triggered. Also the event is not cleared the first time causing other iterations. I wrote too patch which are working fine on my system.
Created attachment 319700 [details] [review] Second patch Limit event acknowledges. Note that break on average cases should limit the loop.
ping
Thanks Frediano, you caught a real bug there. (Please stick to the GLib (GNU) coding style and put opening braces on their own line.) Matthias, context->wake_up_rec.revents is currently read out before the new poll results are applied, the patch figures the correct revents field to flush notifications. Please apply.
*** Bug 750206 has been marked as a duplicate of this bug. ***
I'm not familiar enough with this code to have any opinion on this. I'll leave this to Allison or Owen.
Review of attachment 319699 [details] [review]: This patch makes a good deal of sense and, indeed, it ought to remove a useless iteration of the loop in the very common case of adding a source from the currently-running thread. Thanks!
Review of attachment 319700 [details] [review]: Good catch, but I don't like the solution. Why not just move this fragment down a bit to the part after the revents are copied out of the fds array and into the pollrecs?
(In reply to Allison Ryan Lortie (desrt) from comment #7) > Review of attachment 319700 [details] [review] [review]: > > Good catch, but I don't like the solution. > > Why not just move this fragment down a bit to the part after the revents are > copied out of the fds array and into the pollrecs? Because that part is not reached as context->poll_changed is true.
(In reply to Matthias Clasen from comment #5) > I'm not familiar enough with this code to have any opinion on this. > I'll leave this to Allison or Owen. Yeah - I *am* familiar enough, checked the patch and gave approval. (In reply to Frediano from comment #8) > > Good catch, but I don't like the solution. > Because that part is not reached as context->poll_changed is true. Exactly, thanks Frediano for replying on why the logic shouldn't be changed.
(In reply to Tim Janik from comment #9) > > Exactly, thanks Frediano for replying on why the logic shouldn't be changed. Unfortunately the patch it's quite old and my memory not so good and it's hard to retrieve back all checks I did :( I remember I used strace and lot of debug. Surely a single thread is enough to test.
Any news?
Pushed the second patch with some cosmetic changes and adapted to apply to the current code.
I have opened bug 768968 about a regression in a test after the first patch.
This is causing hangs in QEMU. While this may be a bug in QEMU, depending on how glib handles backwards bug-compatibility you may want to revert this patch. glib is expecting QEMU to use g_main_context_acquire around accesses to GMainContext. However QEMU is not doing that, and instead it is taking its own mutex. Therefore, the g_wakeup_signal is culled by Frediano's patch. QEMU does the following: thread A thread B ------------ ------------ take mutex for (;;) { prepare query release mutex ppoll() g_source_attach take mutex check dispatch } Because thread A doesn't use acquire/release, the g_source_attach is not waking up thread A to recalculate the pollfds. In all fairness, the docs do say "You must be the owner of a context before you can call g_main_context_prepare(), g_main_context_query(), g_main_context_check(), g_main_context_dispatch()". However, it has worked until now and the documentation does not say exactly why that is necessary...
As far as I understand Qemu relied on an undefined behaviour of GLib. As fas as I'm concern users of any API should not rely on undefined behaviour and this should be considered a user (Qemu) bug and not a library (GLib) bug.
> As fas as I'm concern users of any API should not rely on undefined behaviour > and this should be considered a user (Qemu) bug and not a library (GLib) bug. That's the theory, but backwards compatibility sometimes matter more than the theory. There are lots of forks of QEMU in existence that will never have the bugfix backported to them, for example.
And we _are_ going to fix QEMU of course.
The docs are pretty clear on this. Are any *other* applications affected by this? So far, this is the first report. Which isn't surprising since I suspect the vast majority of GLib mainloop users are just doing the highlevel g_main_context_iteration(). Paolo, we could consider carrying a revert of this in downstream locations where it's affecting qemu, if it's hard to get qemu changed. We could revert this for e.g. a release cycle and give time for the fixed qemu to propagate. (Has a qemu patch for this been written yet?)
> The docs are pretty clear on this. Actually, now that I read them again, they aren't clear at all. They say "You must be the owner of a context before you can call g_main_context_prepare(), g_main_context_query(), g_main_context_check(), g_main_context_dispatch()." They do not say that you need to be the owner of a context between those calls. In particular, if you release the context after g_main_context_query returns and reacquire it before g_main_context_check, it would cause the bug again: thread A thread B ------------ ------------ for (;;) { acquire prepare query release ppoll() g_source_attach acquire check dispatch release } > Paolo, we could consider carrying a revert of this in downstream locations > where it's affecting qemu, if it's hard to get qemu changed. No, it's easy and we're testing the fix over the weekend. Once it's tested I'll also backport it to Fedora 25. I'm worried more about the endless amount of forks (including the Android emulator) and about the above imprecision in the document. However I wonder if it's possible to change the patch so that it does the wakeup if no one has acquired the context. This should still provide the optimization for the users that are doing g_main_context_iteration in a loop, and it would fix both QEMU and the hypothetical case at the top of this comment. For example something as simple as this: - if (context->owner && context->owner != G_THREAD_SELF) + if (context->owner != G_THREAD_SELF) g_wakeup_signal (context->wakeup); (for all three occurrences) should do it.
Created attachment 349077 [details] [review] Create a helper function for owner wakeup optimization Just a quick preparatory patch here.
Review of attachment 349077 [details] [review]: Looks good.
Review of attachment 349077 [details] [review]: Pushed.
Created attachment 349194 [details] [review] [PATCH] gmain: Signal wakeups if context has never been acquired as well
Would be great to add these as test cases. I honestly still think this as a Qemu workaround, not a fix. When I wrote the initial patch I was thinking a different implementation of this. Before preparing the poll list reset the condition. This should be done inside the lock as all conditions changes are done inside the lock.
Review of attachment 349194 [details] [review]: > /* Note that if the context has never been acquired, context->owner > * will be NULL, and hence we'll signal a wakeup in that case, as > * well as the case where it's owned by another thread. This is > * intended to ensure backwards compatibility with at least qemu's > * use of GMainContext. I would rephrase the comment, since it also applies when the context is not currently acquired. But I think that even this is not enough, because there is another race, that is not fixed by this patch: thread A thread B ------------ ------------ for (;;) { acquire prepare query g_source_attach release ppoll() acquire check dispatch release } and I'm not sure if there is any way to work around this. Debian Code Search brought up at least two instances of this idiom: 1) Faumachine http://sources.debian.net/src/faumachine/20160511-1/lib/glue-gui-gtk.c/?hl=1796#L1796 2) GNU Smalltalk (I wrote that one too and it's pretty much dead, so alone it would not be a big deal) Plus another package (http://sources.debian.net/src/afterstep/2.2.12-9/src/ASMount/main.c/?hl=1129#L1129) which seems not to use acquire/release like QEMU did.
@Paolo - But this bug requires a "thread B", not just an instance of the "exploded loop" pattern. Some quick use of grep says Faumachine does not call any g_source_attach (or indirectly via g_timeout_* etc.) from multiple threads. (Does smalltalk?) I doubt Afterstep does. My instinct here is to expedite a fix for (older versions of) QEMU as quickly as possible. The "revert" option feels like a big hammer when we have a known fix for that in hand. That said: @Frediano - Mmm, that sounds promising, yes.
Actually @Paolo - in your example (let's call this case "acquire-exploded"?) - won't the value of context->owner be "Thread A" and hence we *will* queue a wakeup? Or what is the bug?
> Actually @Paolo - in your example (let's call this case "acquire-exploded"?) - > won't the value of context->owner be "Thread A" and hence we *will* queue a > wakeup? Or what is the bug? Nevermind, context->owner is protected by (UN)LOCK_CONTEXT, not by the "acquire lock" itself. So your patch should be fine for the time being. But then, is anything even protected by the "acquire lock"? I am convinced that threads are a red herring, they don't really matter. Until Frediano's patch, g_main_context_is_owner and g_main_context_wait were the only functions for which acquire/release mattered. Nothing else looked at context->owner. Now, g_main_context_wait is kinda deprecated (it's only runtime-deprecated because EFL uses it, bug 733537) and it is not clear from the docs why external locking (acquire/release) would be required except in combination with g_main_context_wait. The only reason is that the docs say so. If the application knows that the context is only used by one thread, it should not need to acquire/release anything! What you really want to check is the phase in the prepare/query/poll/check/dispatch sequence: - finished preparing, n_ready > 0: no need to wakeup until next prepare - finished preparing, n_ready == 0: need to wakeup - started check: no need to wakeup until next prepare Because conditional_wakeup runs under (UN)LOCK_CONTEXT, the phase cannot change during conditional_wakeup and this works irrespective of which thread is calling g_source_attach. I'll attach a patch as soon as I finish testing.
Created attachment 349219 [details] [review] [PATCH] gmain: only signal GWakeup right before or during a blocking poll
I believe I have verified Paolo's patch (comment 29). This was my method. (1) Compile qemu from git (d9123d09f711b). This does NOT contain any fix for acquiring the main context lock. (2) Compile glib2 from git (028a597d645). This contains Colin's patch (comment 23), but not Paolo's patch. (3) Run virt-rescue (NB: you will need virt-rescue >= 1.37): LIBGUESTFS_BACKEND=direct \ LIBGUESTFS_HV=~/d/qemu/x86_64-softmmu/qemu-system-x86_64 \ LD_LIBRARY_PATH=~/d/glib/gthread/.libs:~/d/glib/gio/.libs:~/d/glib/gmodule/.libs:~/d/glib/gobject/.libs:~/d/glib/glib/.libs \ virt-rescue --scratch (4) In the rescue shell, run: ><rescue> while true; do ls -l /usr/bin; done Observe that there are occasional hangs. => This reproduces the original bug. (5) Apply Paolo's patch to glib2 & recompile. (6) Repeat test steps (3) and (4). Observe that the occasional hangs either go away, or become so rare that I did not see any. I ran the loop for about 20 minutes. => This verifies the fix.
Review of attachment 349219 [details] [review]: ::: glib/gmain.c @@ +3591,3 @@ } g_source_iter_clear (&iter); + context->need_wakeup = (n_ready == 0); I'll be honest, I'm not an expert in this mainloop code. I can somewhat convince myself this is correct, but it doesn't feel as "obviously correct" as the (above) "thread owner" patch. Particularly looking at it from the risk standpoint. The "thread owner" approach might result in spurious wakeups, but it feels very unlikely to result in lost wakeups. In contrast, this patch is also going to affect the single-threaded case, and changes the logic for how we compute those wakeups. And while spurious wakeups aren't a big deal, lost wakeups are. I see where this logic is going, but I'm leaning towards applying the thread owner patch. Maybe I just need to think more about this patch though, and certainly open to other's opinions here. @@ +3726,3 @@ TRACE (GLIB_MAIN_CONTEXT_BEFORE_CHECK (context, max_priority, fds, n_fds)); + /* We do need to be woken up during check or dispatch, because s/do/don't/?
> > g_source_iter_clear (&iter); > + context->need_wakeup = (n_ready == 0); > > I'll be honest, I'm not an expert in this mainloop code. I can somewhat > convince myself this is correct, but it doesn't feel as "obviously correct" as > the (above) "thread owner" patch. Particularly looking at it from the risk > standpoint. I am okay with your "thread owner" patch for stable releases, but I would prefer mine for devel. Even though it's not found in the wild, the "thread owner" patch still has a hole for the "exploded loop" pattern, together with deprecating acquire/release. I will open another bug if you apply your patch, and maybe change it slightly so that it is easier to understand. > > + /* We do need to be woken up during check or dispatch, because > > s/do/don't/? Yes. Of course I picked the third (wrong) option between "don't need to be" and "need not be". I'll avoid the passive voice and use "need no wakeup".
(In reply to Paolo Bonzini from comment #25) > Review of attachment 349194 [details] [review] [review]: > > > /* Note that if the context has never been acquired, context->owner > > * will be NULL, and hence we'll signal a wakeup in that case, as > > * well as the case where it's owned by another thread. This is > > * intended to ensure backwards compatibility with at least qemu's > > * use of GMainContext. > > I would rephrase the comment, since it also applies when the context is > not currently acquired. Isn't that what the first part of the first sentence says? =)
(In reply to Paolo Bonzini from comment #32) > > > > g_source_iter_clear (&iter); > > + context->need_wakeup = (n_ready == 0); > > > > I'll be honest, I'm not an expert in this mainloop code. I can somewhat > > convince myself this is correct, but it doesn't feel as "obviously correct" as > > the (above) "thread owner" patch. Particularly looking at it from the risk > > standpoint. > > I am okay with your "thread owner" patch for stable releases, but I would > prefer mine for devel. Mmm, splitting the difference. OK.
> > I would rephrase the comment, since it also applies when the context is > > not currently acquired. > > Isn't that what the first part of the first sentence says? =) Perhaps instead of "Note that if the context has never been acquired", "note that if no thread is currently owning the context". But just go ahead with whatever wording you prefer, hopefully it's only for stable glib.
Review of attachment 349194 [details] [review]: With just the small nit on the initial /* */ (as in comment 35). Sorry for the delay, I wasn't sure if you needed an explicit "reviewed" tag to commit.
Review of attachment 349219 [details] [review]: https://git.gnome.org/browse/glib/commit/?id=0c0469b56d7e6b2533760d5d821076c88b05dfb0
Backported to 2-52 and 2-50 at least.
Responding to comment 28: > Until Frediano's patch, g_main_context_is_owner and g_main_context_wait were the only functions for which acquire/release mattered. Nothing else looked at context->owner. ... If the application knows that the context is only used by one thread, it should not need to acquire/release anything! Yes, but `g_main_context_invoke()` uses `g_main_context_is_owner()`, and a *lot* of things in GLib use `g_main_context_invoke()`. It's a general pattern we want to encourage. That said in the end I think it is mostly an optimization - it feels like the "exploded loop" pattern mostly should work without it.
Yeah, this makes sense. Though then I wonder if it's okay for g_main_context_invoke_full to invoke the function immediately when there is a stack of thread-default contexts, and @context is on the stack but not on the top. That is, thread1: g_main_context_push_thread_default(context1); g_main_context_push_thread_default(context2); g_main_context_invoke(context1, func, NULL); g_main_context_pop_thread_default(context2); Should func be invoked immediately, or after the pop? Both have advantages and disadvantages. If the latter, the is_owner check would be done second and (like the g_main_context_default case) it would be just for compatibility. g_main_context_push_thread_default would still call g_main_context_acquire/release of course, but hidden under a much saner API that logs a critical message if the acquire fails.
Review of attachment 349219 [details] [review]: OK, I - Rebased this on top of the "no owner" patches (which were pushed to master, and stable branches) - added some more comments - Convinced myself this made sense - Verified that we have *some* test coverage of this in glib/tests/mainloop.c (at least test_ready_time()) - pushed: https://git.gnome.org/browse/glib/commit/?id=0c0469b56d7e6b2533760d5d821076c88b05dfb0
Can 9ba95e25b74a be backported to 2.52? Currently 2.52 causes LibreOffice to hang: https://bugs.archlinux.org/task/53730
(In reply to Jan Alexander Steffens (heftig) from comment #42) > Can 9ba95e25b74a be backported to 2.52? Currently 2.52 causes LibreOffice to > hang: > > https://bugs.archlinux.org/task/53730 Pushed to glib-2-52 as 5d74233. Colin, is there anything else to do on this bug?
Some of the scripts at https://github.com/swilmet/gnome-c-utils (e.g. gcu-smart-c-comment-substitution) act strangely after commit 9ba95e25b74adf8d62effeaf6567074ac932811c on X11. The script hangs and pressing the Ctrl key makes the script to continue its execution (it's a command line utility, there is no GUI). What gcu-smart-c-comment-substitution does is to call several times gtk_main(). I don't know if GLib is supposed to support that, but it worked fine before and it was in my opinion an easy way to write the script. So what the script does is basically: int main (int argc, char **argv) { int i; for (i = 1; i < argc; i++) { launch_async_operation_on_file (argv[i]); /* gtk_main_quit() is called in a callback * when the operation is finished. */ gtk_main (); } } Before commit 9ba95e25b74adf8d62effeaf6567074ac932811c, the script executed fine on X11. But on Wayland it has never worked: it hangs after the first gtk_main_quit(). So I don't know if GLib is supposed to support that use-case. If yes, it's broken on Wayland, and on X11 it's broken (differently) after commit 9ba95e25b74adf8d62effeaf6567074ac932811c (I've done a git bisect). I've tried with a GMainLoop instead of the GTK main loop, and there is the same problem on X11. It's of course possible to refactor the script to call gtk_main() only once, to also make it work on Wayland. But I wanted to report the bug anyway.
Hmm. gtk_main.c has some nontrivial logic here...it's not clear to me whether it intends to support what you're doing or not. As far as GLib - yes, there may be a regression here indeed if a context is reused across multiple iterations. Hmm...we don't signal a wakeup if we have a ready source, but what if the source is removed in dispatch?
I think this is causing hangs in WebKit, but only the last commit, the one adding need_wakeup. Let me first explain how the WebKit RunLoop implementation works. RunLoop uses a single GSource that is used to dispatch tasks. So, when a task is dispatched in the loop (from any thread) it's queued and the source ready time is set to current to schedule it immediately. The source dispatch sets the ready time to -1 again and calls the callback that performs all the queued tasks. We are doing in this way, because we used to use g_idle_add but creating and destroying the source every time we wanted to schedule a task to the main loop was very inefficient. In WebKit there are a lot of secondary threads running a RunLoop each and with multiple timers. Since commit 9ba95e25b74adf8d62effeaf6567074ac932811c the main loop sometimes doesn't wake up after setting the ready time to current. It's not easy to reproduce, but it's happening in some cases. When this happens in a worker thread running a main loop with that single source, the loop might end up polling forever. The hangs are happening when a process connection tries to send a sync IPC message to another process. The connection has a worker thread to do the socket IO, so what we do is something like this: sendSyncMessage() - queue sync message - worker RunLoop schedule dispatch send pending messages - wait for sync reply If the worker thread is polling and the schedule dispatch doesn't wake up the context, the message is never sent. With non sync messages this not a problem because sooner or later another message is sent, and quite likely the context will be woken up next time (this is not easy to reproduce). But with sync message the connection is blocked waiting for the sync reply, so no other messages are sent. See this backtrace:
+ Trace 238279
Since it's not easy reproduce I can't confirm at 100%, but I haven't been able to reproduce the hangs after reverting 9ba95e25b74adf8d62effeaf6567074ac932811c (or by making g_source_set_ready_time() wake up the context unconditionally). Looking at the code the only way I think this could happen is if ready_time is set right after g_main_context_prepare() and before g_main_context_query() in g_main_context_iterate(). At that point the context lock has been released, and need_wakeup has been set to FALSE, because the ready time was -1 during the prepare(). Does this make sense? I guess it's a very short period of time, but that's why it's difficult to reproduce (and almost impossible to write a test). I don't know if the need_wakeup optimization is valid in all other places where conditional_wakeup is called, so maybe we can add an additional check only to set_ready_time, maybe bring back the previous owner check. And this could be fixed in WebKit itself by always doing an explicit wakeup after set ready time, but I prefer not to do that unless we are misusing set_ready_time API in WebKit.
I don't think you are misusing set_ready_time, but there must be something more to this bug: > need_wakeup has been set to FALSE, because the ready time was -1 during the > prepare(). need_wakeup should actually be *TRUE* if the ready time was -1. That's because when source->priv->ready_time ==1, the source is not ready. If there are no other sources, then /* See conditional_wakeup() where this is used */ context->need_wakeup = (n_ready == 0); would have actually requested a wakeup. In fact, if need_wakeup is TRUE, then you also have context->timeout = 0; from the same "if" that increments n_ready. This should cause poll not to block (and that's *why* you don't need a wakeup).
Sorry, that should have read "if need_wakeup is FALSE, then you also have context->timeout = 0".
(In reply to Paolo Bonzini from comment #47) > I don't think you are misusing set_ready_time, but there must be something > more to this bug: > > > need_wakeup has been set to FALSE, because the ready time was -1 during the > > prepare(). > > need_wakeup should actually be *TRUE* if the ready time was -1. That's > because when source->priv->ready_time ==1, the source is not ready. If > there are no other sources, then > > /* See conditional_wakeup() where this is used */ > context->need_wakeup = (n_ready == 0); You are right, I don't know why I assumed the condition was n_ready > 0. > would have actually requested a wakeup. In fact, if need_wakeup is TRUE, > then you also have > > context->timeout = 0; > > from the same "if" that increments n_ready. This should cause poll not to > block (and that's *why* you don't need a wakeup). hmm, then the issue must be somewhere else.
I haven't had a hang since I reverted the patch locally, so it's clear that 9ba95e25b74adf8d62effeaf6567074ac932811c introduced the bug. However, I don't know exactly how and why, I'm not familiar enough with the main loop code, so I can't really propose a fix, just workarounds. The bug is a race condition, which makes it difficult to reproduce when you want to, but it can easily happen when normally using epiphany for a while (and it's very annoying). Could we revert the patch until we figure it out? I'm afraid that applying any workaround will hide the actual problem forever.
In comment #44 it's the same culprit commit that caused the regression, commit 9ba95e25b74adf8d62effeaf6567074ac932811c.
I’ve reverted 9ba95e25b74adf8d62effeaf6567074ac932811c on master and glib-2-54 while someone (who?) figures out what the problem is. 4976e8109 (glib-2-54, origin/glib-2-54) Revert "gmain: only signal GWakeup right before or during a blocking poll" e91c11841 (master, origin/master, origin/HEAD) Revert "gmain: only signal GWakeup right before or during a blocking poll"
Apparently GLib 2.54.3 (probably caused by the revert in comment #52, but nobody’s actually established that) has broken timedatex: https://github.com/mlichvar/timedatex/issues/4#issuecomment-356598256 They have a minimal reproducer of their issue there.
The handling of this issue looks a bit weird to me. Shouldn't we have some strong tests on how we want these events to be handled? If we have a new issue we should analyze the new problem, see if is something broken on GLib or not, in the first case write a new test on how should be handled that new case and fix it otherwise tell the other project how to use properly Glib, maybe adding some documentation if is lacking some detail. The fact that we keep adding and reverting looks like we don't understand or agree on how we want the APIs to behave or we don't have enough tests to validate the code.
LibreOffice is also having problems again, now (comment #42). https://bugs.archlinux.org/task/57036 Can the entire chain of optimizations that caused this mess be reverted until someone figures out what went wrong where?
(In reply to Jan Alexander Steffens (heftig) from comment #55) > LibreOffice is also having problems again, now (comment #42). > > https://bugs.archlinux.org/task/57036 > > Can the entire chain of optimizations that caused this mess be reverted > until someone figures out what went wrong where? If you read comment 42 (https://bugs.archlinux.org/task/53730 link) you can see this was already fixed by a following patch.
Which got reverted (comment 52).
FWIW: I'm fairly confident that reverting 9ba95e25b74adf8d62effeaf6567074ac932811c did indeed fix WebKit. I have not seen any web processes hang since I upgraded glib. So if we revert the revert to fix LibreOffice, we will likely break WebKit. Now, Carlos has already pointed out that this is easy enough to work around in WebKit, but if g_source_set_ready_time() does not work reliably, any other applications relying on g_source_set_ready_time() to schedule a callback will surely be broken too. So we probably should not do that....
(In reply to Michael Catanzaro from comment #58) > FWIW: I'm fairly confident that reverting > 9ba95e25b74adf8d62effeaf6567074ac932811c did indeed fix WebKit. I spoke too soon, I've seen two more web process hangs today. :( Still, the rest of my comment stands.
In other words, my patch fixed LibreOffice and, if anything, it made preexisting bugs happen more frequently. Comment #44 says that the code in question has never even worked on Wayland. Either everything is reverted (my patch and Frediano's), or it's pointless to go on with the finger pointing. There should be minimal testcases for at least comment #44, to understand what's going on and whether it should be fixed for Wayland or considered a bug. Maybe add code to gmain.c that detects the problematic situation. In the meanwhile, WebKit seems to have issues of its own, and from comment #50 in fact it's not clear at all to me that the culprit is g_source_set_ready_time().
(In reply to Michael Catanzaro from comment #59) > (In reply to Michael Catanzaro from comment #58) > > FWIW: I'm fairly confident that reverting > > 9ba95e25b74adf8d62effeaf6567074ac932811c did indeed fix WebKit. > > I spoke too soon, I've seen two more web process hangs today. :( Did you check the backtrace to ensure it's the same hang? I haven't seen a hang since I upgraded glib and they used to be quite frequent (several times every day). > Still, the rest of my comment stands.
I agree we should probably revert all the chain and try to come up with a new single patch. I'll be happy to test any new patch proposed. It would be ideal to try to cover all the cases in unit tests, but I understand it's not easy since these bugs are mostly race conditions.
(In reply to Carlos Garcia Campos from comment #62) > I agree we should probably revert all the chain and try to come up with a > new single patch. I'll be happy to test any new patch proposed. It would be > ideal to try to cover all the cases in unit tests, but I understand it's not > easy since these bugs are mostly race conditions. One you understand the cause of the race (which is the hard part) usually is not hard to write a unit test, just question on where to insert a sleep. Are there any core of these hangs or the root causes of them? Some informations are in the comments, I'll try to came up with something.
(In reply to Carlos Garcia Campos from comment #61) > Did you check the backtrace to ensure it's the same hang? I haven't seen a > hang since I upgraded glib and they used to be quite frequent (several times > every day). No I did not. I highly doubt it is the same hang, because I trust your debugging when you say that 9ba95e25b74adf8d62effeaf6567074ac932811c broke g_source_set_ready_time(), even if we don't know how. I imagine you spent a long time debugging to come to that conclusion. There are many unrelated bugs in WebKit that could cause a web process hang. It was misleading of me to fail to mention that. (In reply to Frediano Ziglio from comment #63) > One you understand the cause of the race (which is the hard part) usually > is not hard to write a unit test, just question on where to insert a sleep. It seems hard to me, unless you're proposing to insert a sleep inside gmain itself.
> I imagine you spent a long time debugging to come to that conclusion. I don't know about the time he spent, but I know his reasoning was wrong (comment 49). All I see here is people "doubting" that the problem is in their code, reverting left and right without understanding neither the patches nor the consequences. Instead of reverting the patch, you should have just done - if (context->need_wakeup) - g_wakeup_signal (context->wakeup); + g_wakeup_signal (context->wakeup); in conditional_wakeup which would have brought the logic to what it was before Frediano's patch (and would not have broken LibreOffice). Here is some reasoning about why my patch is correct: 1) Conditional_wakeup should always be called under LOCK_CONTEXT. Unlocked access to context->need_wakeup should not be an issue. (or at least it wasn't when the patch was first written). 2) context->need_wakeup is set in exactly two places, g_main_context_prepare and g_main_context_check. There are not that many ways to order g_source_set_ready_time against prepare/poll/check. The logic of g_main_context_prepare is explained in comment ... g_main_context_check disables wakeups after check entirely; a g_source_set_ready_time before check sees need_wakeup equal to TRUE and wakeups, a g_source_set_ready_time after check is evaluated by the next g_main_context_prepare before and after my patch. And in the meanwhile we had commit a4686b8ea18c585666edece7b92147a92fcfb841 ("g_source_set_ready_time: Move no-op fast-path under the lock") which fixed _another_ bug in g_source_set_ready_time. It's quite plausible that my patch, by removing pointless wakeups, made that bug more likely to happen. Can anyone try again glib with my patch applied, and see if WebKit is still broken?
(In reply to Paolo Bonzini from comment #65) > And in the meanwhile we had commit a4686b8ea18c585666edece7b92147a92fcfb841 > ("g_source_set_ready_time: Move no-op fast-path under the lock") > which fixed _another_ bug in g_source_set_ready_time. It's quite plausible > that my patch, by removing pointless wakeups, made that bug more likely to > happen. True, that is quite plausible....
Comment on attachment 349194 [details] [review] [PATCH] gmain: Signal wakeups if context has never been acquired as well (This was committed ages ago.)
Comment on attachment 319700 [details] [review] Second patch (A modified version of this was committed ages ago as 48ea710ee.)
Created attachment 366926 [details] [review] gmain: Partial revert of recent wakeup changes to gmain.c This reverts the following commits (but keeps the other recent changes to gmain.c): • e4ee3079c Do not wake up main loop if change is from same thread • 208702404 main: Create a helper function for "owner wakeup" optimization • 0c0469b56 gmain: Signal wakeups if context has never been acquired as well • 9ba95e25b gmain: only signal GWakeup right before or during a blocking poll Some combination of them is causing problems with LibreOffice and/or WebKit, and the safest thing to do at the moment is revert them all until we work out what’s going on. The previous revert (4976e8109) was not sufficient (it fixed WebKit, but re-broken LibreOffice). By reverting, we gain some spurious wakeups, but avoid dropping necessary wakeups, which is presumably what’s causing problems in the other modules. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Right, here’s a patch to revert the various changes from this bug. It applies to master and glib-2-54. If someone could give this a review, I’ll push it, and then at least the current problem of LibreOffice and/or WebKit being broken will be solved. Since QEMU has been shipping a fix for the original problem for almost a year, I am less concerned about that. Older versions of QEMU will break again, though. I do not currently have time to investigate this properly and work out where the problem with the fixes is. I suggest that the WebKit and LibreOffice people revert the reverts on GLib master, add some debug printfs to: • conditional_wakeup() • any changes to context->in_check_or_prepare • any changes to context->need_wakeup and then post a log from that when you next see a WebKit/LibreOffice hang (don’t worry about explicitly trying to trigger a hang; just wait until one shows up during normal usage), plus a full backtrace of all threads. That’s the only way we’re going to make significant headway with working out what’s going on here. Thanks.
Please don't do this. Shotgun reverting makes no sense if you read all the latest comments. Michael said "I have not seen any web processes hang since I upgraded glib" but upgrading glib also brought in commit a4686b8ea18c585666edece7b92147a92fcfb841 which fixed a (real, not presumed!!) bug in g_source_set_ready_time. It goes without saying, that commit was just one "git log -- gmain.c" away and should have been noticed by whoever first reverted my patch. Now, WebKit people should reapply my patch and try to reproduce the hangs again. If they don't see a problem, LibreOffice can simply be fixed by leaving the optimization in.
Created attachment 366932 [details] [review] DO NOT PUSH gmain: Add debugging to need_wakeup This is to be applied on top of attachment #366926 [details] from bug #761102 in order to debug missing wakeups in gmain.c. It is a revert of commits e2e4055fd498fd97aa58018796b0c0b279c55026 and e91c11841808ccca408da96136f433a82b2e2145, plus some g_printerr() calls. Signed-off-by: Philip Withnall <withnall@endlessm.com>
(In reply to Paolo Bonzini from comment #71) > Please don't do this. Shotgun reverting makes no sense if you read all the > latest comments. Reading all the latest comments, I see people saying one thing and then changing their minds a few comments later. I want to see test cases for these changes (for the different situations described in comments #14, #19, #25, …) before they go in again. Coming up with reasoning for why a given patch should work is necessary, but won’t help 2 years down the line when some other changes in gmain.c cause a regression. > Now, WebKit people should reapply my patch and try to reproduce the hangs > again. If they don't see a problem, LibreOffice can simply be fixed by > leaving the optimization in. Attachment #366932 [details] might be better in this respect: it takes master back to the state where all the patches are applied, but also adds some g_printerr()s which should help debug the state leading up to any hangs which do occur.
> Reading all the latest comments, I see people saying one thing and then > changing their minds a few comments later Yup, that's what I have complained about in comment #65 already. > Attachment #366932 [details] might be better in this respect: it takes master > back to the state where all the patches are applied, but also adds some > g_printerr()s which should help debug the state leading up to any hangs which > do occur. That might very well mask race conditions, so it is not a good idea.
Review of attachment 366926 [details] [review]: Let's certainly do this on glib-2-54, where we shouldn't be taking any more risks. But Paolo has a point: the complaint was that g_source_set_ready_time() was broken, and a patch just went in fixing a race condition there. So my suggestion is to un-revert his patch on master, and then see if there are actually any remaining problems or not. If someone complains again, then we can push this to master as well. Thanks for the debug patch.
That's fair enough!
I think https://github.com/mlichvar/timedatex/issues/4 was a clear case of where code was relying on the spurious wakeups. One thing I'd still like to standardize better is the "run mainloop until termination condition" pattern. See for example https://github.com/ostreedev/ostree/commit/9f3d5869935c7f75fa06b0cfbabf9a49cd90f0f2 and https://github.com/ostreedev/ostree/commit/82b756783be8c0bcbf7b84d2694dcbca9f58cd0e timedatex was trying to do something like this as well. Now you can see in my original patch I was (correctly) calling g_main_context_wakeup(). But it's very very subtle. One option is to add a `g_main_context_set_i_know_about_main_context_wakeups()`... Anyways I think it's clear we need to carry a revert for a while in stable branches right? Even with e.g. timedatex fixed we need to handle the case of new glib but old timedatex etc.
hello problem also known here with Libreoffice 5.4.4-2 on Debian Sid with Cinnamon with glib2 (version 2.54.3-1) but apparently the problem is not known with Libreoffice 6 in the repository experimental. known problem also on my ArchLinux Cinnamon but repaired with the version of glib2 (version 2.54.3-2) it is possible to compile the self-same glib 2.54.3-2.?? or an update is planned for Debian Sid also.?? thank you.
hello problem also known here with Libreoffice 5.4.4-2 on Debian Sid with Cinnamon with glib2 (version 2.54.3-1) a core CPU busy at 100%, bedlam of noise of the fan and heats up fast of the pc. but apparently the problem is not known with Libreoffice 6 in the repository experimental. known problem also on my ArchLinux Cinnamon but repaired with the version of glib2 (version 2.54.3-2) it is possible to compile the self-same glib 2.54.3-2.?? or an update is planned for Debian Sid also.?? thank you.
Hi, Same issue here: Debian 9 4.14.0-2-amd64 libglib2.0-0/testing 2.54.3-1 amd64 libreoffice-core/testing 1:5.4.4-1 X1 Carbon i7 Starting LibreOffice will "eat up" 25% of the CPU power, and make the cooler fan spin very fast. Cheers, Tony
Can people please not post ‘me too’ comments here. It is not helpful. Please use your distribution’s support channels.
Comment on attachment 366926 [details] [review] gmain: Partial revert of recent wakeup changes to gmain.c I have pushed the reversion patches to glib-2-54 and master. As per comment #73, I would like to see some unit tests before the changes go back into master again. I am not happy with relying on the fact that g_source_set_ready_time() has been changed independently to assume that the issues might be fixed. Waiting on feedback from LibreOffice/WebKit developers from running with the debug patch (or otherwise trying to reproduce the bug with Paolo and Frediano's patches applied). Attachment 366926 [details] pushed as 94b38be - gmain: Partial revert of recent wakeup changes to gmain.c
Pushed to master as 1e6803be3 gmain: Partial revert of recent wakeup changes to gmain.c --- For any distributions who are CCed, please package the latest patches on the glib-2-54 branch (specifically, 94b38beff, since the 2.54.3 release), and let us know if users continue to have issues with LibreOffice or WebKit (or any other apps which use the main context oddly). Please don’t direct end users here.
Spend some to try to dig on the issues I wrote this test: https://github.com/freddy77/glib_tests/blob/873847ab2953dd77977e4d3485ac0f5c8c02579f/mainloop.c Trying with various commits: - e4ee3079c5afc3c1c3d2415f20c3e8605728f074 "Do not wake up main loop if change is from same thread" Attaching a source from another thread if not owned does not work. This is more or less the Qemu issue - 48ea710ee38c1b54e30e5df411d841c8974cce0d "Do not use revents as not updated" No changes in regards of the tests - 0c0469b56d7e6b2533760d5d821076c88b05dfb0 "gmain: Signal wakeups if context has never been acquired as well" This tests was supposed to fix everything. Actually all test cases I wrote are broken. This caused the timedateex and LibreOffice CPU consumption problem. - 9ba95e25b74adf8d62effeaf6567074ac932811c "gmain: only signal GWakeup right before or during a blocking poll" This patch fixed all test cases I wrote (currently 3) - e91c11841808ccca408da96136f433a82b2e2145 "Revert "gmain: only signal GWakeup right before or during a blocking poll"" This reverted last basically making all test cases fail again. Note that after 9ba95e25b74adf8d62effeaf6567074ac932811c only the WebKit had issues and the WebKit issue is a race condition. After 9ba95e25b74adf8d62effeaf6567074ac932811c there are d73f8eec4839e35f308d90a321b4bc121f8408e4 ("gmain: Make GSourceCallback thread-safe") and a4686b8ea18c585666edece7b92147a92fcfb841 ("g_source_set_ready_time: Move no-op fast-path under the lock") which are fixing some thread race conditions. In particular last one (a4686b8ea18c585666edece7b92147a92fcfb841) seen related on how WebKit uses GLib (see comment 46).
(In reply to Frediano Ziglio from comment #84) > Spend some to try to dig on the issues Thanks a lot, that’s good work. From a very brief look at them, those test cases look like they should be integrated into GLib eventually. If you have a bit more time, would you be able to try and reproduce the WebKit race with a minimal example? As a race, it’s probably less of a candidate for turning into a unit test in GLib, but it would be good to have a reproducer to test against on this bug.
(In reply to Philip Withnall from comment #83) > For any distributions who are CCed, please package the latest patches on the > glib-2-54 branch (specifically, 94b38beff, since the 2.54.3 release), and > let us know if users continue to have issues with LibreOffice or WebKit (or > any other apps which use the main context oddly). Please don’t direct end > users here. Thanks, I've applied it to Fedora's glib2-2.54.3-2.fc27 build which is on its way to the updates-testing repo now. I'll report back when testers have gotten their hands on it.
> If you have a bit more time, would you be able to try and reproduce the WebKit > race with a minimal example? As a race, it’s probably less of a candidate for > turning into a unit test in GLib, but it would be good to have a reproducer to > test against on this bug. The WebKit race has never been analyzed. Perhaps you meant commit a4686b8ea; that one may be reproducible if you run a testcase in a loop for a few seconds.
This is becoming a mess of commits and reverts, what I've done is reverting 1e6803be3b3ec5005335f1fa7d53e57bbc0b8bba (the last revert) and e91c11841808ccca408da96136f433a82b2e2145 (the one that made WebKit work again). So, now I have also a4686b8ea18c585666edece7b92147a92fcfb841 (the one moving the early return under the lock). It didn't take too much to get a hang, similar to previous one, when sending a sync IPC message, see the bt:
+ Trace 238343
Thread 1 (Thread 0x7f4fc8119a80 (LWP 24005))
I'll try again with the commit adding debug printfs.
It took a bit more with the debug, but I managed to make it hang. I've included the pid in the logs, because I need to use different tabs to reproduce the issue, and there's also the UI, network and storage processes. The one that usually hangs is the web process. Here is the full log: https://people.igalia.com/cgarcia/wk-glib-dbg.txt.xz The blocked process is 11264, to make it easier, I've grepped its log and moved it to its own file: https://people.igalia.com/cgarcia/wk-glib-dbg-11264.txt.xz Logs are huge, but I hope this helps. I'll take a look at them too.
Got it! Using this command: xzcat wk-glib-dbg-11264.txt.xz | perl -ne 'next if !/^5291155/; s/ 11264://; print' | less you can see that 2 thread try to wake up the same loop (running in another thread) but using the same monotonic time (as expressed in microseconds) so the g_source_set_ready_time check "if (source->priv->ready_time == ready_time)" for the second thread is taken and conditional_wakeup is not even called. The loop thread (0x55f98b8bb4f0 in the log) is stuck with need_wakeup == 1 waiting for an event that won't arrive. Not sure about what the fix should be. Basically g_source_set_ready_time is called with same value from 2 thread and "at the same time" is processed once instead of twice. Does not seems wrong to me if the second is ignored.
I'm currently trying without the early return in set_ready_time. If set ready time is called again with the same value and wake up is not needed, it won't happen in any case. And the check was already inside the lock, so it shouldn't affect the performance, I guess.
> you can see that 2 thread try to wake up the same loop (running in another > thread) but using the same monotonic time (as expressed in microseconds) so the > g_source_set_ready_time check "if (source->priv->ready_time == ready_time)" for > the second thread is taken and conditional_wakeup is not even called. > The loop thread (0x55f98b8bb4f0 in the log) is stuck with need_wakeup == 1 > waiting for an event that won't arrive. I cannot make much sense of this. Carlos, can you add a trace for set_ready_time too?
It's still hanging, even without the early return :-( I'll add another debug message to set_ready_time
I have a new trace with set_ready_time: https://people.igalia.com/cgarcia/wk-glib-ready-time-dbg.txt.xz 17680 is the web process
And what is the hung thread?
This is with name soiurces included in ready time: https://people.igalia.com/cgarcia/wk-glib-ready-time-name-dbg.txt.xz I don't know the pid of the web process in this case, sorry.
(In reply to power3d from comment #80) > Hi, > > Same issue here: > > Debian 9 > 4.14.0-2-amd64 > libglib2.0-0/testing 2.54.3-1 amd64 > libreoffice-core/testing 1:5.4.4-1 > X1 Carbon i7 > > Starting LibreOffice will "eat up" 25% of the CPU power, and make the cooler > fan spin very fast. > > Cheers, > Tony Hi, For those who really need to use any of the components of LibreOffice, and you can't wait for LO 6.0 or a glib fix, you can avoid the CPU issue if you uninstall LibreOffice and, then, only install the component you need. For instance, I just needed Calc and Writer. I uninstalled the whole LibreOffice thing, and installed only Calc and Writer. No CPU problem anymore. It must be some of the other components causing the issue. It worked for me. I hope it helps.
(In reply to Kalev Lember from comment #86) > (In reply to Philip Withnall from comment #83) > > For any distributions who are CCed, please package the latest patches on the > > glib-2-54 branch (specifically, 94b38beff, since the 2.54.3 release), and > > let us know if users continue to have issues with LibreOffice or WebKit (or > > any other apps which use the main context oddly). Please don’t direct end > > users here. > > Thanks, I've applied it to Fedora's glib2-2.54.3-2.fc27 build which is on > its way to the updates-testing repo now. I'll report back when testers have > gotten their hands on it. I promised to report back here -- glib2-2.54.3-2.fc27 was moved to the stable updates repository a while ago and I haven't heard of any regressions from the reverts we backported there.
Arch Linux has released glib2-2.54.3+2+g94b38beff-1 two weeks ago and it seems to be running fine as well.
This didn’t get done for 2.56. ⇒ 2.58
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1128.