GNOME Bugzilla – Bug 152000
A variety of focus race conditions in Metacity
Last modified: 2004-12-22 21:47:04 UTC
There's a race condition affecting focus consistency for both sloppy and mouse focus. If you have three windows on your desktop (let's call them A, B, and C), with A focused and on top of B (or at least with the upper right corner over B) and C visible elsewhere, then clicking the close button for window A and rapidly moving the mouse to window C can trigger this problem. Despite the mouse being in window C, you can end up with window B having focus in sloppy focus, or end up with no window having focus in mouse focus. The cause of this race condition is the patch in bug 131582 that was applied. That patch was needed at that time, because we had a dual policy of "(A) sloppy and mouse focus means a window gets focus when the mouse enters them. (B) when a window is closed or minimized, we should focus the mru window" Since (A) and (B) were sometimes contradicting policies, we had races (which could be triggered even without moving the mouse, let alone rapidly). Because of bug 135810, we no longer have contradicting policies, and thus the patch in bug 131582 is not needed. Reverting it solves this race condition with window close and rapid mouse movement, and doesn't regress the previous bugs. I'll attach the patch in a moment.
Created attachment 31334 [details] [review] Revert the patch in bug 131582
I'll take your word for it on this, the patch looks sensible to me since it removes code ;-)
I committed this, but further testing shows that this only fixes part of the issue. Before, if we moved the mouse fast enough, the EnterNotify for window C would be ignored and thus window C wouldn't be focused at all. Now, if we move the mouse fast enough, window C will be focused but due to the asynchronousness of things it may be focused before window B is. So we can still end up with the wrong window having focus. I played around a bit and logging event->xany.serial for all these different events gives us the proper ordering; we could probably use that to make sure that focus is correct (only focus a window if the serial of the event causing it to be focused is greater than the serial causing the previous window to have been focused)
Ignore the crap above about serials. It's really invasive, since it's just trying to do the same thing as timestamps which means a lot of extra argument passing. It's really stupid, because XSetInputFocus supports timestamps but not serials. So: One way to fix this would be to make sure we always provide a valid timestamp to meta_window_focus (because then XSetInputFocus knows how to do the right thing). In particular, this means making meta_workspace_focus_default_window take a timestamp. That looks hard. The easy way to fix this is to just allow EnterNotify events to do the focusing like we claim we're doing. In other words, if the mouse is over a window don't manually focus it (since we know that an EnterNotify event will occur that will cause it to get focus). focuses from EnterNotify events do have valid timestamps, so this works great. I'll attach the patch in a minute.
Created attachment 31596 [details] [review] Allow EnterNotify events to do their job Note that we do have to do one additional thing in this patch because of how events can be ordered. By not focusing a window in focus_default_window, we might get a notification that no window has focus before the EnterNotify will happen. In this case, we want to ignore the workaround in display.c of focusing the no_focus_window because we know the EnterNotify will happen and focus a window for us.
Um...we ignore EnterNotify events on upon workspace switching, so this makes things look a little weird. I think this patch is part of the solution, but that we need the rest of the solution before acting. I know, I need to be smacked upside the head. ;-) I'll test my next solution much more thoroughly before posting.
Joy oh joy! There's multiple focus race conditions! The race condition I described above is much easier to trigger on one of my machines than another. In fact, applying the attachment 31334 [details] [review] patch makes the race condition mentioned here impossible for me to trigger on my main machine at home--which is why I thought the patch was sufficient to fix the problem (I had tested the patch, but only on that machine). Anyway, as part of my more thorough testing, I have discovered that focus races exist under at least 3 additional similiar conditions. With the attachment 31334 [details] [review] patch, some are harder or impossible for me to trigger, but some still persist. And I can trigger at least one of the focus races using click-to-focus (which, of course, the patches I have generated don't help with), so this isn't specific to sloppy and mouse focus. Anyway, I haven't finished all my testing to determine all the cases where I can trigger races. I'm going to do some more testing and coding and I'll report back later... Hopefully this heavily repetitive testing (many cases are hard to trigger and require lots of attempts) doesn't give me carpal tunnel syndrome in the meantime. :-)
Created attachment 31672 [details] The results of my extensive testing I thought it might be useful to provide a write-up of all the testing I did and all the race conditions I have uncovered. So this is a short (in comparison to how much work went into generating it) text file about that. Now that I've identified all the ones I'm able to personally trigger, I can get back to making a patch and seeing which of these I can prevent with it...
Okay, so I have some patches that I've been running with for a few days, just to make sure that there's no surprises. I've also done further stress testing. I believe that the patch removes some race conditions entirely, but that some of the race conditions remain for all focus modes. However, even for those that I suspect remain, they are /much/ harder to trigger (and in many such cases, I have been unable to trigger at all) There are some interesting things about this bug (bugs?) too. It is virtually impossible to trigger any of these race conditions when the system is under load (i.e., if I'm running some big computation when trying to do the testing, or am compiling, or if I'm running Metacity under valgrind). Further, the machine that I'm most easily able to trigger the race conditions under is a dual processor system. (Note that my single processor machine at home is a little faster than the processors in my SMP system at school, but it's still easier to trigger this on my school machine). So it appears that a fast unloaded, system coupled with SMP make these race conditions easier to detect. The race conditions that I have been able to trigger even with the patch I'll attach in a minute are cases 6, 8, and 9 for click-to-focus. And, as I said before, these three cases are harder for me to trigger than before (and they were already fairly difficult cases--definitely more so than most the other cases on the list). I have been unable to trigger any of the race conditions for sloppy or mouse focus methods (whereas without the patch, the race conditions are definitely easiest to trigger under those focus methods).
Created attachment 32034 [details] [review] Patch to fix race conditions (or at least make them harder to trigger) This patch basically implements what I said would work in comment 4--add a timestamp to focus_default_window so that XSetInputFocus calls have correct timestamps and can thus do their work correctly. However, unlike I originally envisioned (and I started writing a patch for), I didn't keep propogating timestamp passing to more and more functions throughout the code--doing that was kind of ugly, and besides, in most cases we just didn't have a timestamp anyway. For the calls to focus_default_window that wouldn't have a readily available timestamp (and for which meta_display_get_current_time only returns CurrentTime), I added a call to a new function: meta_display_thou_shalt_return_a_timestamp. This new function first tries meta_display_get_current_time, and if that doesn't provide a timestamp then it gets one manually from the X server. I would really rather not have to use this new function, but many of the race conditions remain and are reasonably easy to trigger without it. We could add a warning to that function so that whenever it is called, people get a reminder that it'd be nicer if we could find another way to get a real timestamp for the various cases. *shrug* Finally, I also remove the place where we currently use the metacity sentinel property. It was causing part of the race condition problems for sloppy and mouse focus, and removing it does not cause any ill effects (since the bug it was designed to fix was fixed in another way--see the comment I added yesterday to bug 110970)
you've got a if (timestamp == CurrentTime && FALSE) in there; probably want to figure out what that's supposed to be. I think we can come up with a better name for meta_display_thou_shalt_return_a_timestamp. Perhaps meta_display_retrieve_timestamp or _get_current_time_roundtrip to make clear that its a server roundtrip and should be avoided like the plague. Does this patch make the focus sentinel unnecessary? I notice that you removed the code in idle_calc_showing that sets the focus sentinel, and also the code in the event handler that checks the sentinel. The original reason for it was race conditions on workspace switching, where we'd get an EnterNotify event on a window on the new workspace after we'd already tried to set our own default focus window. I don't see how having more timestamps would remove the need for that, since the timestamp on the EnterNotify will come after the timestamp of the event which triggered the workspace switch. But regardless if we're removing the code that uses the sentinel we should remove all of it and not leave little bits of it anywhere.
Oops, yeah, I forgot about that. Currently, timestamp will never equal CurrentTime in that function, and thus the "&& FALSE" is useless (for now). I had tested the if portion of the code in attachment 31596 [details] [review], and in preliminary patches I wanted to force testing of the else portion of the loop even in the cases where I didn't yet have a valid timestamp. The "&& FALSE" was meant as a temporary measure to allow me to do that--and I forgot to remove it. The reason the if portion of the patch is there is in case later patches add calls to focus_default_window but a timestamp isn't provided. We really shouldn't allow that to happen (and thus the warning emitted when it does), but if it does then that if section of the loop is there to provide the best possible behavior given the circumstances. Yeah, I figured it needed a better name, but I couldn't think of a better one. _get_current_time_roundtrip sounds good to me--I'll make the change. I addressed the focus sentinel stuff in comment 10. My patch doesn't make it unnecessary--it was already unnecessary (and introduced race conditions), as I explained in my comment to bug 110970. That comment in bug 110970 also explains why I disagree with removing all the code regarding the use of the sentinel (namely, that I think it's needed infrastructure for handling bug 151996).
If you need the code for bug 151996, should we leave in the call to focus_sentinel_clear in display.c:event_callback?
Oh, right. Yeah, I'll add that line back in.
Created attachment 32171 [details] [review] Updated patch Okay, this gets rid of the &&FALSE, renames to get_current_time_roundtrip, and returns the meta_display_focus_sentinel_clear call. Thinking about this further, there's a way we could avoid the roundtrip. If we added a field of most_recent_time to the MetaDisplay struct and add if (display->current_time != CurrentTime && display->current_time > display->most_recent_time) { display->most_recent_time = display->current_time; } right after the setting of display->current_time, then we could probably use the value of display->most_recent_time instead of querying the server in get_current_time_roundtrip. This idea had occurred to me before, but I rejected it for some reason. I don't remember the exact reason, but I believe it was probably unfounded. Would that be worth testing?
I think Owen can probably explain why "most recent timestamp" isn't as good as the round trip better than I can, but it amounts to "it doesn't fix race conditions" iirc.
Comment on attachment 32171 [details] [review] Updated patch If the function is user_lower_and_unfocus() doesn't that mean it's in response to user action and we should have a key/button event timestamp? In fact this is called from the button press handler in frames.c, so why not just pass in a timestamp. I like adding the _roundtrip() function anyhow, just so we have it. meta_screen_get_mouse_window() is a round trip, are you calling that in response to every event? That's going to be bad, and also a race condition I think. adding timestamp arg to workspace_activate() which seems like the bulk of the patch seems very sensible. Thanks as always
Also, try to keep your code below 80 columns wherever possible. Function calls can be changed to put one argument in each line if they have wide or a large number of arguments.
Rob: I tried to do that, but in each case where I let lines extend beyond 80 columns, putting arguments 1-per-line would not fix the problem. That is, unless I violate a different problem such as making all arguments be left aligned, or having all arguments appear just to the right column-wise of where the function name is. If there's something better I could do, could you demonstrate with a concrete example? I'm trying to follow the style I've found elsewhere in the code and I'm willing to do better if I can see how to do so. I guess I could put them one-per-line anyway, since that would at least make it closer to 80 columns. Is that what you mean?
Created attachment 32185 [details] [review] Patch with changes suggested by Havoc I went ahead and made user_lower_and_unfocus take a timestamp. Good point on the meta_screen_get_mouse_window stuff. It didn't call that on every event, but rather only on every FocusIn/FocusOut event received _for the root window_. I could move the code inside the if section of code below so that it would be called less frequently, but I think you're right that calling this function may be a race condition. Further, it's only purpose was to avoid a race condition which would probably be quite difficult to trigger (I hadn't tried but from my other experience I expect that it would be very rare if I could trigger at all) so it might be introducing a bigger bug than it solves. Besides, hopefully we can make that section of code unnecessary anyway eventually with an X extension (bug 125492). So, in summary, I just pulled it out as you suggest. Okay to apply?
Comment on attachment 32185 [details] [review] Patch with changes suggested by Havoc It looks OK to me. On the 80-column issue generally I'd rather keep the other formatting rules than the 80-column rule, when they conflict.
committed.