GNOME Bugzilla – Bug 785468
glib/gpoll: Unnecessary if conditional included in the poll_rest() function
Last modified: 2018-05-24 19:45:57 UTC
Created attachment 356446 [details] [review] Patch to fix the issue In the poll_rest() function in glib/gpoll.c there is an unnecessary if condition that isn't required. I have included a patch that fixes the issue.
Is there any test case that made you look at this code?
Review of attachment 356446 [details] [review]: This looks correct. The code in question comes from a combination of commit 605682521 and commit f911ead38 by tml. He originally used memmove, but then, due to a bug, decided to use a simple loop instead, and didn't check that the loop condition matched an earlier if statement.
is the extra condition showing up in any profile? even if always true, it seems a good thing to check before accessing the array.... if it is always true and it is not a hot path, maybe it should be a g_assert
This is an equivalent of > for (i = 0; i < n_items; i++) > do_something (items[i]); where n_items is the size of the items array. If it happens to be 0, then the for() loop does not execute even one iteration. This is the same thing, just it's > i=something; i < n_items The condition is not always true. The "ready" can be anything in [0,nhandles). If it happens to be "nhandles-1", then there's no need to adjust the array, and the for() loop will do no iterations. Otherwise it will iterate between "ready" and the end of the array, shifting the items to account for the item at the index "ready" being removed. (There's a separate question in my mind - why is it necessary at all to remove an item and then recursively call the same function again? AFAIU, this is tml's lazy way of checking the handles' readiness without actually writing any code for doing that. I assume that this also works around the presumed fact that WaitForMultipleObjectsEx() returns as soon as one handle is ready and does not check the rest). (Another question is whether shifting the array is necessary - it might be sufficient to move the last handle in place of the handle being removed, and just reduce the array size by 1). I generally find this code to be kind of loopy. Someone should probably try to rewrite it to be more clear and less recursive, and see if that does anything to performance (my expectation is that the performance would increase).
I think I could have done a better job explaining why I sent this patch. I was seeing some slowness issues running QEMU on Windows (which uses Glib) and as part of trying to fix it up I was digging around this code. The real problem with the if statement is that it doesn't subtract WAIT_OBJECT_0 from ready before doing a comparison. According to Miscosoft's documentation this is required. Luckily WAIT_OBJECT_0 is always 0, so this doesn't cause any problems, but it possibly could in the future? As part of fixing the if statement I realised that it isn't actually needed as the for loop handles it for you. This really is a very small change and won't cause any fixes, but I did find the logic around here pretty confusing at first and hopefully this will help a little bit. As for fixing the recursion that seems pretty difficult. It is possible to get the Wait calls to wait for all objects, but I don't think that is a great solution as it could cause long delays.
My main issue with this recursive waiting is the fact that the code ignores a feature of WaitForMultipleObjectsEx() - the "ready" value it returns (when it isn't negative to indicate some alternative conditions) is the index of the *first* handle that made it wake up. That is, the documentation claims that all handles located in the array before that index were not ready. Thus checking *other* handles for being ready should only be done on the handles with indices > than ready. Even if that is still done with recursion, there is no need to copy handles around, just pass &handles[ready+1] as the array and "nhandles - ready" as its size. Unless tml knew something about WaitForMultipleObjectsEx() that i don't. Also, after-wait checking (once we know that at least one handle is ready) can be done using multiple WaitForSingleObject() calls in a loop, instead of calling WaitForMultipleObjectsEx() recursively (or, really, you can call WaitForMultipleObjectsEx() in a loop as well). I'm not sure what the performance impact that would have - that's something that has to be measured first. Also, the code does a full > for (f = fds; f < &fds[nfds]; ++f) cycle for every recursive invocation, looking for FDs that match the handles. This is suboptimal. It should be possible to allocate an extra array of pointers to GPollFD objects earlier, along with the handle array, and populate it at the same time the handle array is populated. Then later on locating the right FD object would be trivial, as it will be in the FD array at the same index as the handle being considered.
Ah! I think that would actually improve performance a lot. If you recursively passed down only the handles after the one that just fired you could also recursively pass down the timeout (something which doesn't happen now) and then use the Windows blocking calls a lot more. I could actually look into that, the problem is that I don't have any nice way to test performance (or really test at all). Is there something in Glib that I could use for this?
No, AFAIK. Though it should not be overly difficult to write a test. That said, why would you want to pass down the timeout recursively? Once you wake up from a timed wait once, you don't need to do timed waits again within on g_poll() call.
Created attachment 356509 [details] [review] gio: add a simple gpoll performance test It just creates a number of socket pairs, then triggers read-ready status on these pairs in different patterns (none, one, half, all) and checks how much time it takes to g_poll() those. This code works on W32. It *should* work on other platforms, but i didn't test it there at all (not sure whether it compiles at all). The g_main_context_new() is necessary to initialize g_poll() debugging on W32. How to use: * (cd builddir/glib && make libglib-2.0.la) * (cd builddir/gio/tests && make check TESTS= -j && ../../libtool --mode=execute ./gpoll.exe)
Created attachment 356514 [details] [review] glib - optimize W32 g_poll() 1) Do for (f = fds; f < &fds[nfds]; ++f) only once at the beginning 2) Preserve pointers to GPollFD structs in a special array. Use these later on to figure out which GPollFD corresponds to which handle, without looping through all the GPollFDs again. 3) Use a pointer to GPollFD that wants G_WIN32_MSG_HANDLE polled instead of a gboolean that indicates the desire to poll for messages. Again, this is eliminates the need to loop through all the GPollFD structs to find the one with G_WIN32_MSG_HANDLE. 4) Don't call poll_rest() recursively. Split poll_rest() into 2 functions - poll_wait(), which just calls Wait*() functions, and poll_loop(), which interprets the results and calls poll_wait() again in a loop, if needed. It's technically possible to merge poll_wait() into poll_loop(), but unnecessary (gcc probably inlines it already). 5) Don't use SleepEx(t), use WaitForSingleObject (GetCurrentProcess(), t). This is supposed to be more efficient. 6) Trust WaitForMultipleObjectsEx() to return the index of the *first* handle that made it wake up, thus assuming that all handles before it were inactive. Therefore, there's no need to rearrange the handles array - just take the pointer to all the items after "ready", and use that as the new (shorter) array. 7) Count the triggered handles, return that number from g_poll(). Previous code had some comments that indicated that this was not always done. 8) Use spaces instead of tabs. Because i can.
Created attachment 356515 [details] [review] glib - make W32 g_poll() use single-handle wait for late-checkups Instead of looping around and calling WaitForMultipleObjectsEx() again, with 0 timeout, call WaitForSingleObject() multiple times, also with 0 timeout. Requires attachment 356514 [details] [review].
I have not yet looked at the patch but I would expect: 1. the result of the test before and after the test 2. one patch for each of those points that you put there so it is easier to fix a possible regression
Okay. Here it is. Running the g_poll test on my machine suggests that attachment 356514 [details] [review] (less iterations, no recursion) improves average performance a bit, and that attachment 356515 [details] [review] (use single-handle wait) improves it even more. Minimum wait time also improves. Maximum wait time does not. One concern i have is that this code ditches the MSG wait as quickly as it can. Therefore there could be situations where it never reports that messages are available, because other handles make it wake up before it gets around to check for messages (Internet says, MsgWaitForMultipleObjectsEx is implemented as a wrapper around WaitForMultipleObjectsEx(), with an extra handle in the array *at the end*). There are two ways around this. One - don't NULL wait_msg_fd and keep using Msg-wait function while checking other handles (this applies to the first modification, which still uses multiple calls to WaitForMultipleObjectsEx() in a loop - it will do MsgWaitForMultipleObjectsEx() instead). Two - use a special function to check for messages, once MsgWaitForMultipleObjectsEx() wakes up. A simple PeekMessage (&msg, NULL, 0, 0, PM_NOREMOVE) should do. There may be subtle differences in the way message processing is interacted with (PeekMessage() looks simple, but can do a lot). I'm not sure whether things will get better or worse (performance-wise as well as correctness-wise) when doing this. This applies to the second modification, which calls WaitForSingleObject() multiple times. It will just do an extra check after this loop.
Original code (more than 1000ns - that's the worst time i was able to get): empty poll time: 6ns - 40ns, average 10ns 1-socket poll time: 9ns - 28ns, average 17ns half-socket poll time: 62ns - 702ns, average 197ns 63-socket poll time: 85ns - 1071ns, average 233ns No recursive wait: empty poll time: 5ns - 20ns, average 10ns 1-socket poll time: 7ns - 31ns, average 14ns half-socket poll time: 58ns - 131ns, average 115ns 63-socket poll time: 80ns - 296ns, average 158ns Single-handled waits: empty poll time: 6ns - 22ns, average 10ns 1-socket poll time: 7ns - 527ns, average 33ns half-socket poll time: 24ns - 225ns, average 86ns 63-socket poll time: 22ns - 366ns, average 89ns Though the values change somewhat between runs (average values - not much, min and max values change more), and this might not be as representative as i want. Maybe i should increase the number of iterations from 100 to 1000... As for splitting the patch into smaller parts - that's doable, but i want us to come to an agreement about the contents, before i do the splitting.
(In reply to LRN from comment #13) > Okay. Here it is. > Running the g_poll test on my machine suggests that attachment 356514 [details] [review] > [details] [review] (less iterations, no recursion) improves average > performance a bit, and that attachment 356515 [details] [review] [review] (use > single-handle wait) improves it even more. Minimum wait time also improves. > Maximum wait time does not. > > One concern i have is that this code ditches the MSG wait as quickly as it > can. Therefore there could be situations where it never reports that > messages are available, because other handles make it wake up before it gets > around to check for messages (Internet says, MsgWaitForMultipleObjectsEx is > implemented as a wrapper around WaitForMultipleObjectsEx(), with an extra > handle in the array *at the end*). > > There are two ways around this. > > One - don't NULL wait_msg_fd and keep using Msg-wait function while checking > other handles (this applies to the first modification, which still uses > multiple calls to WaitForMultipleObjectsEx() in a loop - it will do > MsgWaitForMultipleObjectsEx() instead). > > Two - use a special function to check for messages, once > MsgWaitForMultipleObjectsEx() wakes up. A simple PeekMessage (&msg, NULL, 0, > 0, PM_NOREMOVE) should do. There may be subtle differences in the way > message processing is interacted with (PeekMessage() looks simple, but can > do a lot). I'm not sure whether things will get better or worse > (performance-wise as well as correctness-wise) when doing this. This applies > to the second modification, which calls WaitForSingleObject() multiple > times. It will just do an extra check after this loop. For this I would like to initially go as conservative as possible, that's why I was saying to split things in smaller pieces so we can test them step by step by running the unit test of glib/libsoup etc on each patch.
Created attachment 356539 [details] [review] gio: add a simple gpoll performance test v2 v2: * Sends Windows message in some subtests, and asks g_poll() to poll for it too. * New subtests - a couple of older subtests, but with messages, and a new subtest that changes the number of active sockets in each iteration. * Buckets-based time measurement. This gives somewhat better view of the timing changes (a fine-grained graphical plot would have been better, but seems impractical at this point). * The number of iterations is increased to 1000 * Slight delays before g_poll() calls, to ensure that all pollable objects have enough time to change their state.
Created attachment 356540 [details] [review] glib/gpoll W32: use WFSOE() instead of SleepEx() WaitForSingleObjectEx() is supposed to be a more efficient sleep method. It waits on the handle of the current process. That handle will be signaled once the process terminates, and since we're *inside* the process, it'll never happen (and if it does, we won't care anymore). The use of an alertable wait ensures that we wake up when a completion routine wants to run.
Created attachment 356541 [details] [review] glib/gpoll W32: fold f->revents = 0 into for() loop GCC most likely optimizes that already, but no harm in trying.
Created attachment 356542 [details] [review] glib/gpoll W32: pass along GPollFD ptr for msg Instead of just indicating that messages should be polled for, save the pointer to GPollFD that contains G_WIN32_MSG_HANDLE, and pass it along. This way it won't be necessary to loop through all fds later on, searching for G_WIN32_MSG_HANDLE.
Created attachment 356543 [details] [review] glib/gpoll W32: faster GPollFD lookup Put all ptrs for GPollFDs that contribute handles into an array, the layout of which mirrors the handles array. This way finding GPollFD for a handle is a simple matter of knowing this handle's index in the handles array (which is something we always know). Removes the need to loop through all fds looking for the right one. And, with the message FD also passed along, it's now completely unnecessary to even pass fds to poll_rest() at all.
Created attachment 356544 [details] [review] glib/gpoll W32: trust WFMOE() return value WaitForMultipleObjectsEx() returns the index of the *first* handle that triggered the wakeup, and promises that all handles with lower index are still inactive. Therefore, don't check them, only check the handles with higher index. This removes the need of rearranging the handles (and, now, handle_to_fd) array, it's enough to take a pointer to the next item and use it as a new, shorter array.
Created attachment 356545 [details] [review] glib/gpoll W32: use a loop instead of recursion Do multiple WaitForMultipleObjectsEx() invocations in a loop, don't use recursion. Also, pay more attention to counting the number of activated handles for the return value.
Created attachment 356546 [details] [review] glib/gpoll W32: use multiple calls to WFSO() instead of WFMOE() On the assumption that setting up WaitForMultipleObjectsEx() is more expensive, try to use WaitForSingleObject() when we don't need to wait anymore. This commit also adds explicit message check (using PeekMessage()). Previous code, in line with the original tml commits, set msg_fd to NULL (and, earlier, passed poll_msgs as FALSE) on subsequent iterations or recursive invocations, thus only checking for messages once, at the beginning. This doesn't seem to make any difference in practice (the tests pass either way), but seems safer. Also, this commit makes the polling of other handles unconditional (doesn't check that timeout == 0). This will likely reduce performance in cases when handles are activated sparsely.
Created attachment 356547 [details] [review] glib/gpoll W32: simple statistic to decide between WFMOE() and WFSO() Instead of always using multiple calls to WaitForMultipleObjectsEX() (works faster when few handles are activated) or WaitForSingleObject (works faster when lots of handles are activated), remember the results of the last 10 invocations, and use that to decide which branch to take. I can't say how this interacts with CPU branch predictions. Probably badly. This code also uses static variables and is not thread-safe.
All of the above is based on attachment 356446 [details] [review] (although it eventually overwrites its changes). The last two attachments (attachment 356546 [details] [review] and attachment 356547 [details] [review]) are somewhat controversial. Using WFSO() is faster when a lot of handles are activated, but a bit slower otherwise. The use of a primitive statistical predictor is an attempt to get the advantages of both WFSO() and WFMOE() methods, but in my experience its results are literally a halfway blend between the two, not the best of both worlds. I'll try to provide measurement results for each patch later on. That said, it's a *lot* of results. Also, they still change a bit between runs, though buckets help a lot in presenting the overall picture unchanged. Still, small performance gains/losses can be lost due to this variance.
Not to mention the fact that attachment 356547 [details] [review] is not thread-safe (and there's probably no way to make it thread-safe without losing more performance, and i don't have a way to test its performance in multithreaded environment anyway).
Another thing that we should think about which might be related is the usage of a tree of handles. I recall we had a hard limit of 256 (was this the limit?) handles that we can wait for...
It's 64, actually (the value of MAXIMUM_WAIT_OBJECTS). Or 63, if you check for messages (message check consumes one slot internally). And yes, the way around this is to spawn multiple threads, each waiting on 63 handles + 1 wakeup event (to be able to wake it up, once some other thread wakes up). And the main thread waits on these threads instead (and on messages, if needed). This increases the handle limit from 63 to 63*63 = 3969. When handle number is <= 63 a fastpath could be used where the main thread just does the waiting by itself, without using threads. Obviously, the use of threads slows things down, as you need to wait for them to synchronize.
Created attachment 356552 [details] timing 867b5e6 00 unpatched
Created attachment 356553 [details] timing 867b5e6 01 fold-f-revents
Created attachment 356554 [details] timing 867b5e6 02 pass-msg-ptr
Created attachment 356555 [details] timing 867b5e6 03 faster-gpollfd-lookup
Created attachment 356556 [details] timing 867b5e6 04 trust-WFMOE-return-value
Created attachment 356557 [details] timing 867b5e6 05 loop-instead-of-recursion
Created attachment 356558 [details] timing 867b5e6 06 use-multiple-WFSO-calls
Created attachment 356559 [details] timing 867b5e6 07 statistic
What i think so far: attachment 356544 [details] [review] is the first one that gives consistent timing improvement. Any improvements that previous attachments might have had are just drowned in the noise. attachment 356545 [details] [review] doesn't seem to have any effect (if anything, it looks slightly slower) attachment 356546 [details] [review] causes dramatic timing shifts. Low-activity cases are roughly 2x slower, while high-activity cases are roughly 3x, 4x or even 5x faster. attachment 356547 [details] [review] tries to have the same speed at low-activity cases, while improving high-activity cases, although the speed increase is roughly half of what attachment 356546 [details] [review] gives. Considering the fact that it's not thread-safe, i don't see it as a good trade-off. Another thought: i'm only testing g_poll() itself here. But one could look at the bigger picture: in high-activity cases you have multiple handles that are ready once g_poll() finished, which means more handles that will need to be _check()'ed. I don't know how long these checks take, but they are likely to be at least as slow as the waiting functions. Therefore, any gains we get in g_poll() might be lost once the post-polling phase hits. Whereas in low-activity cases there will be no checks, and we get to keep any speed gains from g_poll().
From a first look #356544 seems like the safe initial option vs speed improvements. How about we push for this one initially and after a cycle so it gets proper testing after a major release we get another round to this? Also if you see that any of the other minor improvements are safe I would say to go for them.
Okay, i'll need signoffs for attachments 356540-356544. What about attachment 356539 [details] [review]? Should it go in? I've actually made it W32-only, after adding code that sends and receives messages. So that will either have to be fixed, or the test will have to be moved into W32-only section of the makefile.
Review of attachment 356540 [details] [review]: Looks good.
Review of attachment 356544 [details] [review]: Let's get this in since from the options it improves performance and at the same time it is quite safe.
About the patch for the unit test, I am fine getting it in being windows-only as far as we have it with the other win32-only unit tests.
Created attachment 356587 [details] [review] gio: add a simple gpoll performance test v3 v3: * Move the test into W32-only section of the makefile * Reflect that in the commit message
Review of attachment 356587 [details] [review]: Do you envision making this test platform independent? Otherwise I would say to remove all the ifdefs...
(In reply to Ignacio Casal Quinteiro (nacho) from comment #44) > Review of attachment 356587 [details] [review] [review]: > > Do you envision making this test platform independent? Otherwise I would say > to remove all the ifdefs... Unlikely to be useful. Okay, i'll get rid of all ifdefs. Any other nitpicks while i'm at it?
Created attachment 356708 [details] [review] gio: add a simple gpoll performance test for W32 v4 v4: * Convert to braindead glib indentation style * Remove all attempts at portability * Use a simplified W32 socket initialization function (stolen from gio) * Remove the [un]use of GIOChannels (turns out, i've already grabbed all the necessary bits from GIOChannel socket implementation) * Move the test from gio to glib (since it doesn't use gio anymore)
Review of attachment 356708 [details] [review]: Fine for me. Just fix the params to be one per line.
Created attachment 356723 [details] [review] gio: add a simple gpoll performance test for W32 v5 v5: * One param per line
Attachment 356723 [details] pushed as 425a9f5 - gio: add a simple gpoll performance test for W32 Attachment 356446 [details] pushed as d67b58a - glib/gpoll: Remove if conditional Attachment 356540 [details] pushed as cb2316a - glib/gpoll W32: use WFSOE() instead of SleepEx() Attachment 356541 [details] pushed as 1f3da92 - glib/gpoll W32: fold f->revents = 0 into for() loop Attachment 356542 [details] pushed as 2019779 - glib/gpoll W32: pass along GPollFD ptr for msg Attachment 356543 [details] pushed as 226ea94 - glib/gpoll W32: faster GPollFD lookup Attachment 356544 [details] pushed as b267f64 - glib/gpoll W32: trust WFMOE() return value
Not sure how to reproduce but today (also without your patches, using plain 2.52) I had this problem: WaitForMultipleObjectsEx failed: The handle is invalid. I wonder if we should detect such case and remove the handle from the list and do not try to poll anymore from it. I think the problem is that I launched a process which crashed, leaving an invalid handle from which GSubprocess was not able to poll from anymore. Thoughts?
There's no such thing as "invalid handle". As long as you have a handle, the object to which the handle is pointing is, in fact, valid. It might be in a state that you don't like, but it's valid, and you can wait for it (waiting on a handle of a process wakes up when the process is terminated). The only way to get invalid handle is to either make it up, or close it and forget to remove the GSource based on it. More than anything, this highlights the difficulty of tracking handles though. Back when i was trying to use GTK3-built Pidgin from git master, i kept running into handle limit exhaustion, and i couldn't figure out how so many handles ended up being polled, because there was no easy way to go from handles in the array to sources that caused these handles to appear in the array.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1282.