GNOME Bugzilla – Bug 687465
compositor thread blocks while opening the calendar menu
Last modified: 2012-11-27 17:49:59 UTC
Created attachment 227911 [details] log from strace -T -f -ttt -p $pid +++ This bug was initially created as a clone of Bug #687362 +++ The first time you open the calendar menu (i.e. click on the clock), the Shell's UI freezes for a moment. The second and subsequent time, it's much quicker. While watching it with strace, it goes for 363ms without the main thread re-entering the main loop at all, and 473ms to enter the main loop 5 times. These numbers are pessimistic, because of the tracing overhead, but the pause is noticeable even when not tracing.
Interesting events: Several times, the main thread does this: 1928 1351873227.032694 access("/usr/share/fonts/opentype/cantarell/Cantarell-Bold.otf", R_OK) = 0 <0.000039> (or the equivalent for Regular.) I haven't been able to track down which library wants to find out whether this font is still readable. I suspect it might be a tracing point analogous to <http://people.gnome.org/~federico/news-2006-03.html#timeline-tools> rather than something functional? Note that this is *after* login, so strings have already been drawn in this font (notably, my name and the word "Activities"). The particularly long iteration starts by doing an ioctl on the DRI device (line 3), then waking up the GWakeup for its own main-context (line 4). Next, it goes for 117ms without doing any I/O at all (lines 4-5). The next 40ms are spent manipulating the DRI device. It then goes for 26ms without touching the DRI device (CPU-bound?), in the middle of which it wakes up its own main-context. Next, it does some more quick DRI ioctls, then spends 14ms without doing any I/O (CPU-bound?), then spends < 1ms reading /etc/localtime. After pausing (CPU-bound?) for 4ms, it wakes up the GDBus thread to send a message of signature 'xxb'. They fight over a mutex for a short time (~1ms), then the GDBus thread sends the message on fd 7 while the main thread blocks for 1ms on a couple of IPC calls (probably X11) on fd 6. Next, there's a 4ms pause in I/O, a couple of ms spent stat()ing icon directories, and 53ms without I/O, other than a brk() for more memory. During that 53ms pause, the GDBus thread receives an a(sssbxxa{sv}), probably the method reply. It wakes up the main loop, although that won't have any effect for some time. It also wakes itself up, repeatedly - I'm not sure what's going on there - and for some reason closes one of the fds it was monitoring. Back in the main thread, it spends 55ms on DRI activity and reading SVG files - a total of about 18ms of this is reading SVG or stat()ing MIME-related files. The last 2ms of this over-long dispatch are spent blocking on fd 6 (X11?) and doing more DRI.
In the next iteration, the part that jumps out at me is at the end: 58ms between the main thread waking up its own main context, and the next I/O (reading from maybe-X11 to see if there are any more events, which there are not). Presumably this is CPU-bound on something. Bug #522314 can also be seen here. The next iteration is nice and short, but the one after spends 20ms doing DRI, with some multi-ms pauses between ioctls (CPU-bound, drawing things?). The last main loop iteration in the file (21ms) blocks on a probably-X11 round trip for nearly 10ms.
Created attachment 227917 [details] sysprof XML Unfortunately, sysprof-cli didn't recognise all of my build-ID-linked debug symbols, making this less useful than it might have been; I might have to recompile everything unstripped to get anywhere with this. :-( 25% of CPU usage was in libgnome-shell.so, so that's a good first place to look. 10% in the kernel, 10% in libc. Basically no time spent in "No map [/usr/bin/gnome-shell]" or "In file [heap]" (which I assume would be where the JIT'd Javascript code lives?), which is reassuring; 5% of the time was in things called from "No map", mostly ioctl() and mozjs. 4% in the Intel driver. cr_utils_is_white_space() looks like a potential hot-spot at 3.69% - which I find surprising, because it's a trivial switch(). Maybe it needs marking G_GNUC_CONST or inlining? string_in_list() in gnome-shell also looks as though it could benefit from some optimization - it's using cr_utils_is_white_space() in a really naive way. 2.68% inside clutter and 1.51% inside cogl, which isn't bad. 1.51% inside g_hash_table_lookup seems surprisingly much. 1.51% inside st_theme_node_get_type looks alarmingly high?
(In reply to comment #3) > string_in_list() in gnome-shell also looks as though it could benefit from some > optimization - it's using cr_utils_is_white_space() in a really naive way. Patch coming > 2.68% inside clutter and 1.51% inside cogl, which isn't bad. > > 1.51% inside g_hash_table_lookup seems surprisingly much. Where is it being called from? > 1.51% inside st_theme_node_get_type looks alarmingly high? Especially because it's marked as G_GNUC_CONST, like all the other get_type funcs. See if you can figure out what's going wrong there?
Created attachment 228178 [details] [review] st-theme: Optimize string_in_list Rather than using a complicated set of function calls across library boundaries and our own scanning logic, use strtok(), which glibc already provides, and is probably much more optimized. With this patch, I can finally see the calendar fade down. This could certainly be improved a lot more, by storing the tokenized string directly in the theme node. I have a patch cooking that does this, but I doubt it would save much more CPU time for the amount of code it would have to introduce, given that an immediate strdup/free is already plenty fast in glibc.
Review of attachment 228178 [details] [review]: I didn't measure whether or by how much the code is faster then the old one .. but it looks correct (as in does what it is supposed to do). Maybe a sysprof run with to verify it is worth a shot? Otherwise fine to commit.
Attachment 228178 [details] pushed as 73b4a0e - st-theme: Optimize string_in_list It could just be the placebo effect talking, but I certainly see a noticable improvement, especially when opening things like menus.
Sorry, I'm not seeing an improvement here. Profiling: I'm now seeing 20.18% of CPU time in strtok_r, if sysprof is to be believed. string_in_list is down to 1.44%, though. Blocking: the main thread is now blocked for 400ms in the worst iteration.
Created attachment 228255 [details] strace with that patch applied to Shell 3.6.1
Created attachment 228256 [details] corresponding sysprof This goes with Attachment #228255 [details].
We talked about this on IRC. I first suggested using strstr() and checking for a whitespace or boundry condition at the end points of the returned result, then Jasper reiterated his suggestion in comment 5 to cache the tokenized strings, which is probably better.
I think it might be worth cloning a separate bug for the style-assignment bits, or for the rest of what I originally reported: I'm not sure how large a proportion of the ~400ms pause they actually account for, and there is certainly blocking I/O going on here too. Regarding styles specifically: I'll try to capture a trace of what happens in terms of assigning styles to elements so we can see what happens, and have a chance to benchmark it "controllably" without all of Shell running. It seems as though at the moment, we iterate over all rules asking "does this rule match this element?", where the definition of matching involves splitting a whitespace-delimited string of (pseudo-)classes. Computing the list of (pseudo-)classes once rather than re-tokenizing the string every time would obviously help, but I wonder whether it might also make more sense to have hash tables { class => [rule, ...] } and { pseudo-class => [rule, ...] }?
Created attachment 228415 [details] [review] Store CSS classes and pseudo-classes pre-split in the StThemeNode Tokenizing every time we want to evaluate CSS seems a poor plan.
Created attachment 228416 [details] [review] Use pre-split element classes and pseudo-classes to optimize matching In my testing this cuts the longest time to dispatch(), when showing the calendar menu for the first time, from 604 to 442 milliseconds, while reducing additional_selector_matches_style() from 32% to 13% of CPU time used.
Created attachment 228417 [details] [review] st_theme_node_equal: use split (pseudo-)classes This is a prerequisite for getting rid of the unsplit versions.
Created attachment 228418 [details] [review] Remove unsplit element classes and pseudo-classes
Created attachment 228419 [details] strace log from commit 644c210 as a baseline This is not directly comparable to my earlier testing (I'm now using jhbuild, and was accidentally not using -O), but it is directly comparable to what I'm about to attach. Slow dispatch() calls: 22, 604, 93, 31 ms.
Created attachment 228420 [details] strace log with the patches above, with everything else the same Same conditions as Attachment #228419 [details]. Slow dispatch() calls: 37, 442, 68, 30 milliseconds. There's more to be done here - 442ms is still rather slow - but it's a good start.
Review of attachment 228416 [details] [review]: ::: src/st/st-theme.c @@ +458,1 @@ + for (i = 0; list[i] != NULL; i++) This is usually done as: gchar **it; for (it = list; *it != NULL; it++) // ... But I'm fine with this.
Review of attachment 228415 [details] [review]: ::: src/st/st-theme-node.c @@ +149,3 @@ +static GStrv +split_on_whitespace (const gchar *s) g_strsplit_set ?
Review of attachment 228417 [details] [review]: Yeah, this is why I didn't cache the split versions. I sort of wish we had a g_strequalv, but I guess this will have to do.
Review of attachment 228418 [details] [review]: Sure.
(In reply to comment #20) > Review of attachment 228415 [details] [review]: > > ::: src/st/st-theme-node.c > @@ +149,3 @@ > > +static GStrv > +split_on_whitespace (const gchar *s) > > g_strsplit_set ? I could do that if you prefer, but if I'm reading its documentation correctly, I'd have to filter out empty elements afterwards? (For instance: class="badger mushroom", with three spaces, would split to ["badger", "", "", "mushroom"], afaics)
Profiling highlights after those patches, having recompiled everything with -O2 (cutting the longest dispatch down to 377ms). _st_theme_get_matched_properties() is still taking 21% of CPU time (cumulative), with these hot spots: add_matched_properties 3.78% on its own sel_matches_style_real 1.99% on its own additional_selector_matches_style 3.39% on its own strlen 1.20% (is strqcmp really faster than strncmp?) g_type_from_name (as called by sel_matches_style_real) 5.18% (cumulative) st_theme_node_get_type 1.79% (it's all those ST_IS_THEME_NODE checks...) st_theme_node_get_element_classes 1.00% (could do with caching this?) string_in_list 1.20% (still noticeable!)
Created attachment 228499 [details] [review] Use pre-split element classes and pseudo-classes to optimize matching --- Modified for coding style: iterating via gchar **it
Created attachment 228500 [details] [review] Remove unsplit element classes and pseudo-classes --- Now correctly removes them from the .h as well.
Created attachment 228501 [details] [review] Keep similar theme nodes so we don't have to recompute CSS so often Because we calculate and cache CSS properties once per StThemeNode, and only a certain set of attributes can affect the CSS properties, it's advantageous for as many widgets as possible to share a single StThemeNode. Similarly, if a widget changes state and then changes back (e.g. gaining and losing the :hover pseudo-class), it should ideally get its original StThemeNode back again when it returns to the old state. Here, I'm using the StThemeContext as the location for a cache. StThemeNodes are currently never freed: this seems OK for Shell's usage (a finite number of IDs, classes, pseudo-classes and types). --- If an infinite cache is undesirable, I suppose we could do some sort of periodic GC (evict the cache and start again every few minutes?). The longest dispatch is down to 277ms (better, but still long...) in what I believe to be the same conditions as Comment #24. _st_theme_get_matched_properties is no longer such a hot spot (3.5% cumulative).
btw. this https://bugzilla.gnome.org/show_bug.cgi?id=627083#c9 is probably worth mentioning in this context.
(In reply to comment #27) > Here, I'm using the StThemeContext as the location for a cache. > StThemeNodes are currently never freed: this seems OK for Shell's usage > (a finite number of IDs, classes, pseudo-classes and types). Does this seem OK to the Shell maintainers or should I try to do something more elaborate? (In reply to comment #28) > btw. this https://bugzilla.gnome.org/show_bug.cgi?id=627083#c9 is probably > worth mentioning in this context. Thanks, that's very relevant. I'll see whether I can rebase Owen's unmerged patch, but it does look like a lot more complexity than "have a cache, so the expensive computation only happens once per distinct StThemeNode". > It seems that if we add a bunch of children to a container, each of them is > transiently the last child of the container, and this triggers a transition > from the style with :last-child to the style with :last-child. In this situation, my last patch here will use at most four StThemeNodes - first-and-last child (for the first item added), first child, intermediate child and last child - with the resulting set of style properties only computed once for each of those four StThemeNodes. Without my last patch here, you'll end up with a duplicate StThemeNode for each intermediate child, and each of those will have its own (identical, but redundantly computed) set of style properties.
With lighter-weight profiling (no sysprof, no strace, the GLib patch from Bug #588139) and the patches from this bug, opening the calendar menu for the first time takes 148ms, and subsequent times take about 80ms.
Review of attachment 228499 [details] [review]: Sure. ::: src/st/st-theme.c @@ +458,3 @@ + for (it = list; *it != NULL; it++) + { + if (!strqcmp (*it, stryng->str, stryng->len)) It's probably not worth it, but maybe you should check if strqcmp is "better" than strncmp. It's possible that higher optimization levels make gcc recognize the case where we don't need a return value, and call a variant of the function or something.
Review of attachment 228500 [details] [review]: Sure.
Review of attachment 228501 [details] [review]: StThemeNodes really should get destroyed at some point; it's possible for them to be in use for a small amount of time. It might be possible to keep a counter of StThemeNodes, and we periodically GC: of all the unused nodes (ref count 1), check if the timer was less than 500 times ago. This might be a bit fancy for no gain, though. Code looks fine, but an idea in the comments. Push these for now, we'll get back to it later, hopefully. ::: src/st/st-widget.c @@ +633,3 @@ + context = st_theme_context_get_for_stage (stage); + tmp_node = st_theme_node_new (context, parent_node, priv->theme, It feels a bit strange to have to create a temporary node. Given how StThemeNode doesn't use any GObject features, I wonder if it makes sense to make StThemeNode a simple ref-counted boxed type, which should hopefully be lighter.
Review of attachment 228415 [details] [review]: Darn, you're right, sorry. Good to go.
(In reply to comment #33) > It feels a bit strange to have to create a temporary node. Given how > StThemeNode doesn't use any GObject features, I wonder if it makes sense to > make StThemeNode a simple ref-counted boxed type, which should hopefully be > lighter. Yeah, I wondered about this. In a way, I think what we're really implementing here is a map { collection of attributes => collection of CSS properties } - except that the collection of attributes "owns" its collection of CSS properties, in an object-oriented way. (Is there a better word than "attribute" for the classes, IDs, etc. that can affect styling?) The CSS matching is currently done lazily, when we need it. I haven't done enough rummaging through the Javascript yet to work out whether this is feasible, but if the StWidgets are created before they're made visible, it could perhaps be done from an idle handler or something while the Shell is otherwise idle? Then we wouldn't be saving up so much work for when the widgets become visible. (In reply to comment #33) > StThemeNodes really should get destroyed at some point; it's possible for them > to be in use for a small amount of time. GC'ing them sounds like the only way to do this. (In reply to comment #31) > It's probably not worth it, but maybe you should check if strqcmp is "better" > than strncmp. Small stupid benchmark (just opening the calendar menu once each, on my test laptop): strqcmp 131ms, strncmp 135ms, strcmp (and removing the length parameter) 128ms. So they're pretty close, and quite possibly within the margin of measurement error. Does stryng->len ever actually matter? The string is 0-terminated anyway (GString always has a \0 at data[len], beyond the declared length) so if it doesn't improve performance, I don't think we need it?
(All patches so far merged.)
(In reply to comment #35) > Does stryng->len ever actually matter? The string is 0-terminated anyway > (GString always has a \0 at data[len], beyond the declared length) so if it > doesn't improve performance, I don't think we need it? No. I think the logic was that otherwise strcmp would have to check the length..
Simon, the last patch dc2ec0a.. about caching the nodes is breaking the rounded borders somehow, see the attached screenshot.
Created attachment 229035 [details] Rounded borders are eaten The rounded borders are not rendered if you go to the overview twice. Just restarting the shell + pressing thrice the overlay key is enough to get to that point.
Created attachment 229064 [details] [review] st_theme_node_copy_cached_paint_state: allow self-assignment If you copy a theme node's paint state into itself, it should be an inexpensive no-op. What actually happened was that we destroyed the old paint state, re-initialized to blank, then copied the blank state back into itself. In the process, we lost (for instance) the textures for rounded corners. Until I introduced the texture cache, this never actually happened, because when st_widget_recompute_style() calls st_widget_get_theme_node(), we'd always get a fresh theme node. Now, we get a theme node T back from the cache, notice that paint_equal(T, T) is true, short-circuit slightly by copying its drawing state into itself, and destroy drawing state that we still needed. I'm going to fix this in recompute_style() too, but as a general principle, self-assignment ought to be harmless.
Created attachment 229065 [details] [review] st_widget_recompute_style: short-circuit if the theme node is the same It appears to be somewhat common for st_widget_style_changed() to be called when no style-relevant attributes have, in fact, changed. Now that we cache theme nodes, we're likely to get the same theme node back from the cache. If we do, we don't need to waste time asking whether its geometry and painting are equal to itself: we can just note that nothing really changed and get on with our lives.
Created attachment 229066 [details] [review] StThemeNode: add a trivial-case short cut to all comparisons for equality
(In reply to comment #39) > The rounded borders are not rendered if you go to the overview twice. Either of Attachment #229064 [details] or Attachment #229065 [details] should be enough to fix this, but for least-astonishment, I would suggest having both. There might be a minor speed improvement too, but I haven't profiled it.
These looks fine too. ACN.
Comment on attachment 229066 [details] [review] StThemeNode: add a trivial-case short cut to all comparisons for equality Thanks, committed as 08d2ca3, 4fc6a80, 5203687.
(In reply to comment #45) > (From update of attachment 229066 [details] [review]) > Thanks, committed as 08d2ca3, 4fc6a80, 5203687. It seems those 3 commits solved the borders issue, but introduced occasional crashes while opening the overview. See https://bugzilla.gnome.org/show_bug.cgi?id=688640 for a trace of this crash.
This crash has been fixed again, thanks to Giovanni.