GNOME Bugzilla – Bug 95802
metacity performance
Last modified: 2004-12-22 21:47:04 UTC
I performed a test where I started up a desktop, launched 8 gnome-calculators in 3 different workspaces, and then switched between the 3 workspaces 10 times (a total of 30 switches). Performance analysis with the Forte Analysis programs showed that metacity spent 10.58 user seconds and 5.610 system seconds in this test. 8.260 user seconds were spent in meta_frames_expose_event. Most of this time is spent in a static function called by pixops_scale.

meta_frames_expose_event        8.260 user, 0.820 system
meta_frames_paint_to_drawable   8.260 user, 0.820 system
meta_theme_draw_frame           7.950 user, 0.820 system
meta_frame_style_draw           7.940 user, 0.820 system
meta_draw_op_list_draw          7.890 user, 0.820 system
meta_draw_op_draw_with_env      7.870 user, 0.820 system
draw_op_as_pixbuf               5.410 user, 0.190 system
scale_and_alpha_pixbuf          5.400 user, 0.190 system
gdk_pixbuf_scale_simple         5.490 user, 0.190 system
pixops_scale                    5.340 user, 0.190 system
static function                 5.340 user, 0.190 system

[ note gdk_pixbuf_scale_simple is also called by scaled_from_pixdata, which accounts for 0.100 user time ]

Not sure if this code has opportunities for tuning, but this is where metacity is spending most of its time when switching desktops.
This is going to be theme-dependent; for a theme with no pixmaps, it wouldn't show up.
Understandable. Still, it would be nice if metacity were a bit more snappy at dealing with themes with pixmaps. Why is it necessary to rescale the images more than once? For example, the pixmap used in the titlebar is the same scale for all windows in all workspaces. Couldn't it be scaled one time, and then cache the scaled image for reuse? Obviously it would need to be rescaled if a user preference were changed that affected the pixmap sizes. So it might be also necessary to cache such preferences and rescale if any has changed. Or am I completely misunderstanding the situation? Or are themes with pixmaps going to be such a rarity that optimizing their behavior in metacity is not really useful?
I've verified with the Forte analyzer that switching my theme to Atlanta removes pixops_scale from being a significant contributor to the time spent in metacity.
I ran a metacity test where I compared continually resizing a window for 3 minutes with the Crux theme vs. the Atlanta theme.

Overall time with Crux: 16.270 user, 8.550 system
(of which 0.480 user and 6.570 system was spent in _poll)
Overall time with Atlanta: 5.880 user, 8.620 system
(of which 0.440 user and 6.960 system was spent in _poll)

Using Crux, 8.390 user and 0.320 system was spent in the same stack trace as with switching workspaces.

I didn't give similar overall times for my previous workspace-switching test, so that information follows for comparison. In the switching workspace test the times were:

Overall time with Crux: 10.580 user, 5.610 system
(of which 0.150 user and 3.930 system was spent in _poll)
Overall time with Atlanta: 6.270 user, 7.550 system
(of which 0.280 user and 6.120 system was spent in _poll)
It'd be nice to optimize, sure. However a pixmap theme is always going to have more overhead than a regular theme (that's why metacity supports non-pixmap themes). It doesn't do caching because 1) it would use a ton of RAM and 2) it would not help for resizing, because the size keeps changing. Maybe the workspace switch is more important, dunno. Also, it's possible that the MMX stuff on x86 reduces the impact of the scaling. Note that metacity-theme-viewer will time the theme being viewed for you, to give an idea how fast/slow it is.
Thanks for your comments. Aren't there still opportunities for improving the speed? The code is spending a lot of time resizing the icons while the user is in the process of resizing a window. I am not sure this is really necessary. Couldn't the resizing be delayed until the user is actually finished with the resize operation? Or at least make it possible to tune the code to do the resizing less often during a resize?

I still suspect that caching the resized icons would be useful, not just for switching workspaces and de-iconifying windows, but also for some window resize operations. In other words, it is not necessary to resize the x-axis icons if only the y-axis is being resized, and vice versa. Some icons (like corners and the titlebar buttons) should never need to be resized, so using a cached version of icons that don't resize seems appropriate. Or perhaps the code is already smart enough to only resize those icons that need to be resized? Are there other opportunities to avoid resizing by just being smarter about identifying the need to do so?

Also, do you suspect that there may be opportunities to make gdk_pixbuf_scale_simple() more efficient?

I realize that this isn't so much of a performance issue running on Intel (perhaps because of MMX). However, the delays during window resizing and workspace switching are very noticeable and are often complained about on Solaris. Therefore, we would be delighted to address any issues that would improve performance in these areas. I'm mostly looking, at this point, for advice about the best ways to approach tuning this behavior. I am hoping that we don't just have to grin and bear these performance problems. :)
> Couldn't the resizing be delayed until the
> user is actually finished with the resize operation?

That would look pretty ugly.

> Or at least
> make it possible to tune the code to do the resizing less often
> during a resize?

It's already rate-limited. You could tune the rate to which it's limited, though using less CPU just for the sake of less CPU seems pointless; you really want to update as fast/often as you can when resizing, I would think.

I personally doubt there are many opportunities to benefit from caching during resize; most resizes are in both dimensions. A cache may help more during workspace switching, but to help you'd need to cache all pixbufs for all windows, and you will just trade speed complaints for memory complaints.

Maybe the answer to speed complaints is to use a faster theme by default?

scale_simple can probably be sped up a bit, but it's going to be a matter of fooling around with inner loops and assembly and stuff like that, most likely.

It's probably also possible to tweak Crux itself to use fewer or smaller pixmaps, if you were so inclined, and maybe willing to change the design somewhat if required. I dunno, you'd have to experiment.
Suggestion in http://bugzilla.gnome.org/show_bug.cgi?id=95520 may reduce expose events when switching spaces and improve things a bit. It would only matter if your gnome-calculators were overlapping each other though.
Looking more closely at the scaling code, I notice that resizing is happening in other situations where caching could speed things up.

1. When a window is moved or resized over another window, the window in back constantly resizes its borders to meet the window in front. So if window A moves across the left-side border of window B, the top and bottom of that left-hand border are resized to meet the edges of window A. Obviously this resize happens over and over again while window A is being moved. Caching would completely avoid the need to rescale here.

2. When a window gains or loses focus. I suspect that this is because different graphics are used to indicate the "selected" window. Caching both the "selected" and "unselected" graphics would eliminate the need for resizing all the icons.

[ and the following two points which have been discussed already ]

3. When a window is resized. Caching would benefit some situations, specifically when the window is resized only in the x or y dimension.

4. Some icons only need to be rescaled when the font used on the titlebar changes (e.g. buttons). Caching would avoid rescaling them at any other time.

It seems the main reason for not caching these images is that it would use so much memory. However, don't the scaled borders that are displayed on the screen take up memory anyway? Some benefit (to 1, 3, and 4) would be gained from simply checking whether the current image needs to be refreshed at all, and avoiding the rescale completely when it does not.

Are the scaled images so large that they would create a memory usage problem if they were cached? While I realize that to support such caching, every window on the screen would need to cache its own borders, these scaled images are typically quite small, aren't they? Also, only those windows that are not iconified would really get the most benefit from such caching, so it would perhaps be acceptable to throw out the cache when a window becomes iconified.
This might help keep the caching from eating up too much memory. Just tossing out some ideas based upon my better understanding of exactly when resizing is happening.

Your suggestion of tuning the pixbuf scaling that is done when a window is resized might be useful to us. These operations are sluggish on Solaris, so I don't think it is a matter of saving CPU just for the sake of saving CPU. It is an attempt to address the sluggishness, especially on lower-end Sun hardware (Ultra 10/60/etc).

I am currently working to implement pixbuf scaling code that uses the VIS architecture on Sparc chips (similar to MMX on Intel). This is coming along nicely and seems to speed up the Crux theme by twofold (according to the metacity-theme-viewer timings). While this is an important improvement, it doesn't completely address the sluggishness problem. So we may have to also do some other tuning to get the behavior to an acceptable level.
> Are the scaled images so large that they would create a memory usage
> problem if they were cached? While I realize that to support such
> caching, every window on the screen would need to cache its own
> borders, these scaled images are typically quite small aren't they?

Here is the math as I see it. Assume a 400x400 window, with 40-pixel titlebar and 10-pixel sides, and a theme which has one pixbuf scaled to cover each of the 4 edges.

400*400 - 360*380 = 23200 pixels
23200 pixels * 4 bytes per pixel = 92800

So 90K for a 400x400 window. A larger window more, a smaller window less. Some themes might use less than that, if they, say, only use pixmaps for a part of the frame area; some might use more, if for example they overlap pixmaps.

Some cheesy math: 100 windows = 9 megs of cache, so 10 windows is 1 meg of cache. Whether that's OK I don't know.
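The estimate above can be reproduced with a small helper. This is only a sketch of the arithmetic in the comment; the function name and parameters are hypothetical and are not part of metacity.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to cache the scaled frame pixels of one window.
 * Follows the arithmetic in the comment: frame area is the outer
 * window area minus the client area, times bytes per pixel.
 * Hypothetical helper, not metacity API. */
size_t
frame_cache_bytes (size_t outer_w, size_t outer_h,
                   size_t client_w, size_t client_h,
                   size_t bytes_per_pixel)
{
  /* The client area must fit inside the outer window. */
  assert (client_w <= outer_w && client_h <= outer_h);
  return (outer_w * outer_h - client_w * client_h) * bytes_per_pixel;
}
```

With the figures from the comment, frame_cache_bytes (400, 400, 360, 380, 4) gives 92800 bytes per window, so 100 such windows come to 9280000 bytes, which matches the "9 megs" estimate.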
Owen points out another possible optimization that will probably fix Crux. The code for this can be copied from the pixbuf engine in gtk-engines. The idea is to optimize scaling pixbufs that are just a bunch of vertical or horizontal stripes. When we load each pixbuf, we analyze whether all the pixels in each row are the same. If so, we scale by taking the first pixel of each row and copying it over and over, instead of using gdk_pixbuf_scale. Similarly, if all the pixels in each column are the same, we scale by memcpy()'ing the first row to each subsequent row in the new image. The code in pixbuf-engines assumes we're scaling in only a single direction. To handle Crux we should handle scaling in both directions, as follows. If a pixbuf is made up of horizontal stripes, we first scale a single-pixel column vertically using gdk_pixbuf_scale, then we copy the pixels from that into the destination image's rows. If a pixbuf is made up of vertical stripes, we scale the first row horizontally using gdk_pixbuf_scale(), then memcpy the result to each row of the destination image. Crux has quite a few stripey images, so this should speed it up substantially. At that point if we still need the cache, we can cache only non-stripey images.
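The stripe idea described above might look roughly like the following. This is a sketch under stated assumptions, not the code from gtk-engines or the patch that went into metacity: pixels are assumed to be tightly packed 32-bit values with no rowstride padding, all function names are hypothetical, and the one-dimensional scaling step uses plain nearest-neighbour sampling where the comment calls for gdk_pixbuf_scale.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if every pixel in a row equals that row's first pixel,
 * for all rows (the image is made of horizontal stripes). */
int
is_horizontally_striped (const uint32_t *px, int w, int h)
{
  for (int y = 0; y < h; y++)
    for (int x = 1; x < w; x++)
      if (px[(size_t) y * w + x] != px[(size_t) y * w])
        return 0;
  return 1;
}

/* Returns 1 if every row is identical to the first row
 * (the image is made of vertical stripes). */
int
is_vertically_striped (const uint32_t *px, int w, int h)
{
  for (int y = 1; y < h; y++)
    if (memcmp (px, px + (size_t) y * w, (size_t) w * sizeof *px) != 0)
      return 0;
  return 1;
}

/* Horizontal stripes: pick one source row per destination row
 * (nearest neighbour), then repeat that row's first pixel across. */
void
scale_horizontal_stripes (const uint32_t *src, int sw, int sh,
                          uint32_t *dst, int dw, int dh)
{
  assert (sw > 0 && sh > 0 && dw > 0 && dh > 0);
  for (int y = 0; y < dh; y++)
    {
      uint32_t p = src[(size_t) (y * sh / dh) * sw];
      for (int x = 0; x < dw; x++)
        dst[(size_t) y * dw + x] = p;
    }
}

/* Vertical stripes: scale the first row horizontally once (nearest
 * neighbour), then memcpy it to every other destination row. */
void
scale_vertical_stripes (const uint32_t *src, int sw, int sh,
                        uint32_t *dst, int dw, int dh)
{
  assert (sw > 0 && sh > 0 && dw > 0 && dh > 0);
  for (int x = 0; x < dw; x++)
    dst[x] = src[x * sw / dw];
  for (int y = 1; y < dh; y++)
    memcpy (dst + (size_t) y * dw, dst, (size_t) dw * sizeof *dst);
}
```

The two checks would run once when a pixbuf is loaded, with the results stored as flags next to it, so each draw only pays for the cheap copy path. Images that fail both checks would still go through the general scaler.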
Thanks for your & Owen's comments. We're looking into making a Metacity-specific scaling function to speed things up. I've been thinking about this and have a few more thoughts to share:

1. Note that moving/resizing as wireframes would eliminate all of the time spent in resizing border pixmaps during move/resize or when a window is moved over another window. Perhaps a good reason to support wireframe move/resize.

2. Would it be possible to cache the height/width of a given pixmap and not redraw that border element to the frame buffer if the size hasn't changed? This would speed the situations where windows are de-iconified, workspaces are switched, and when a window is resized in just 1 dimension. It would also benefit a 2-dimension resize for some of the pixmaps (like corners and other elements that do not actually change size). I suspect there might be a bit of work involved with this since some icons are on top of other icons (like buttons) and if the underlying graphic changes, then the icon would need to be redrawn even if it doesn't change size. Still, I suspect this would eliminate quite a lot of rescaling when not needed. Caching just the width/height (as integers) for each border image would be much less heavy on memory than caching the actual scaled images.
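The bookkeeping behind point 2 could be as small as the sketch below. The type and function names are hypothetical, not metacity code. It shows only the size comparison: it deliberately ignores the overlapping-icon case raised in the comment, and it assumes the pixels drawn last time are still valid on screen.

```c
#include <assert.h>

/* Size at which one border piece was last drawn; 0x0 means never.
 * Hypothetical type, for illustration only. */
typedef struct
{
  int width;
  int height;
} PieceSize;

/* Returns 1 if the piece must be rescaled and redrawn because its
 * requested size differs from the recorded one, and records the new
 * size. Returns 0 if the size is unchanged. */
int
piece_needs_redraw (PieceSize *last, int width, int height)
{
  assert (last != 0 && width > 0 && height > 0);
  if (last->width == width && last->height == height)
    return 0;
  last->width = width;
  last->height = height;
  return 1;
}
```

This costs two integers per border piece, which is the memory advantage the comment points out over caching the scaled images themselves.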
*** Bug 96042 has been marked as a duplicate of this bug. ***
#2 is quite hard to implement I think. I can't think of how offhand.

Let's do the optimizations with no extra memory or UI cost, and then if there's still a problem bad enough to worry about, we can evaluate the cost/benefit of optimizations that do carry a cost. Right now all our time is in pixops_scale, and the stripey-pixbuf optimization should almost eliminate that time for any theme that looks good when scaled. (Themes that are scaling non-stripey pixbufs look really bad anyway - indeed Crux does, if it has to scale its nonstripey bits. We should be able to avoid scaling those bits at least at the default font size.)
Batch adding GNOME2 keyword to Metacity bugs. Sorry for the spam.
Here's some more profiling data from oprofile on Linux, which is a whole system profiler. Profile is of a screen with nautilus, panel, and several terminals; one terminal is resized quickly for a long time, causing exposes for the other three terminals and their window frames, and for nautilus, and updating the pager applet etc.

CPU used per process, the second column is percentage, first column is absolute sample count.

 2271    1.3186   0.0000   /usr/bin/nautilus
 3323    1.9294   0.0000   /usr/bin/gnome-panel
25022   14.5283   0.0000   /usr/bin/gnome-terminal
31812   18.4708   0.0000   /usr/X11R6/bin/XFree86
40654   23.6046   0.0000   /usr/bin/metacity
66949   38.8721   0.0000   /boot/vmlinux-2.4.18-14smp

I interpret this to mean that we're spending 40% time with the kernel doing context switches and pushing network data between the panel, nautilus, X server, and 4 terminals.

I'll attach the per-function profile, which basically agrees with yours (time spent scaling pixbufs as far as metacity is concerned).
Created attachment 11922: profile by function
The profile is essentially unchanged if I remove the three terminals other than the one being resized, or if I profile workspace switching instead of resizing. The conclusion I draw from that is that the slow part is redrawing, not resizing in particular. I guess this was obvious.
Turns out the kernel time isn't context switching or network, it's just:

c0133c40 10952 6.61133 __constant_c_and_count_memset /boot/vmlinux-2.4.18-14smp
(this is 7% in brk() or mmap(), i.e. memory allocation from userspace malloc() usage)

c01070d0 39900 24.0862 default_idle /boot/vmlinux-2.4.18-14smp
(this is doing nothing, just sitting around waiting for an interrupt, I suppose that may mean waiting for IO to complete or something)
The patch to optimize stripey pixbufs that went in on 11/04 has made a huge difference, bringing the time reported by metacity-theme-viewer down from around 2.5 seconds to 1.10 seconds (or 1.05 with mediaLib).

I ran the Forte Performance Analyzer tool against metacity-theme-viewer to see where time is now being spent. The tool lets you add up the performance of multiple runs, so I ran metacity-theme-viewer 10 times. These timings are therefore the sum of 10 separate runs: 20.100 user time, 45.540 wait time, 2.580 system time (12.510 seconds of the wait time was in _poll).

11% of user time (2.220 user seconds and 0.010 system seconds) was spent in:

gdk_pixbuf_render_to_drawable_alpha
_gdk_draw_pixbuf
gdk_drawable_real_draw_pixbuf
composite_0888

2.3% of user time (0.480 user/0.520 Wall/0.040 system) was spent here:

gdk_pixbuf_render_to_drawable_alpha
_gdk_draw_pixbuf
gdk_drawable_real_draw_pixbuf
gdk_draw_rgb_image_dithalign
gdk_draw_rgb_image_core
gdk_rgb_convert_0888

12.4% of user time and 58% of system time (2.500 user seconds, 12.140 wait seconds and 1.520 system seconds) were spent in:

gdk_pixbuf_render_to_drawable_alpha
_gdk_draw_pixbuf
gdk_drawable_real_draw_pixbuf
_gdk_drawable_copy_to_image
_gdk_x11_copy_to_image
XGetSubImage

XGetSubImage spent its time in XGetImage (1.430 user/1.680 Wall), _XSetImage (1.030 user/10.420 Wall), and a little bit in _XDestroyImage (0.020 user/0.020 Wall). _XSetImage spent roughly equal time in _XPutPixel32 and _XGetPixel32.

Here's the detail on XGetImage. XGetImage spent its time in these functions:

_XReply       (0.510 user/9.510 Wall)
malloc        (0.140 user/0.150 Wall)
XCreateImage  (0.120 user/0.130 Wall)

_XReply spent 0.270 user/6.440 Wall in _XFlushInt and 0.270 user/6.900 Wall in _XRead.

To further speed up metacity, these seem to be the areas that would most benefit. I would appreciate any suggestions regarding where time would be best spent, or ideas of how to best approach these areas.
Perhaps getting the draw functions to make more use of MMX on Intel and mediaLib on Sparc might make a difference?
It looks like the main problem now is latency for GetImage requests. This is sucking down pixels for alpha compositing, due to lack of the Xrender extension. So the client is spending a lot of time waiting for replies from the server with the GetImage reply data.

The main client-side CPU usage now is apparently actually doing the compositing, and converting from pixbuf format to display format. Those things are going to be hard to speed up but don't seem like a huge problem anyhow; they are much smaller than the latency problem.

The first thing I would do is look at how many of the images in Crux actually have an alpha channel, and how many are fully opaque. If fully opaque, I would be darn sure gdk_pixbuf_get_has_alpha() == FALSE at the time that we draw the final scaled pixbuf for those images. If the pixbuf has no alpha channel, GDK should not be doing the GetImage stuff (though possibly it still is; if so it should be fixed).

For things that have alpha, if the alpha is 1-bit alpha, we can potentially record that fact alongside the vertical_striped etc. flags, and for one-bit alpha pixbufs use render_pixmap_and_mask to draw the pixbuf with a bitmask, instead of alpha compositing it. That might help.

If all the images currently are using full alpha compositing, then probably we have to tweak the theme to reduce the alpha usage.
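Sorting images into fully opaque, 1-bit alpha, and full alpha, as suggested above, only needs one pass over the alpha bytes. The sketch below is an illustration and not metacity or GDK code: the enum and function names are hypothetical, and it assumes tightly packed 8-bit RGBA data with alpha as the fourth byte and no rowstride padding.

```c
#include <assert.h>
#include <stddef.h>

/* How much alpha an image really uses. Hypothetical names. */
typedef enum
{
  ALPHA_OPAQUE,   /* every alpha byte is 255 */
  ALPHA_ONE_BIT,  /* alpha bytes are only 0 or 255 */
  ALPHA_FULL      /* at least one partially transparent pixel */
} AlphaKind;

/* Classify a packed RGBA buffer of n_pixels pixels by its alpha use. */
AlphaKind
classify_alpha (const unsigned char *rgba, size_t n_pixels)
{
  AlphaKind kind = ALPHA_OPAQUE;

  assert (rgba != NULL || n_pixels == 0);
  for (size_t i = 0; i < n_pixels; i++)
    {
      unsigned char a = rgba[i * 4 + 3];

      if (a == 255)
        continue;             /* opaque pixel, no change */
      if (a == 0)
        kind = ALPHA_ONE_BIT; /* fully transparent pixel seen */
      else
        return ALPHA_FULL;    /* partial transparency, stop early */
    }
  return kind;
}
```

The result could be stored once at load time next to the stripe flags. An ALPHA_OPAQUE image could then be drawn without an alpha channel, and an ALPHA_ONE_BIT image with a bitmask, leaving full compositing for ALPHA_FULL only.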
Okay, the stripey patch and making Crux a bit more flat has improved the performance of pixmaps substantially on Solaris. Together these improvements account for a 300% improvement based on the statistics returned from metacity-theme-viewer. Therefore, I think we can close this bug as being properly addressed.