GNOME Bugzilla – Bug 746468
gtkcairoblur is slow
Last modified: 2015-03-23 08:00:11 UTC
blurring keeps showing up in the profiles, lets make it faster.
Created attachment 299844 [details] [review] Add performance test for gtkcairoblur This just creates a large cairo surface and times bluring it at various values of radius.
Created attachment 299845 [details] [review] gtkcairoblur: Minor restructure This just moves get_box_filter_size to the top and makes it a macro (so it can be used as a constant later).
Created attachment 299846 [details] [review] gtkcairoblur: Unroll inner loop for common radius values This unrolls the inner blur loop for radius 1-10, allowing the compiler to use a divide-by-constant operation instead of a generic division. Here is the blur-performance output before: Radius 1: 124.95 msec, 32.01 kpixels/msec: Radius 2: 117.27 msec, 34.11 kpixels/msec: Radius 3: 123.57 msec, 32.37 kpixels/msec: Radius 4: 118.17 msec, 33.85 kpixels/msec: Radius 5: 119.32 msec, 33.52 kpixels/msec: Radius 6: 124.17 msec, 32.21 kpixels/msec: Radius 7: 121.04 msec, 33.05 kpixels/msec: Radius 8: 130.64 msec, 30.62 kpixels/msec: Radius 9: 119.47 msec, 33.48 kpixels/msec: Radius 10: 117.95 msec, 33.91 kpixels/msec: Radius 11: 122.38 msec, 32.68 kpixels/msec: Radius 12: 121.92 msec, 32.81 kpixels/msec: Radius 13: 125.45 msec, 31.89 kpixels/msec: Radius 14: 121.63 msec, 32.89 kpixels/msec: Radius 15: 120.18 msec, 33.28 kpixels/msec: And after: Radius 1: 42.26 msec, 94.65 kpixels/msec: Radius 2: 59.15 msec, 67.62 kpixels/msec: Radius 3: 60.29 msec, 66.35 kpixels/msec: Radius 4: 64.53 msec, 61.99 kpixels/msec: Radius 5: 60.07 msec, 66.59 kpixels/msec: Radius 6: 62.43 msec, 64.07 kpixels/msec: Radius 7: 60.36 msec, 66.27 kpixels/msec: Radius 8: 59.59 msec, 67.13 kpixels/msec: Radius 9: 76.17 msec, 52.51 kpixels/msec: Radius 10: 79.41 msec, 50.37 kpixels/msec: Radius 11: 118.92 msec, 33.64 kpixels/msec: Radius 12: 121.31 msec, 32.97 kpixels/msec: Radius 13: 118.30 msec, 33.81 kpixels/msec: Radius 14: 116.82 msec, 34.24 kpixels/msec: Radius 15: 116.99 msec, 34.19 kpixels/msec: I.e. almost double performance for the unrolled radius values.
Created attachment 299860 [details] [review] Add gtk_run_in_parallel helper This is a helper that schedules the same job in N copies on different threads, where N is limited by the number of CPUs (as well as user input). The last job is run on the originating cpu, and when that is done we block until all jobs are done. The idea here is that you want to split up your job in equally sized chunks, and run one chunk per processor, to maximize the amount of parallelism. The current implementation is pretty simple, using a shared GThreadPool and busy waiting for joining the jobs. A more complex implementation using custom threads and cpu affinity might have less overhead.
Created attachment 299861 [details] [review] gtkcairoblur: Use gtk_run_in_parallel to compute the blurs This splits the row-blurring parts into N_CPU horizontal strips and blurs each on a different CPU. blur-performance before patch: Radius 1: 37.02 msec, 108.05 kpixels/msec: Radius 2: 54.15 msec, 73.87 kpixels/msec: Radius 3: 53.32 msec, 75.02 kpixels/msec: Radius 4: 55.95 msec, 71.49 kpixels/msec: Radius 5: 54.13 msec, 73.90 kpixels/msec: Radius 6: 54.49 msec, 73.41 kpixels/msec: Radius 7: 53.86 msec, 74.27 kpixels/msec: Radius 8: 55.00 msec, 72.73 kpixels/msec: Radius 9: 67.52 msec, 59.25 kpixels/msec: Radius 10: 71.39 msec, 56.03 kpixels/msec: Radius 11: 106.47 msec, 37.57 kpixels/msec: Radius 12: 105.90 msec, 37.77 kpixels/msec: Radius 13: 105.72 msec, 37.84 kpixels/msec: Radius 14: 105.46 msec, 37.93 kpixels/msec: Radius 15: 107.28 msec, 37.29 kpixels/msec: after patch: Radius 1: 23.19 msec, 172.46 kpixels/msec: Radius 2: 34.17 msec, 117.06 kpixels/msec: Radius 3: 34.14 msec, 117.17 kpixels/msec: Radius 4: 34.16 msec, 117.11 kpixels/msec: Radius 5: 33.14 msec, 120.71 kpixels/msec: Radius 6: 35.54 msec, 112.54 kpixels/msec: Radius 7: 34.74 msec, 115.15 kpixels/msec: Radius 8: 34.45 msec, 116.10 kpixels/msec: Radius 9: 38.43 msec, 104.10 kpixels/msec: Radius 10: 42.32 msec, 94.52 kpixels/msec: Radius 11: 59.12 msec, 67.66 kpixels/msec: Radius 12: 58.30 msec, 68.60 kpixels/msec: Radius 13: 59.83 msec, 66.86 kpixels/msec: Radius 14: 57.89 msec, 69.10 kpixels/msec: Radius 15: 56.31 msec, 71.03 kpixels/msec:
Created attachment 299868 [details] [review] a quick demo (after some discussion on IRC) I wouldn't mind adding something vaguely like this to GLib if you think it would help here...
some more background on the IRC discussion: splitting into fixed-sized chunks is a bad idea due to the possibility that one chunk may finish before another. even tasks that 'seem' to be the same size can be subject to cache behaviour and contention with other processes in the system. i think a more dynamic approach to the solution would be better: one that hands out individual rows (or small chunks of rows) to be processed at a time.
Created attachment 299870 [details] demo with "chunk size" API here's an updated example that lets the user specify a chunk size. this makes it easier to handle the case of the last chunk (which may be smaller than the others) i'm not 100% convinced we want to encode this into the API (and it does add a tiny bit of extra work) but if we are going to have some API to facilitate this, I'm pretty sure this is what it would look like
Created attachment 299872 [details] [review] Add gtk_thread_pool_run_parallel helper This is a helper that schedules the same job in N copies on different threads, where N is limited by the number of CPUs (as well as user input). The last job is run on the originating cpu, and when that is done we block until all jobs are done. The idea here is that you want to split up your job in equally sized chunks, and run one chunk per processor, to maximize the amount of parallelism.
Created attachment 299873 [details] [review] gtkcairoblur: Use gtk_thread_pool_run_parallel to compute the blurs This splits the row-blurring parts into chunks of rows (at least 2000 pixels per chunk) and runs them in parallel. blur-performance before patch: Radius 1: 37.02 msec, 108.05 kpixels/msec: Radius 2: 54.15 msec, 73.87 kpixels/msec: Radius 3: 53.32 msec, 75.02 kpixels/msec: Radius 4: 55.95 msec, 71.49 kpixels/msec: Radius 5: 54.13 msec, 73.90 kpixels/msec: Radius 6: 54.49 msec, 73.41 kpixels/msec: Radius 7: 53.86 msec, 74.27 kpixels/msec: Radius 8: 55.00 msec, 72.73 kpixels/msec: Radius 9: 67.52 msec, 59.25 kpixels/msec: Radius 10: 71.39 msec, 56.03 kpixels/msec: Radius 11: 106.47 msec, 37.57 kpixels/msec: Radius 12: 105.90 msec, 37.77 kpixels/msec: Radius 13: 105.72 msec, 37.84 kpixels/msec: Radius 14: 105.46 msec, 37.93 kpixels/msec: Radius 15: 107.28 msec, 37.29 kpixels/msec: after patch: Radius 1: 23.19 msec, 172.46 kpixels/msec: Radius 2: 34.17 msec, 117.06 kpixels/msec: Radius 3: 34.14 msec, 117.17 kpixels/msec: Radius 4: 34.16 msec, 117.11 kpixels/msec: Radius 5: 33.14 msec, 120.71 kpixels/msec: Radius 6: 35.54 msec, 112.54 kpixels/msec: Radius 7: 34.74 msec, 115.15 kpixels/msec: Radius 8: 34.45 msec, 116.10 kpixels/msec: Radius 9: 38.43 msec, 104.10 kpixels/msec: Radius 10: 42.32 msec, 94.52 kpixels/msec: Radius 11: 59.12 msec, 67.66 kpixels/msec: Radius 12: 58.30 msec, 68.60 kpixels/msec: Radius 13: 59.83 msec, 66.86 kpixels/msec: Radius 14: 57.89 msec, 69.10 kpixels/msec: Radius 15: 56.31 msec, 71.03 kpixels/msec:
New version of the threaded blurring based on slightly modified (and with gtk_ namespace) version of ryans approach.
Created attachment 299950 [details] [review] shadow-box: Bail out blur early if radius is 1px For radius 1px the current implementation rounds down to a 1 px box filter which is a no-op. Rather than creating useless shadow masks in this case we bail out blurring early. Another alternative would be to make radius 1px round up to a 2 px box filter, but that would change the rendering of Adwaita which is probably not a great idea this late in the cycle.
Created attachment 299951 [details] [review] shadow-box: Blur only horizontally/vertically for the non-corner parts There is no need to e.g. blur in the x-direction for the top part of a box shadow. Also, there is no need to extend the mask in the non-blurred direction.
Created attachment 299952 [details] [review] box-shadow: For top/bottom and left/right parts, repeat a single line Since these part really are the same in all of the x or y direction and we don't blur in that direction we can just blur one line and repeat it during drawing.
With the last 3 patches and the first ones (not the parallel crack) I get shadow rendering at about 5% cpu time during widget-factory renderer, rather than 35% or so like it was before.
Created attachment 299953 [details] [review] shadow-box: Blur only horizontally/vertically for the non-corner parts There is no need to e.g. blur in the x-direction for the top part of a box shadow. Also, there is no need to extend the mask in the non-blurred direction.
Created attachment 299954 [details] [review] box-shadow: For top/bottom and left/right parts, repeat a single line Since these part really are the same in all of the x or y direction and we don't blur in that direction we can just blur one line and repeat it during drawing.
New version that fixed a leftover debug chunk.
Marked the threading crack obsolete. Its interesting code, but should not be needed anymore for shadows.
I compared screenshots of widget-factory with and without these patches, and found only a single pixel slightly different.
Review of attachment 299844 [details] [review]: sure
Review of attachment 299845 [details] [review]: ok
Review of attachment 299950 [details] [review]: ::: gtk/gtkcssshadowvalue.c @@ +316,3 @@ + + /* The code doesn't actually do any blurring for radius 1, as it ends up + with box filter size 1 */ Pet peeve of mine: I prefer /* This style * of comments. */
Review of attachment 299846 [details] [review]: please fix those cosmetics up while committing ::: gtk/gtkcairoblur.c @@ +81,3 @@ + sum += row[i]; \ + \ + if (i >= offset) \ Indentation hickup here: The if should be lined up with the one above. This is all inside the loop body. @@ +93,2 @@ + /* We unroll the values for d for radius 1-10 to avoid + a generic divide operation */ Prefer stars lined up here @@ +105,3 @@ + case get_box_filter_size(9): BLUR_ROW_KERNEL(get_box_filter_size(9)); + case get_box_filter_size(10): BLUR_ROW_KERNEL(get_box_filter_size(10)); + default: BLUR_ROW_KERNEL(d); And spaces before ( here
Review of attachment 299953 [details] [review]: Looks good ::: gtk/gtkcairoblur.c @@ +183,1 @@ { Should line up the parameters here
Review of attachment 299950 [details] [review]: I notice that _gtk_surface_blur also has a check for radius == 0, which could be similarly changed to <= 1
Review of attachment 299954 [details] [review]: ::: gtk/gtkcssshadowvalue.c @@ +330,3 @@ gboolean blur_x, + gboolean blur_y, + gboolean repeat) Three booleans in a row, the api is starting to get a bit ugly. Should we make that flags, just for readability ? gtk_css_shadow_value_start_drawing (shadow, cr, BLUR_X|REPEAT) is a lot more understandable than gtk_css_shadow_value_start_drawing (shadow, cr, TRUE, FALSE, TRUE)
Attachment 299844 [details] pushed as 9ba185b - Add performance test for gtkcairoblur Attachment 299845 [details] pushed as ae21e08 - gtkcairoblur: Minor restructure Attachment 299846 [details] pushed as d0dc1f5 - gtkcairoblur: Unroll inner loop for common radius values Attachment 299950 [details] pushed as 9c2a16f - shadow-box: Bail out blur early if radius is 1px Attachment 299953 [details] pushed as 967cb56 - shadow-box: Blur only horizontally/vertically for the non-corner parts Attachment 299954 [details] pushed as 8e03262 - box-shadow: For top/bottom and left/right parts, repeat a single line
I fixed this according to the comments and pushed to master, but we probably want them in 3-16 too. Question is, do we merge them now or wait until 3.16.1?
Attachment 299846 [details] fails to compile with clang 3.4: gtkcairoblur.c:97:10: error: expression is not an integer constant expression case get_box_filter_size (2): BLUR_ROW_KERNEL (get_box_filter_size (2)); ^~~~~~~~~~~~~~~~~~~~~~~ gtkcairoblur.c:39:44: note: expanded from macro 'get_box_filter_size' #define get_box_filter_size(radius) ((int)(GAUSSIAN_SCALE_FACTOR * radius)) ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ gtkcairoblur.c:37:39: note: expanded from macro 'GAUSSIAN_SCALE_FACTOR' #define GAUSSIAN_SCALE_FACTOR ((3.0 * sqrt(2 * G_PI) / 4)) ^
(In reply to Alexander Larsson from comment #30) > I fixed this according to the comments and pushed to master, but we probably > want them in 3-16 too. Question is, do we merge them now or wait until > 3.16.1? I'll cherry pick them after .0 is released.
(In reply to Ting-Wei Lan from comment #31) > Attachment 299846 [details] fails to compile with clang 3.4: > > gtkcairoblur.c:97:10: error: expression is not an integer constant expression > case get_box_filter_size (2): BLUR_ROW_KERNEL (get_box_filter_size (2)); > ^~~~~~~~~~~~~~~~~~~~~~~ > gtkcairoblur.c:39:44: note: expanded from macro 'get_box_filter_size' > #define get_box_filter_size(radius) ((int)(GAUSSIAN_SCALE_FACTOR * radius)) > ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > gtkcairoblur.c:37:39: note: expanded from macro 'GAUSSIAN_SCALE_FACTOR' > #define GAUSSIAN_SCALE_FACTOR ((3.0 * sqrt(2 * G_PI) / 4)) It built fine with clang 3.5 here
(In reply to Matthias Clasen from comment #33) > It built fine with clang 3.5 here Let me take that back, I managed to mangle my CC= line
Pushed a fix
What is we use a define for "sqrt(2 * G_PI)" in the get_box_filter_size instead. Then it may work as a constant for clang? I guess it doesn't really matter much though.