GNOME Bugzilla – Bug 791994
gblur-1d: Optimize by exploiting cache lines
Last modified: 2018-01-05 01:51:09 UTC
I have been studying the performance of gegl:gaussian-blur with pippin's help. My motivation is to speed-up operations like gegl:shadows-highlights and gegl:unsharp-mask, which are some of the slower operations used in gnome-photos. There is an opportunity to improve the locality of memory accesses in the IIR filter by exploiting the cache lines. Having an outer loop for each colour channel prevents us from fully utilizing the cache. Manually unrolling it and processing the entire pixel at once seems better. It would increase the size of the temporary buffer, but given that people have been keeping multiple GdkPixbuf's around in memory, I don't think it would be the end of the world if we used a "sizeof (double) * n_channels * height/width" buffer instead of "sizeof (double) * height/width". Note that until recently, the performance of the Young & van Vliet IIR blur was being dwarfed by the time taken to read and write GeglBuffer columns. It was particularly bad when it involved Babl conversions. See the series of GeglBuffer commits from pippin to address that. Here are some readings based on perf/test-blur: * Time taken to only read rows or columns and nothing else: https://rishi.fedorapeople.org/gegl-gaussian-blur-get-small-buffer * Time taken to only write rows or columns and nothing else: https://rishi.fedorapeople.org/gegl-gaussian-blur-set-small-buffer * Time taken by the entire IIR filter: https://rishi.fedorapeople.org/gegl-gaussian-blur-small-buffer ("small buffer" refers to the 1024x1024 buffers used by perf/test-blur because I was also playing with a 2048x2048 "big buffer" to see how it affects the outcome.) perf/test-unsharpmask has now gone from 150 MB/s to 163 MB/s (with pippin's GeglBuffer optimization) to 200 MB/s (with my proposed improvement to cache use).
Created attachment 366030 [details] [review] operations/common/gblur-1d: Optimize by exploiting cache lines
All the readings were taken on an Intel i7 Haswell [1] with 8G RAM and an SSD. [1] http://www.7-cpu.com/cpu/Haswell.html
Also seeing ~15% speedups for many gaussian blur code paths on a Intel(R) Core(TM) i5-7Y54 CPU @ 1.20GHz
I've been running with this patch in my tree with and seeing no other change than improved performance. commit e7f98682b9bddc7f7236fabe4cbc77e701510aa4 Author: Debarshi Ray <debarshir@gnome.org> Date: Mon Dec 25 02:47:42 2017 +0100 operations/common/gblur-1d: Optimize by exploiting cache lines Leapfrogging through the buffer for each colour channel doesn't fully utilize the data cache. When a pixel's colour channel is accessed, it is very likely that the other channels have also been read into the cache because they are on the same cache line. Therefore, computing the entire pixel at the same time avoids having to reload the same cache line again. https://bugzilla.gnome.org/show_bug.cgi?id=791994
Review of attachment 366030 [details] [review]: commited