GNOME Bugzilla – Bug 791837
CIE: Use a faster cbrtf implementation
Last modified: 2018-04-29 07:36:14 UTC
Conversions involving the CIE extensions are considerably slower than others at the moment because they don't have any accelerated variants. The current cbrtf implementation, borrowed from musl, seems like an impediment to vectorizing the conversions due to the three conditional branches involved. (Note that the earlier change from the non-inlined cbrtf from glibc, on my system, to the inlined version from musl had led to a pretty significant speed boost. For 15 megapixels, "RGBA float" to "CIE Lab alpha float" went from 1.2s to 0.35s.)

Instead, this branchless implementation from Hacker's Delight, which uses two Newton-Raphson iterations, might simplify future vectorization: http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt (The musl implementation also uses two iterations, but I don't know if it is a Newton-Raphson variant or something entirely different.)

It's nice that code from Hacker's Delight is very liberally licensed: http://www.hackersdelight.org/permissions.htm

It is also nice that it measurably speeds up the existing scalar code paths. For 15 megapixels, "RGBA float" to "CIE Lab alpha float" went from 0.35s to 0.27s. A "Y float" to "CIE L float" conversion takes 0.085s instead of 0.102s.

This has a positive impact on gegl:shadows-highlights, which uses two RGB(A) and Y conversions to CIE. It goes from 2.6s to 2.1s when operating on a 15 megapixel JPEG with shadows=100.0f and highlights=-100.0f.

All measurements were taken on an Intel i7 Haswell.
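The branchless technique can be sketched as follows. This is an illustrative sketch, not the exact code from the patch: the magic constant 709921077 is the widely used bit-hack value (it also appears in the Skia/Darktable variants), the memcpy-based type punning is my choice, and it only handles positive finite inputs.

```c
#include <stdint.h>
#include <string.h>

/* Branchless cbrtf sketch in the spirit of Hacker's Delight's acbrt:
 * derive an initial guess by integer-manipulating the float's bit
 * pattern (roughly dividing the biased exponent by 3), then refine it
 * with two Newton-Raphson iterations of y -> (2*y + x/(y*y)) / 3.
 * Positive finite inputs only. */
static float
fast_cbrtf (float x)
{
  uint32_t ix;
  float y;

  memcpy (&ix, &x, sizeof ix);       /* type-pun without aliasing UB */
  ix = ix / 3 + 709921077u;          /* crude cube root of the bits  */
  memcpy (&y, &ix, sizeof y);

  y = (2.0f * y + x / (y * y)) * (1.0f / 3.0f);  /* Newton step 1 */
  y = (2.0f * y + x / (y * y)) * (1.0f / 3.0f);  /* Newton step 2 */
  return y;
}
```

With no branches and only adds, multiplies and two divisions, a loop over pixels calling this should be straightforward for a compiler to vectorize.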
Created attachment 365831 [details] Test program used for measurements
Created attachment 365832 [details] [review] [Hacker's Delight] CIE: Use a faster cbrtf implementation
On my system (Intel(R) Core(TM) i5-7Y54 CPU @ 1.20GHz) this is also a significant speedup:
without patch: ** Message: time: 126141, pixels: 15728640
with patch: ** Message: time: 35021, pixels: 15728640
On an older Intel i7 Sandybridge, for 15 megapixels, the figures are: "RGBA float" to "CIE Lab alpha float" goes from 0.437s to 0.388s; "Y float" to "CIE L float" goes from 0.132s to 0.12s. So, it's still faster, but by a more moderate amount.
From #gegl on GIMPNet:
15:20 <rishi> pippin: So, ok to push?
15:31 <rishi> Wow! Your i5 was really fast converting Y to L. 0.035 versus 0.085 on the i7.
16:13 <pippin> it isn't even a true i5 either - it is one of the odd budget pretend core-series cpus
16:14 <pippin> rishi: yep, please do push, given that both of us sped up, it probably speeds up for most
The implementation of Halley's method [1][2] for approximating the cube root of a single precision IEEE float used in Skia and Darktable [3] is even faster and uses even fewer instructions. On an older Intel i7 Sandybridge, for 15 megapixels, "Y float" to "CIE L float" takes 0.088s as opposed to 0.12s with the implementation from Hacker's Delight. Here are the generated instructions:
* Hacker's Delight: https://godbolt.org/g/Zj8PTf
* Halley's method: https://godbolt.org/g/CXsU9u
However, Halley's approximation is too coarse. The above monochrome conversion has an error of 0.000003, and the "RGBA float" to "CIE Lab alpha float" conversion exceeds the threshold of 0.000005. In comparison, the Hacker's Delight version is accurate up to 6 decimal places and hence shows no measurable error.
[1] https://en.wikipedia.org/wiki/Halley%27s_method
[2] http://www.mathpath.org/Algor/cuberoot/algor.cube.root.halley.htm
[3] Look for "709921077", "cbrt_5f" or "cbrta_halleyf"
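For reference, the Skia/Darktable-style variant can be sketched like this. This is my reconstruction from the identifiers in [3], not the verbatim upstream code, and again it only handles positive finite inputs.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the cbrt_5f/cbrta_halleyf-style cbrtf: the same bit-hack
 * initial guess, refined with a single Halley iteration
 * y -> y * (y^3 + 2*x) / (2*y^3 + x) instead of two Newton steps.
 * Halley's method converges cubically, so one step nearly suffices --
 * but, as noted above, not quite within the 0.000005 threshold. */
static float
cbrtf_halley1 (float x)
{
  uint32_t ix;
  float y, y3;

  memcpy (&ix, &x, sizeof ix);
  ix = ix / 3 + 709921077u;          /* the "709921077" magic from [3] */
  memcpy (&y, &ix, sizeof y);

  y3 = y * y * y;
  return y * (y3 + 2.0f * x) / (2.0f * y3 + x);  /* one Halley step */
}
```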
Created attachment 371512 [details] [review] [Halley's Method] CIE: Use a faster cbrtf implementation

(I am putting this up as a patch for anybody who wants to have a play with it.) A second iteration of Halley's method, absent in both Skia and Darktable, removes the error, but loses the advantage over the implementation from Hacker's Delight: the number of generated instructions is almost the same, and on the aforementioned Intel i7 Sandybridge, for 15 megapixels, it is actually marginally slower. Another small advantage of the current implementation is its ease of attribution.
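The two-iteration variant described above can be sketched as follows, with the same caveats as before: the bit-hack constant is the commonly used one rather than necessarily the patch's, and positive finite inputs only.

```c
#include <stdint.h>
#include <string.h>

/* Two Halley iterations after the bit-hack guess: the second step
 * removes the residual error, at the cost of roughly matching the
 * instruction count of the two-step Newton-Raphson version.
 * Illustrative sketch, positive finite inputs only. */
static float
cbrtf_halley2 (float x)
{
  uint32_t ix;
  float y, y3;

  memcpy (&ix, &x, sizeof ix);
  ix = ix / 3 + 709921077u;
  memcpy (&y, &ix, sizeof y);

  y3 = y * y * y;
  y = y * (y3 + 2.0f * x) / (2.0f * y3 + x);     /* Halley step 1 */
  y3 = y * y * y;
  y = y * (y3 + 2.0f * x) / (2.0f * y3 + x);     /* Halley step 2 */
  return y;
}
```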