Bug 791837 – CIE: Use a faster cbrtf implementation

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 791837 - CIE: Use a faster cbrtf implementation


Summary:	CIE: Use a faster cbrtf implementation


Status:	RESOLVED FIXED

Product:	GEGL
Classification:	Other
Component:	babl
Version:	git master
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Default Gegl Component Owner
QA Contact:	Default Gegl Component Owner

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-12-21 09:57 UTC by Debarshi Ray
Modified:	2018-04-29 07:36 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Test program used for measurements (1.49 KB, text/plain) 2017-12-21 10:04 UTC, Debarshi Ray		Details
[Hacker's Delight] CIE: Use a faster cbrtf implementation (3.06 KB, patch) 2017-12-21 10:05 UTC, Debarshi Ray	committed	Details \| Review
[Halley's Method] CIE: Use a faster cbrtf implementation (2.15 KB, patch) 2018-04-29 07:36 UTC, Debarshi Ray	rejected	Details \| Review

Description Debarshi Ray 2017-12-21 09:57:27 UTC

Conversions involving the CIE extensions are considerably slower than others at the moment because they don't have any accelerated variants. The current cbrtf implementation, borrowed from musl, seems like an impediment to vectorizing the conversions due to the three conditional branches involved.

(Note that the earlier change from non-inlined sqrt from glibc, on my system, to the inlined version from musl had led to a pretty significant speed boost. For 15 megapixels, "RGBA float" to "CIE Lab alpha float" went from 1.2s to 0.35s.)

Instead, this branchless implementation from Hacker's Delight using two Newton-Raphson iterations might simplify future vectorization:
  http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt

(The musl implementation also uses two iterations but I don't know if it is a Newton-Raphson variant or something entirely different.)

It's nice that code from Hacker's Delight is very liberally licensed:
  http://www.hackersdelight.org/permissions.htm

It is also nice that it measurably speeds up the existing scalar code paths. For 15 megapixels, "RGBA float" to "CIE Lab alpha float" went from 0.35s to 0.27s. A
"Y float" to "CIE L float" conversion takes 0.085s instead of 0.102s.

This has a positive impact on gegl:shadows-highlights, which uses two RGB(A) and Y conversions to CIE. It goes from 2.6s to 2.1s when operating on a 15 megapixel JPEG with shadows=100.0f and highlights=-100.0f.

All measurements taken on an Intel i7 Haswell.

Comment 1 Debarshi Ray 2017-12-21 10:04:52 UTC

Created attachment 365831 [details]
Test program used for measurements

Comment 2 Debarshi Ray 2017-12-21 10:05:37 UTC

Created attachment 365832 [details] [review]
[Hacker's Delight] CIE: Use a faster cbrtf implementation

Comment 3 Øyvind Kolås (pippin) 2017-12-21 15:05:49 UTC

On my system (Intel(R) Core(TM) i5-7Y54 CPU @ 1.20GHz) this is also a significant speedup:

without patch: ** Message: time: 126141,  pixels: 15728640
with patch; ** Message: time: 35021,  pixels: 15728640

Comment 4 Debarshi Ray 2017-12-21 15:20:22 UTC

On an older Intel i7 Sandybridge, for 15 megapixels, the figures are:

"RGBA float" to "CIE Lab alpha float" goes from 0.437s to 0.388s; "Y float" to "CIE L float" goes from 0.132s to 0.12s.

So, it's still faster, but by a more moderate amount.

Comment 5 Debarshi Ray 2017-12-21 22:36:13 UTC

From #gegl on GIMPNet:

15:20 <rishi> pippin: So, ok to push?
15:31 <rishi> Wow! Your i5 was really fast converting Y to L. 0.035 versus      
      0.085 on the i7.
16:13 <pippin> it isn't even a true i5 either - it is one of the odd budget     
      pretend core-series cpus
16:14 <pippin> rishi: yep, please do push, given that both of us sped up,. it   
      probably speeds up for most

Comment 6 Debarshi Ray 2018-04-28 23:49:11 UTC

The implementation of Halley's method for approximating the cube root of a single precision IEEE float used in Skia and Darktable [3] is even faster and uses even lesser number of instructions.

On an older Intel i7 Sandybridge, for 15 megapixels, "Y float" to "CIE L float" takes .088s as opposed to 0.12s with the implementation from Hacker's Delight.

Here are the generated instructions:
 * Hacker's Delight: https://godbolt.org/g/Zj8PTf
 * Halley's method: https://godbolt.org/g/CXsU9u

However, Halley's approximation is too coarse. The above monochrome conversion has an error of 0.000003, and it exceeds the threshold of 0.000005 for a "RGBA float" to "CIE Lab alpha float" conversion.

In comparison, the Hacker's Delight version is accurate upto 6 decimal places and hence no measurable error.

[1] https://en.wikipedia.org/wiki/Halley%27s_method
[2] http://www.mathpath.org/Algor/cuberoot/algor.cube.root.halley.htm
[3] Look for "709921077", "cbrt_5f" or "cbrta_halleyf"

Comment 7 Debarshi Ray 2018-04-29 07:36:14 UTC

Created attachment 371512 [details] [review]
[Halley's Method] CIE: Use a faster cbrtf implementation

(I am putting this up as a patch for anybody who wants to have a play with it.)

A second iteration of Halley's Method, absent in both Skia and Darktable, removes the error, but loses its advantage over the implementation from Hacker's Delight. The number of generated instructions is almost the same. On the aforementioned Intel i7 Sandybridge, for 15 megapixels, it's actually marginally slower.

Another small advantage of the current implementation is its ease of attribution.