GNOME Bugzilla – Bug 738504
Optimize UTF-8 decoding by unrolling branches and expressions
Last modified: 2015-09-13 17:05:42 UTC
Some improvement, noticeable on the utf8-performance benchmark, can be reached by unrolling loops and expressions, and minimizing the amount of logic used to validate sequences.
Created attachment 288434 [details] [review] Unrolled implementation of g_utf8_to_ucs4_fast() Unrolling the branches and expressions for all expected cases of UTF-8 sequences facilitates the work of both an optimizing compiler and the branch prediction logic in the CPU. This speeds up decoding noticeably on text composed primarily of longer sequences.
Created attachment 288435 [details] [review] Added g_utf8_validate() to UTF-8 performance testing
Created attachment 288436 [details] [review] Optimized branching in g_utf8_validate() The number of branches and logical operations can be reduced by never producing a resulting wide character value to check its range. Instead, individual bytes in the sequence are validated depending on the branch taken on the basis of preceding bytes. The syntax given in RFC 3629 is made use of.
Created attachment 288437 [details] [review] Unrolled implementation of g_utf8_to_ucs4_fast() Added a G_UNLIKELY hint
Created attachment 288551 [details] [review] Reorganized utf8-performance tests Now each function-string pair gets its own test path to track a single performance result.
Review of attachment 288435 [details] [review]: OK.
Comment on attachment 288435 [details] [review] Added g_utf8_validate() to UTF-8 performance testing Attachment 288435 [details] pushed as 5bc0bc2 - Added g_utf8_validate() to UTF-8 performance testing
I've committed these now, thanks. Attachment 288436 [details] pushed as 3188b8e - Optimized branching in g_utf8_validate() Attachment 288437 [details] pushed as b963565 - Unrolled implementation of g_utf8_to_ucs4_fast() Attachment 288551 [details] pushed as 401f786 - Reorganized utf8-performance tests
This needs to be backed out, it's breaking apps: bug 754897 (vte) and bug 754887 (gtksourceview).
The 'optimised' version of g_utf8_validate simply is wrong. Instrumenting vte I got many examples of the old and new version disagreeing on return value or the location of the last valid byte in the @end out-parameter. (As for the other patch in this bug, I note that one testcase already had to be changed (commit e773acfe9a0f8cf4d67799f6177997bd8a761ede) because the new version didn't work with the 5- and 6-byte UTF-8 sequences that the old version accepted just fine. So even though that wasn't technically valid input, I'm not sure the new version is sufficiently compatible.)
(In reply to Christian Persch from comment #10) > The 'optimised' version of g_utf8_validate simply is wrong. Instrumenting > vte I got many examples of the old and new version disagreeing on return > value or the location of the last valid byte in the @end out-parameter. Can you refer to specific examples? > (As for the other patch in this bug, I note that one testcase already had to > be changed (commit e773acfe9a0f8cf4d67799f6177997bd8a761ede) because the new > version didn't work with the 5- and 6-byte UTF-8 sequences that the old > version accepted just fine. So even though that wasn't technically valid > input, I'm not sure the new version is sufficiently compatible.) I'm of an opinion that any code relying on a particular behavior of non-validating functions over invalid UTF-8 sequences is wrong.
For example: 98 9a 05 08 34 b0 92 92 8a 84 b9 21 a3 88 45 67 55 61 10 bf 98 21 63 0f b8 8d 44 24 37 a3 a0 04 46 08 ...
*** Bug 754897 has been marked as a duplicate of this bug. ***
Created attachment 311202 [details] [review] g_utf8_validate: fix a regression A recent change permitted some characters from range 0x80-0xbf as would-be valid sequence starters for length 2, as long as continuation characters were OK.
(In reply to Christian Persch from comment #12) > For example: > 98 9a 05 08 34 > b0 92 > 92 8a > 84 b9 21 > a3 88 45 > 67 55 61 10 bf 98 > 21 63 0f b8 8d > 44 24 37 a3 a0 04 46 08 Thanks. As the bug has slipped through the regression tests, I should add some more.
Created attachment 311203 [details] [review] glib/tests/utf8-validate: add another test for invalid continuation bytes This would have caught the regression committed in the course of bug #738504.
*** Bug 754887 has been marked as a duplicate of this bug. ***
Thanks for the extra tests. Applied. Attachment 311202 [details] pushed as d1f4d4a - g_utf8_validate: fix a regression Attachment 311203 [details] pushed as 8ab28b4 - glib/tests/utf8-validate: add another test for invalid continuation bytes