After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 738504 - Optimize UTF-8 decoding by unrolling branches and expressions
Optimize UTF-8 decoding by unrolling branches and expressions
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: i18n
2.42.x
Other All
: Normal enhancement
: ---
Assigned To: gtkdev
gtkdev
: 754887 754897 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2014-10-14 06:29 UTC by Mikhail Zabaluev
Modified: 2015-09-13 17:05 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Unrolled implementation of g_utf8_to_ucs4_fast() (3.21 KB, patch)
2014-10-14 06:30 UTC, Mikhail Zabaluev
none Details | Review
Added g_utf8_validate() to UTF-8 performance testing (1.63 KB, patch)
2014-10-14 06:30 UTC, Mikhail Zabaluev
committed Details | Review
Optimized branching in g_utf8_validate() (5.88 KB, patch)
2014-10-14 06:30 UTC, Mikhail Zabaluev
committed Details | Review
Unrolled implementation of g_utf8_to_ucs4_fast() (3.22 KB, patch)
2014-10-14 06:53 UTC, Mikhail Zabaluev
committed Details | Review
Reorganized utf8-performance tests (7.51 KB, patch)
2014-10-14 20:41 UTC, Mikhail Zabaluev
committed Details | Review
g_utf8_validate: fix a regression (1.10 KB, patch)
2015-09-12 08:19 UTC, Mikhail Zabaluev
committed Details | Review
glib/tests/utf8-validate: add another test for invalid continuation bytes (841 bytes, patch)
2015-09-12 09:06 UTC, Mikhail Zabaluev
committed Details | Review

Description Mikhail Zabaluev 2014-10-14 06:29:28 UTC
Some improvement, noticeable on the utf8-performance benchmark, can be reached by unrolling loops and expressions, and minimizing the amount of logic used to
validate sequences.
Comment 1 Mikhail Zabaluev 2014-10-14 06:30:32 UTC
Created attachment 288434 [details] [review]
Unrolled implementation of g_utf8_to_ucs4_fast()

Unrolling the branches and expressions for all expected cases
of UTF-8 sequences facilitates the work of both an optimizing compiler
and the branch prediction logic in the CPU. This speeds up decoding
noticeably on text composed primarily of longer sequences.
Comment 2 Mikhail Zabaluev 2014-10-14 06:30:37 UTC
Created attachment 288435 [details] [review]
Added g_utf8_validate() to UTF-8 performance testing
Comment 3 Mikhail Zabaluev 2014-10-14 06:30:43 UTC
Created attachment 288436 [details] [review]
Optimized branching in g_utf8_validate()

The number of branches and logical operations can be reduced by
never producing a resulting wide character value to check its range.
Instead, individual bytes in the sequence are validated
depending on the branch taken on the basis of preceding bytes.
The syntax given in RFC 3629 is made use of.
Comment 4 Mikhail Zabaluev 2014-10-14 06:53:50 UTC
Created attachment 288437 [details] [review]
Unrolled implementation of g_utf8_to_ucs4_fast()

Added a G_UNLIKELY hint
Comment 5 Mikhail Zabaluev 2014-10-14 20:41:41 UTC
Created attachment 288551 [details] [review]
Reorganized utf8-performance tests

Now each function-string pair gets its own test path to track
a single performance result.
Comment 6 Allison Karlitskaya (desrt) 2014-10-15 08:17:24 UTC
Review of attachment 288435 [details] [review]:

OK.
Comment 7 Matthias Clasen 2015-06-05 19:41:32 UTC
Comment on attachment 288435 [details] [review]
Added g_utf8_validate() to UTF-8 performance testing

Attachment 288435 [details] pushed as 5bc0bc2 - Added g_utf8_validate() to UTF-8 performance testing
Comment 8 Matthias Clasen 2015-09-05 17:15:44 UTC
I've committed these now, thanks.

Attachment 288436 [details] pushed as 3188b8e - Optimized branching in g_utf8_validate()
Attachment 288437 [details] pushed as b963565 - Unrolled implementation of g_utf8_to_ucs4_fast()
Attachment 288551 [details] pushed as 401f786 - Reorganized utf8-performance tests
Comment 9 Christian Persch 2015-09-11 18:23:26 UTC
This needs to be backed out, it's breaking apps: bug 754897 (vte) and bug 754887 (gtksourceview).
Comment 10 Christian Persch 2015-09-11 19:14:09 UTC
The 'optimised' version of g_utf8_validate simply is wrong. Instrumenting vte I got many examples of the old and new version disagreeing on return value or the location of the last valid byte in the @end out-parameter.

(As for the other patch in this bug, I note that one testcase already had to be changed (commit e773acfe9a0f8cf4d67799f6177997bd8a761ede) because the new version didn't work with the 5- and 6-byte UTF-8 sequences that the old version accepted just fine. So even though that wasn't technically valid input, I'm not sure the new version is sufficiently compatible.)
Comment 11 Mikhail Zabaluev 2015-09-11 19:53:16 UTC
(In reply to Christian Persch from comment #10)
> The 'optimised' version of g_utf8_validate simply is wrong. Instrumenting
> vte I got many examples of the old and new version disagreeing on return
> value or the location of the last valid byte in the @end out-parameter.

Can you refer to specific examples?

> (As for the other patch in this bug, I note that one testcase already had to
> be changed (commit e773acfe9a0f8cf4d67799f6177997bd8a761ede) because the new
> version didn't work with the 5- and 6-byte UTF-8 sequences that the old
> version accepted just fine. So even though that wasn't technically valid
> input, I'm not sure the new version is sufficiently compatible.)

I'm of an opinion that any code relying on a particular behavior of non-validating functions over invalid UTF-8 sequences is wrong.
Comment 12 Christian Persch 2015-09-11 20:36:40 UTC
For example:
98 9a 05 08 34 
b0 92 
92 8a 
84 b9 21 
a3 88 45 
67 55 61 10 bf 98 
21 63 0f b8 8d 
44 24 37 a3 a0 04 46 08 
...
Comment 13 Christian Persch 2015-09-12 06:07:39 UTC
*** Bug 754897 has been marked as a duplicate of this bug. ***
Comment 14 Mikhail Zabaluev 2015-09-12 08:19:27 UTC
Created attachment 311202 [details] [review]
g_utf8_validate: fix a regression

A recent change permitted some characters from range 0x80-0xbf as
would-be valid sequence starters for length 2, as long as
continuation characters were OK.
Comment 15 Mikhail Zabaluev 2015-09-12 08:22:23 UTC
(In reply to Christian Persch from comment #12)
> For example:
> 98 9a 05 08 34 
> b0 92 
> 92 8a 
> 84 b9 21 
> a3 88 45 
> 67 55 61 10 bf 98 
> 21 63 0f b8 8d 
> 44 24 37 a3 a0 04 46 08 

Thanks. As the bug has slipped through the regression tests, I should add some more.
Comment 16 Mikhail Zabaluev 2015-09-12 09:06:19 UTC
Created attachment 311203 [details] [review]
glib/tests/utf8-validate: add another test for invalid continuation bytes

This would have caught the regression committed in the course of
bug #738504.
Comment 17 Sébastien Wilmet 2015-09-12 09:42:17 UTC
*** Bug 754887 has been marked as a duplicate of this bug. ***
Comment 18 Matthias Clasen 2015-09-13 17:05:35 UTC
Thanks for the extra tests. Applied.
Attachment 311202 [details] pushed as d1f4d4a - g_utf8_validate: fix a regression
Attachment 311203 [details] pushed as 8ab28b4 - glib/tests/utf8-validate: add another test for invalid continuation bytes