Bug 738504 – Optimize UTF-8 decoding by unrolling branches and expressions

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 738504 - Optimize UTF-8 decoding by unrolling branches and expressions


Summary:	Optimize UTF-8 decoding by unrolling branches and expressions


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	i18n
Version:	2.42.x
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	754887 754897 (view as bug list)
Depends on:
Blocks:

Reported:	2014-10-14 06:29 UTC by Mikhail Zabaluev
Modified:	2015-09-13 17:05 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Unrolled implementation of g_utf8_to_ucs4_fast() (3.21 KB, patch) 2014-10-14 06:30 UTC, Mikhail Zabaluev	none	Details \| Review
Added g_utf8_validate() to UTF-8 performance testing (1.63 KB, patch) 2014-10-14 06:30 UTC, Mikhail Zabaluev	committed	Details \| Review
Optimized branching in g_utf8_validate() (5.88 KB, patch) 2014-10-14 06:30 UTC, Mikhail Zabaluev	committed	Details \| Review
Unrolled implementation of g_utf8_to_ucs4_fast() (3.22 KB, patch) 2014-10-14 06:53 UTC, Mikhail Zabaluev	committed	Details \| Review
Reorganized utf8-performance tests (7.51 KB, patch) 2014-10-14 20:41 UTC, Mikhail Zabaluev	committed	Details \| Review
g_utf8_validate: fix a regression (1.10 KB, patch) 2015-09-12 08:19 UTC, Mikhail Zabaluev	committed	Details \| Review
glib/tests/utf8-validate: add another test for invalid continuation bytes (841 bytes, patch) 2015-09-12 09:06 UTC, Mikhail Zabaluev	committed	Details \| Review

Description Mikhail Zabaluev 2014-10-14 06:29:28 UTC

Some improvement, noticeable on the utf8-performance benchmark, can be reached by unrolling loops and expressions, and minimizing the amount of logic used to
validate sequences.

Comment 1 Mikhail Zabaluev 2014-10-14 06:30:32 UTC

Created attachment 288434 [details] [review]
Unrolled implementation of g_utf8_to_ucs4_fast()

Unrolling the branches and expressions for all expected cases
of UTF-8 sequences facilitates the work of both an optimizing compiler
and the branch prediction logic in the CPU. This speeds up decoding
noticeably on text composed primarily of longer sequences.

Comment 2 Mikhail Zabaluev 2014-10-14 06:30:37 UTC

Created attachment 288435 [details] [review]
Added g_utf8_validate() to UTF-8 performance testing

Comment 3 Mikhail Zabaluev 2014-10-14 06:30:43 UTC

Created attachment 288436 [details] [review]
Optimized branching in g_utf8_validate()

The number of branches and logical operations can be reduced by
never producing a resulting wide character value to check its range.
Instead, individual bytes in the sequence are validated
depending on the branch taken on the basis of preceding bytes.
The syntax given in RFC 3629 is made use of.

Comment 4 Mikhail Zabaluev 2014-10-14 06:53:50 UTC

Created attachment 288437 [details] [review]
Unrolled implementation of g_utf8_to_ucs4_fast()

Added a G_UNLIKELY hint

Comment 5 Mikhail Zabaluev 2014-10-14 20:41:41 UTC

Created attachment 288551 [details] [review]
Reorganized utf8-performance tests

Now each function-string pair gets its own test path to track
a single performance result.

Comment 6 Allison Karlitskaya (desrt) 2014-10-15 08:17:24 UTC

Review of attachment 288435 [details] [review]:

OK.

Comment 7 Matthias Clasen 2015-06-05 19:41:32 UTC

Comment on attachment 288435 [details] [review]
Added g_utf8_validate() to UTF-8 performance testing

Attachment 288435 [details] pushed as 5bc0bc2 - Added g_utf8_validate() to UTF-8 performance testing

Comment 8 Matthias Clasen 2015-09-05 17:15:44 UTC

I've committed these now, thanks.

Attachment 288436 [details] pushed as 3188b8e - Optimized branching in g_utf8_validate()
Attachment 288437 [details] pushed as b963565 - Unrolled implementation of g_utf8_to_ucs4_fast()
Attachment 288551 [details] pushed as 401f786 - Reorganized utf8-performance tests

Comment 9 Christian Persch 2015-09-11 18:23:26 UTC

This needs to be backed out, it's breaking apps: bug 754897 (vte) and bug 754887 (gtksourceview).

Comment 10 Christian Persch 2015-09-11 19:14:09 UTC

The 'optimised' version of g_utf8_validate simply is wrong. Instrumenting vte I got many examples of the old and new version disagreeing on return value or the location of the last valid byte in the @end out-parameter.

(As for the other patch in this bug, I note that one testcase already had to be changed (commit e773acfe9a0f8cf4d67799f6177997bd8a761ede) because the new version didn't work with the 5- and 6-byte UTF-8 sequences that the old version accepted just fine. So even though that wasn't technically valid input, I'm not sure the new version is sufficiently compatible.)

Comment 11 Mikhail Zabaluev 2015-09-11 19:53:16 UTC

(In reply to Christian Persch from comment #10)
> The 'optimised' version of g_utf8_validate simply is wrong. Instrumenting
> vte I got many examples of the old and new version disagreeing on return
> value or the location of the last valid byte in the @end out-parameter.

Can you refer to specific examples?

> (As for the other patch in this bug, I note that one testcase already had to
> be changed (commit e773acfe9a0f8cf4d67799f6177997bd8a761ede) because the new
> version didn't work with the 5- and 6-byte UTF-8 sequences that the old
> version accepted just fine. So even though that wasn't technically valid
> input, I'm not sure the new version is sufficiently compatible.)

I'm of an opinion that any code relying on a particular behavior of non-validating functions over invalid UTF-8 sequences is wrong.

Comment 12 Christian Persch 2015-09-11 20:36:40 UTC

For example:
98 9a 05 08 34 
b0 92 
92 8a 
84 b9 21 
a3 88 45 
67 55 61 10 bf 98 
21 63 0f b8 8d 
44 24 37 a3 a0 04 46 08 
...

Comment 13 Christian Persch 2015-09-12 06:07:39 UTC

*** Bug 754897 has been marked as a duplicate of this bug. ***

Comment 14 Mikhail Zabaluev 2015-09-12 08:19:27 UTC

Created attachment 311202 [details] [review]
g_utf8_validate: fix a regression

A recent change permitted some characters from range 0x80-0xbf as
would-be valid sequence starters for length 2, as long as
continuation characters were OK.

Comment 15 Mikhail Zabaluev 2015-09-12 08:22:23 UTC

(In reply to Christian Persch from comment #12)
> For example:
> 98 9a 05 08 34 
> b0 92 
> 92 8a 
> 84 b9 21 
> a3 88 45 
> 67 55 61 10 bf 98 
> 21 63 0f b8 8d 
> 44 24 37 a3 a0 04 46 08 

Thanks. As the bug has slipped through the regression tests, I should add some more.

Comment 16 Mikhail Zabaluev 2015-09-12 09:06:19 UTC

Created attachment 311203 [details] [review]
glib/tests/utf8-validate: add another test for invalid continuation bytes

This would have caught the regression committed in the course of
bug #738504.

Comment 17 Sébastien Wilmet 2015-09-12 09:42:17 UTC

*** Bug 754887 has been marked as a duplicate of this bug. ***

Comment 18 Matthias Clasen 2015-09-13 17:05:35 UTC

Thanks for the extra tests. Applied.
Attachment 311202 [details] pushed as d1f4d4a - g_utf8_validate: fix a regression
Attachment 311203 [details] pushed as 8ab28b4 - glib/tests/utf8-validate: add another test for invalid continuation bytes