Bug 614856 – Faster UTF-8 decoding routines




Summary:	Faster UTF-8 decoding routines


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	i18n
Version:	2.24.x
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:	619418 619420 619435 619437
Blocks:

Reported:	2010-04-05 08:22 UTC by Mikhail Zabaluev
Modified:	2018-05-24 12:12 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Added a performance test for UTF-8 decoding functions (6.92 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Added perf tests for UTF-8 string conversion functions with size specified (3.12 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Added a performance test for a loop of g_utf8_prev_char/g_utf8_get_char (1.40 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Adopted the UTF-8 decoding implementation from glibmm for g_utf8_get_char() (3.12 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Optimized the overlong sequence check in g_utf8_get_char_extended() (2.27 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Made g_utf8_to_ucs4_fast() even faster (1.94 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Added g_utf8_iterate() (2.55 KB, patch) 2010-04-05 08:29 UTC, Mikhail Zabaluev	none
Added a functional test for g_utf8_iterate() (1.40 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Added performance test for g_utf8_iterate() (1.27 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Added g_utf8_iterate_back() (1.98 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Added a functional test for g_utf8_iterate_back() (1.09 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Added a performance test for g_utf8_iterate_back() (1.32 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Documented g_utf8_iterate() and g_utf8_iterate_back() (2.38 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Don't let g_utf8_iterate go past the end of the string in tests (1.63 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Make g_utf8_iterate() and g_utf8_iterate_back() inline functions (5.69 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Make the UTF-8 decoding mask explicitly 32 bits wide (1.29 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none
Reverted the implementation of g_utf8_get_char() (3.19 KB, patch) 2010-04-05 08:30 UTC, Mikhail Zabaluev	none

Description Mikhail Zabaluev 2010-04-05 08:22:06 UTC

The proposed patches speed up UTF-8 decoding by using a loop construct originally used in glibmm. In addition to optimizing existing functions like g_utf8_to_ucs4_fast(), two new inline functions are added, g_utf8_iterate() and g_utf8_iterate_back(), which combine the effects of g_utf8_get_char() and g_utf8_next_char()/g_utf8_prev_char().
To quantify the effects, a performance test is added that measures the throughput of many UTF-8 decoding functions.
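
For readers without access to the attachments, the following is a rough sketch of what a combined decode-and-advance function could look like. It is not taken from the patches: the name utf8_iterate_sketch() is hypothetical, the loop is only modelled on the style of decoder described above, and it assumes the input is valid UTF-8 (no validation is performed).

```c
#include <stdint.h>

/* Hypothetical stand-in for the proposed g_utf8_iterate(); NOT the patch code.
 * Decodes the character at *pp and advances *pp past it.
 * Assumes valid UTF-8: malformed input is not detected. */
static inline uint32_t
utf8_iterate_sketch (const char **pp)
{
  const unsigned char *p = (const unsigned char *) *pp;
  uint32_t result = *p;

  if (result & 0x80)
    {
      /* mask tracks the lead-byte marker bit that says "one more byte follows";
       * it moves up by 5 bits each round (6 payload bits in, 1 marker bit used). */
      uint32_t mask = 0x40;

      do
        {
          result <<= 6;
          result |= *++p & 0x3f;   /* append 6 payload bits */
          mask <<= 5;
        }
      while (result & mask);

      result &= mask - 1;          /* strip the remaining lead-byte marker bits */
    }

  *pp = (const char *) (p + 1);
  return result;
}
```

A caller would walk a string with something like `while ((c = utf8_iterate_sketch (&p)) != 0) ...`, replacing the usual pair of g_utf8_get_char() and g_utf8_next_char() calls with a single pass over the bytes.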

Comment 1 Mikhail Zabaluev 2010-04-05 08:29:35 UTC

Created attachment 157941 [details] [review]
Added a performance test for UTF-8 decoding functions

Comment 2 Mikhail Zabaluev 2010-04-05 08:29:39 UTC

Created attachment 157942 [details] [review]
Added perf tests for UTF-8 string conversion functions with size specified

Comment 3 Mikhail Zabaluev 2010-04-05 08:29:43 UTC

Created attachment 157943 [details] [review]
Added a performance test for a loop of g_utf8_prev_char/g_utf8_get_char

Comment 4 Mikhail Zabaluev 2010-04-05 08:29:46 UTC

Created attachment 157944 [details] [review]
Adopted the UTF-8 decoding implementation from glibmm for g_utf8_get_char()

At the suggestion of Daniel Elstner, who wrote the glibmm implementation
for STL-style iterators.

Comment 5 Mikhail Zabaluev 2010-04-05 08:29:51 UTC

Created attachment 157945 [details] [review]
Optimized the overlong sequence check in g_utf8_get_char_extended()

Rather than making it branch again to get the due sequence length for the
resulting character code, we can just as well get the minimum code value in
the initial branching.

Also documented the cases when the function returns -1.
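
To illustrate the idea, here is a minimal sketch of a strict decoder that picks the minimum code value in the same branch that determines the sequence length. It is not the code of g_utf8_get_char_extended(): the function name, the single -1 error value, and the restriction to sequences of at most 4 bytes are assumptions made for the example, and checks for surrogates or values above U+10FFFF are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict decoder; NOT GLib's g_utf8_get_char_extended().
 * Returns the code point, or -1 for an invalid, truncated or overlong
 * sequence. Only sequences of up to 4 bytes are handled. */
static int32_t
utf8_decode_strict_sketch (const unsigned char *p, size_t avail)
{
  uint32_t wc, min_code;
  size_t len, i;

  if (avail == 0)
    return -1;

  wc = p[0];

  /* One branch yields the length, the payload bits of the lead byte,
   * and the smallest code point that legitimately needs that length. */
  if (wc < 0x80)
    return (int32_t) wc;
  else if (wc < 0xc0)
    return -1;                      /* stray continuation byte */
  else if (wc < 0xe0)
    { len = 2; wc &= 0x1f; min_code = 0x80; }
  else if (wc < 0xf0)
    { len = 3; wc &= 0x0f; min_code = 0x800; }
  else if (wc < 0xf8)
    { len = 4; wc &= 0x07; min_code = 0x10000; }
  else
    return -1;

  if (avail < len)
    return -1;                      /* truncated sequence */

  for (i = 1; i < len; i++)
    {
      if ((p[i] & 0xc0) != 0x80)
        return -1;                  /* not a continuation byte */
      wc = (wc << 6) | (p[i] & 0x3f);
    }

  /* Overlong check: a single comparison, no second length lookup. */
  if (wc < min_code)
    return -1;

  return (int32_t) wc;
}
```

The point of the arrangement is that the overlong test at the end is one comparison against a value already at hand, instead of a second chain of branches that recomputes the expected length from the decoded code point.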

Comment 6 Mikhail Zabaluev 2010-04-05 08:29:55 UTC

Created attachment 157946 [details] [review]
Made g_utf8_to_ucs4_fast() even faster

Comment 7 Mikhail Zabaluev 2010-04-05 08:29:58 UTC

Created attachment 157947 [details] [review]
Added g_utf8_iterate()

Comment 8 Mikhail Zabaluev 2010-04-05 08:30:02 UTC

Created attachment 157948 [details] [review]
Added a functional test for g_utf8_iterate()

Comment 9 Mikhail Zabaluev 2010-04-05 08:30:07 UTC

Created attachment 157949 [details] [review]
Added performance test for g_utf8_iterate()

Comment 10 Mikhail Zabaluev 2010-04-05 08:30:10 UTC

Created attachment 157950 [details] [review]
Added g_utf8_iterate_back()
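
As a rough illustration of the backward direction, the sketch below steps a pointer back by one character and returns the decoded value in a single pass. It is not the attached patch: utf8_iterate_back_sketch() is a hypothetical name, and the code assumes valid UTF-8 and that the pointer is not already at the start of the buffer.

```c
#include <stdint.h>

/* Hypothetical stand-in for the proposed g_utf8_iterate_back(); NOT the patch.
 * Moves *pp back to the start of the previous character and returns it.
 * Assumes valid UTF-8 and that *pp is not at the start of the buffer. */
static inline uint32_t
utf8_iterate_back_sketch (const char **pp)
{
  const unsigned char *p = (const unsigned char *) *pp;
  uint32_t result = 0;
  unsigned int shift = 0;
  unsigned int n = 0;               /* continuation bytes seen so far */
  unsigned char c = *--p;

  /* Collect continuation bytes (10xxxxxx) from last to first. */
  while ((c & 0xc0) == 0x80)
    {
      result |= (uint32_t) (c & 0x3f) << shift;
      shift += 6;
      n++;
      c = *--p;
    }

  if (n == 0)
    result = c;                     /* single-byte (ASCII) character */
  else
    /* A lead byte followed by n continuation bytes carries 6 - n payload bits. */
    result |= (uint32_t) (c & (0x3f >> n)) << shift;

  *pp = (const char *) p;
  return result;
}
```

This replaces the usual g_utf8_prev_char() followed by g_utf8_get_char(), which would scan the same bytes twice.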

Comment 11 Mikhail Zabaluev 2010-04-05 08:30:14 UTC

Created attachment 157951 [details] [review]
Added a functional test for g_utf8_iterate_back()

Comment 12 Mikhail Zabaluev 2010-04-05 08:30:19 UTC

Created attachment 157952 [details] [review]
Added a performance test for g_utf8_iterate_back()

Comment 13 Mikhail Zabaluev 2010-04-05 08:30:23 UTC

Created attachment 157953 [details] [review]
Documented g_utf8_iterate() and g_utf8_iterate_back()

Comment 14 Mikhail Zabaluev 2010-04-05 08:30:27 UTC

Created attachment 157954 [details] [review]
Don't let g_utf8_iterate go past the end of the string in tests

This should make tests friendlier to memory checking tools.
I'm reasonably confident that the function returns 0 on a null byte;
a test could be added specifically for that.

Comment 15 Mikhail Zabaluev 2010-04-05 08:30:33 UTC

Created attachment 157955 [details] [review]
Make g_utf8_iterate() and g_utf8_iterate_back() inline functions

Comment 16 Mikhail Zabaluev 2010-04-05 08:30:39 UTC

Created attachment 157956 [details] [review]
Make the UTF-8 decoding mask explicitly 32 bits wide

The algorithm no longer relies on gunichar being defined as guint32, so it
should now always work properly on 64-bit processors.

Comment 17 Mikhail Zabaluev 2010-04-05 08:30:43 UTC

Created attachment 157957 [details] [review]
Reverted the implementation of g_utf8_get_char()

Reportedly, some code in Pango relies on g_utf8_get_char() returning -1
on invalid bytes in the middle of a multi-byte UTF-8 sequence.
Adding the missing check to the implementation in the parent tree
negates its performance benefit.

Comment 18 Mikhail Zabaluev 2010-04-05 08:41:10 UTC

Also available as a git branch:
http://git.collabora.co.uk/?p=user/zabaluev/glib.git;a=shortlog;h=refs/heads/fast-utf8-elstner

Comment 19 Behdad Esfahbod 2010-04-08 23:20:47 UTC

Sorry for being a PITA, but a bug like this doesn't really get much attention.  As mentioned in the thread, please file individual bugs for each logically separate issue.

Comment 20 Mikhail Zabaluev 2010-04-09 10:32:12 UTC

(In reply to comment #19)
> a bug like this doesn't really get much attention.
> As mentioned in the thread, please file individual bugs for each logically
> separate issue.

Would there be more attention if separate bugs are filed for:
1. adding a performance test;
2. optimizing g_utf8_get_char_extended();
3. changing the implementation of g_utf8_to_ucs4_fast();
4. adding g_utf8_iterate() and g_utf8_iterate_back()?

Comment 21 Behdad Esfahbod 2010-04-10 05:56:28 UTC

(In reply to comment #20)

> Would there be more attention if separate bugs are filed for:

That's what I've been trying to communicate, yes.

Comment 22 Mikhail Zabaluev 2010-05-23 13:27:51 UTC

(In reply to comment #20)
> Would there be more attention if separate bugs are filed for:
> 1. adding a performance test;
> 2. optimizing g_utf8_get_char_extended();
> 3. changing the implementation of g_utf8_to_ucs4_fast();
> 4. adding g_utf8_iterate() and g_utf8_iterate_back()?

Done, you can now check and comment on the dependency bugs.

Comment 23 GNOME Infrastructure Team 2018-05-24 12:12:16 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/281.