GNOME Bugzilla – Bug 614856
Faster UTF-8 decoding routines
Last modified: 2018-05-24 12:12:16 UTC
The proposed patches speed up UTF-8 decoding by using a loop construct originally used in glibmm. In addition to optimizing existing functions like g_utf8_to_ucs4_fast(), two new inline functions are added: g_utf8_iterate() and g_utf8_iterate_back(), which combine the effects of g_utf8_get_char() and g_utf8_next_char()/g_utf8_prev_char(). To measure the effects, a performance test is added that measures the throughput of many UTF-8 decoding functions.
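For readers unfamiliar with the idea, here is a minimal sketch of a combined "decode and advance" step built on a shift-and-mask loop. It is an illustration only, not the code from the attached patches: the name sketch_utf8_iterate() is hypothetical, the input is assumed to be valid UTF-8, and no error checking is done.

```c
#include <stdint.h>

/* Hypothetical helper, not the GLib API: decode the character at *pp and
 * advance *pp past it in a single call. Assumes valid UTF-8 input. */
static inline uint32_t
sketch_utf8_iterate (const char **pp)
{
  const unsigned char *p = (const unsigned char *) *pp;
  uint32_t result = *p++;

  if (result & 0x80)
    {
      /* After each continuation byte, the next length bit of the lead
       * byte lines up with 'mask'; the loop stops once that bit is 0. */
      uint32_t mask = 0x40;

      do
        {
          result <<= 6;
          result |= (uint32_t) (*p++ & 0x3F);
          mask <<= 5;
        }
      while (result & mask);

      /* Strip the remaining lead-byte marker bits. */
      result &= mask - 1;
    }

  *pp = (const char *) p;
  return result;
}
```

A caller would walk a string with something like `while (p < end) ch = sketch_utf8_iterate (&p);` instead of pairing a get-char call with a separate next-char call for every character.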
Created attachment 157941 [details] [review] Added a performance test for UTF-8 decoding functions
Created attachment 157942 [details] [review] Added perf tests for UTF-8 string conversion functions with size specified
Created attachment 157943 [details] [review] Added a performance test for a loop of g_utf8_prev_char/g_utf8_get_char
Created attachment 157944 [details] [review] Adopted the UTF-8 decoding implementation from glibmm for g_utf8_get_char() At the suggestion of Daniel Elstner, who wrote the glibmm implementation for STL-style iterators.
Created attachment 157945 [details] [review] Optimized the overlong sequence check in g_utf8_get_char_extended() Rather than making it branch a second time to get the required sequence length for the resulting character code, we can obtain the minimum code value in the initial branching. Also documented the cases when the function returns -1.
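To illustrate the overlong check described in this patch, here is a simplified sketch that records the minimum legal code value in the same branch that classifies the lead byte, so the final check is a single comparison. It is not the patched g_utf8_get_char_extended(): the name sketch_decode_checked() is hypothetical, it handles only sequences of up to 4 bytes, and it returns -1 for every kind of failure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder, not the GLib function. Returns the decoded code
 * point, or -1 if the sequence is truncated, malformed, or overlong. */
static int32_t
sketch_decode_checked (const unsigned char *p, size_t len)
{
  uint32_t wc, min_code;
  size_t n, i;

  if (len == 0)
    return -1;

  wc = p[0];
  if (wc < 0x80)
    return (int32_t) wc;

  /* One branch yields the length, the payload bits of the lead byte,
   * and the smallest code value this length may legally encode. */
  if ((wc & 0xE0) == 0xC0)
    { n = 2; wc &= 0x1F; min_code = 0x80; }
  else if ((wc & 0xF0) == 0xE0)
    { n = 3; wc &= 0x0F; min_code = 0x800; }
  else if ((wc & 0xF8) == 0xF0)
    { n = 4; wc &= 0x07; min_code = 0x10000; }
  else
    return -1;

  if (len < n)
    return -1;

  for (i = 1; i < n; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return -1;              /* not a continuation byte */
      wc = (wc << 6) | (p[i] & 0x3F);
    }

  /* Overlong check: no second branch on the decoded value's length. */
  if (wc < min_code)
    return -1;

  return (int32_t) wc;
}
```

With this layout, an overlong encoding such as C0 80 is rejected by the final comparison alone.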
Created attachment 157946 [details] [review] Made g_utf8_to_ucs4_fast() even faster
Created attachment 157947 [details] [review] Added g_utf8_iterate()
Created attachment 157948 [details] [review] Added a functional test for g_utf8_iterate()
Created attachment 157949 [details] [review] Added performance test for g_utf8_iterate()
Created attachment 157950 [details] [review] Added g_utf8_iterate_back()
Created attachment 157951 [details] [review] Added a functional test for g_utf8_iterate_back()
Created attachment 157952 [details] [review] Added a performance test for g_utf8_iterate_back()
Created attachment 157953 [details] [review] Documented g_utf8_iterate() and g_utf8_iterate_back()
Created attachment 157954 [details] [review] Don't let g_utf8_iterate go past the end of the string in tests This should make tests friendlier to memory checking tools. I'm reasonably confident that the function returns 0 on a null byte; a test could be added specifically for that.
Created attachment 157955 [details] [review] Make g_utf8_iterate() and g_utf8_iterate_back() inline functions
Created attachment 157956 [details] [review] Make the UTF-8 decoding mask explicitly 32 bits wide Rather than relying on gunichar to be defined as gint32, the algorithm should now always work properly on 64-bit processors.
Created attachment 157957 [details] [review] Reverted the implementation of g_utf8_get_char() Reportedly, some code in Pango relies on g_utf8_get_char() returning -1 on invalid bytes in the middle of a multi-byte UTF-8 sequence. Addition of the missing check to the implementation in the parent tree negates its performance benefit.
Also available as a git branch: http://git.collabora.co.uk/?p=user/zabaluev/glib.git;a=shortlog;h=refs/heads/fast-utf8-elstner
Sorry for being a PITA, but a bug like this doesn't really get much attention. As mentioned in the thread, please file individual bugs for each logically separate issue.
(In reply to comment #19)
> a bug like this doesn't really get much attention.
> As mentioned in the thread, please file individual bugs for each logically
> separate issue.

Would there be more attention if separate bugs are filed for:
1. adding a performance test;
2. optimizing g_utf8_get_char_extended();
3. changing the implementation of g_utf8_to_ucs4_fast();
4. adding g_utf8_iterate() and g_utf8_iterate_back()?
(In reply to comment #20)
> Would there be more attention if separate bugs are filed for:

That's what I've been trying to communicate, yes.
(In reply to comment #20)
> Would there be more attention if separate bugs are filed for:
> 1. adding a performance test;
> 2. optimizing g_utf8_get_char_extended();
> 3. changing the implementation of g_utf8_to_ucs4_fast();
> 4. adding g_utf8_iterate() and g_utf8_iterate_back()?

Done, you can now check and comment on the dependency bugs.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/281.