GNOME Bugzilla – Bug 619437
New inline functions for iteration over UTF-8
Last modified: 2018-05-24 12:19:49 UTC
Split out from bug #614856: patches adding two inline functions for faster iteration over UTF-8 characters, g_utf8_iterate() and g_utf8_iterate_back().
Created attachment 161789 [details] [review] Added g_utf8_iterate()
Created attachment 161790 [details] [review] Added a functional test for g_utf8_iterate()
Created attachment 161791 [details] [review] Added performance test for g_utf8_iterate()
Created attachment 161792 [details] [review] Added g_utf8_iterate_back()
Created attachment 161793 [details] [review] Added a functional test for g_utf8_iterate_back()
Created attachment 161794 [details] [review] Added a performance test for g_utf8_iterate_back()
Created attachment 161795 [details] [review] Documented g_utf8_iterate() and g_utf8_iterate_back()
Created attachment 161796 [details] [review] Don't let g_utf8_iterate go past the end of the string in tests
This should make tests friendlier to memory checking tools. I'm reasonably confident that the function returns 0 on a null byte; a test could be added specifically for that.
Created attachment 161797 [details] [review] Make g_utf8_iterate() and g_utf8_iterate_back() inline functions
Created attachment 161798 [details] [review] Make the UTF-8 decoding mask explicitly 32 bits wide
The algorithm no longer relies on gunichar being defined as a 32-bit type, so it should now always work properly on 64-bit processors.
Attach one patch please!
I don't like the inline functions. These are nontrivial functions better left as function calls...
There are already non-inline functions, and they are at least two times slower. Also, these new functions were not created inline in the branch; there is a separate patch that does specifically that, and I could easily remove it. But the point of these inlines is that they can be optimized away in loops. If inlining is disabled, you might as well go with the slower g_utf8_get_char().
(In reply to comment #13)
> But the point of these inlines is, they are good to get optimized away in
> loops.
What do you mean by "optimized away"?! And have you done any real-world measurements showing that UTF-8 decoding loops are taking any measurable time?
(In reply to comment #14)
> What do you mean by "optimized away"?!
This means they can be inlined into the calling code, operating on local variables held in registers, rather than emitted as plain function calls. This speeds up code in loops, where UTF-8 iteration is normally used.
> And have you done any real-world measurements showing that UTF-8 decoding loops
> are taking any measurable time?
No. But I remember the Tracker people were interested in optimizing their UTF-8 handling. I'll need to look around...
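To illustrate the point about inlining in loops, here is a minimal sketch of an inline UTF-8 decoder used in a loop. It is not the code from the attached patches: the names utf8_iterate_sketch() and count_code_points_sketch() and the signature are hypothetical, and the sketch assumes valid, NUL-terminated UTF-8 and does no validation. It uses plain C types instead of GLib ones so it stands alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the g_utf8_iterate() from the patches.
 * Decodes the code point at *p and advances *p past it.
 * Returns 0 at the terminating NUL and does not advance past it.
 * Assumes valid UTF-8; performs no validation. */
static inline uint32_t
utf8_iterate_sketch (const char **p)
{
  const unsigned char *s = (const unsigned char *) *p;
  uint32_t c = *s;
  int extra;

  if (c < 0x80)
    {
      /* ASCII, including NUL: only step forward for a real character. */
      if (c != 0)
        *p += 1;
      return c;
    }
  else if ((c & 0xE0) == 0xC0)
    {
      c &= 0x1F;   /* 2-byte sequence: 5 payload bits in the lead byte */
      extra = 1;
    }
  else if ((c & 0xF0) == 0xE0)
    {
      c &= 0x0F;   /* 3-byte sequence: 4 payload bits */
      extra = 2;
    }
  else
    {
      c &= 0x07;   /* 4-byte sequence: 3 payload bits */
      extra = 3;
    }

  s++;
  /* Fold in 6 payload bits per continuation byte; stop early on NUL
   * so a truncated sequence cannot run off the end of the string. */
  while (extra-- > 0 && *s != 0)
    c = (c << 6) | (uint32_t) (*s++ & 0x3F);

  *p = (const char *) s;
  return c;
}

/* Typical loop: when the decoder is inlined, the compiler can keep
 * p, c, n and sum in registers with no call overhead per character. */
static size_t
count_code_points_sketch (const char *str, uint32_t *sum_out)
{
  const char *p = str;
  size_t n = 0;
  uint32_t sum = 0;
  uint32_t c;

  while ((c = utf8_iterate_sketch (&p)) != 0)
    {
      n++;
      sum += c;
    }

  if (sum_out != NULL)
    *sum_out = sum;
  return n;
}
```

Whether this form is measurably faster than calling g_utf8_get_char() and g_utf8_next_char() in the same loop depends on the compiler and the workload, which is the question raised in comment #14.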
Created attachment 162868 [details] [review] All-in-one patch
ping?
I still don't think this is justified. Leaving it up to Matthias to decide.
If claiming performance improvements, describe your application and methodology. Ideally there's some independent method to reproduce, but if it's a proprietary application or something, at least describe the high level issues?
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/302.