GNOME Bugzilla – Bug 320519
Replacement for the utf8_skip_data array and g_utf8_next_char macro
Last modified: 2005-11-08 18:42:29 UTC
Following a blog post by Federico, I implemented this function without a lookup table:

inline char *g_utf8_next_char (char *p)
{
#if __GNUC__ >= 3
# define likely(x) __builtin_expect (!!(x), 1)
#else
# define likely(x) (x)
#endif
#define u ((unsigned char)*p)
  if (likely(u < 0xc0)) return p+1;
  if (likely(! (u & 0x20))) return p+2;
  if (likely(! (u & 0x10))) return p+3;
  if (likely(! (u & 0x8))) return p+4;
  if (likely(! (u & 0x4))) return p+5;
  if (likely(! (u & 0x2))) return p+6;
  return p+1;
#undef u
}

I get a 40% speedup on a simple traversal of an 800 MB UTF-8 file.
I should have added a #undef likely at the end.
Even better, use G_LIKELY(). Also, the u macro doesn't really add anything. If the repeated casts bother you, you can use a local variable.
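For illustration, the two suggestions above (G_LIKELY() instead of a private likely macro, and a local variable instead of the u macro) might look roughly like the sketch below. This is not the attached patch: the name utf8_next_char_branchy and the helper utf8_count_chars are hypothetical, and the G_LIKELY fallback is only a stand-in so the sketch compiles without GLib. Like the original, it assumes valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for GLib's G_LIKELY so the sketch builds without <glib.h>.
 * With GLib available, include <glib.h> and drop this fallback. */
#ifndef G_LIKELY
# if defined(__GNUC__) && __GNUC__ >= 3
#  define G_LIKELY(x) (__builtin_expect (!!(x), 1))
# else
#  define G_LIKELY(x) (x)
# endif
#endif

/* Hypothetical name. Returns a pointer to the start of the next character,
 * deciding the length from the lead byte with branches only (no table). */
static inline const char *
utf8_next_char_branchy (const char *p)
{
  const unsigned char u = (unsigned char) *p;  /* one cast, one local */

  if (G_LIKELY (u < 0xc0))     return p + 1;   /* ASCII or continuation byte */
  if (G_LIKELY (!(u & 0x20)))  return p + 2;   /* 110xxxxx */
  if (G_LIKELY (!(u & 0x10)))  return p + 3;   /* 1110xxxx */
  if (G_LIKELY (!(u & 0x08)))  return p + 4;   /* 11110xxx */
  if (G_LIKELY (!(u & 0x04)))  return p + 5;   /* 111110xx */
  if (G_LIKELY (!(u & 0x02)))  return p + 6;   /* 1111110x */
  return p + 1;                                /* 0xfe, 0xff */
}

/* Hypothetical helper: counts characters in a NUL-terminated string.
 * Assumes the input is valid UTF-8; a truncated sequence could step
 * past the terminating NUL. */
static size_t
utf8_count_chars (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  while (*s != '\0')
    {
      s = utf8_next_char_branchy (s);
      n++;
    }
  return n;
}
```

The local variable is read once and the macro needs no #undef, which avoids the cleanup issue mentioned earlier in the thread.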
Created attachment 54237
A corrected patch.

This is a patch against the current CVS version with the modifications suggested by Matthias Clasen.
What did said 800 MB file contain? Was it just ASCII? What was the distribution of 1- vs. 2- vs. more-byte characters? It might be worthwhile to use Federico's po-data collection as a test case for this.
The 800 MB file contained the contents of /usr/share/dasher/training/training_*.txt concatenated many times; the distribution of 1-, 2-, and more-byte characters was 87%, 10%, 3%.
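A distribution like this can be measured with a single pass over the data. The sketch below is one possible way to do it, not the tool that produced the figures above; the struct and function names are hypothetical, and it classifies each character by its lead byte using the same length rules as the function under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many characters fall into each bucket. */
struct utf8_histogram
{
  size_t one_byte;   /* lead byte < 0xc0 (ASCII, or a stray continuation byte) */
  size_t two_byte;   /* lead byte 0xc0..0xdf */
  size_t longer;     /* everything that skips 3 bytes or more */
};

/* Hypothetical helper: walks buf[0..len) and buckets each character by the
 * length implied by its lead byte. The buffer is assumed to hold valid UTF-8. */
static struct utf8_histogram
utf8_length_histogram (const unsigned char *buf, size_t len)
{
  struct utf8_histogram h = { 0, 0, 0 };
  size_t i = 0;

  assert (buf != NULL);
  while (i < len)
    {
      const unsigned char u = buf[i];
      size_t step;

      if (u < 0xc0)      step = 1;
      else if (u < 0xe0) step = 2;
      else if (u < 0xf0) step = 3;
      else if (u < 0xf8) step = 4;
      else if (u < 0xfc) step = 5;
      else if (u < 0xfe) step = 6;
      else               step = 1;   /* 0xfe and 0xff */

      if (step == 1)      h.one_byte++;
      else if (step == 2) h.two_byte++;
      else                h.longer++;

      i += step;
    }
  return h;
}
```

Dividing each field by the sum of the three gives the percentages; the share in one_byte indicates how often a first branch on u < 0xc0 would be taken for that input.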
Of course, the likely(u < 0xc0) assumption is only correct for Western languages. It would be interesting to see how this code performs on actual CJK text, such as the script samples that can be found in the pango-profile module.
Hmm, the consensus, after several people tried out several implementations on several machines, was not to change the current code. There's just no solution that beats it on all architectures for all text. And that function should not be called too many times anyway. Closing.