Bug 320519 - Replacement for the utf8_skip_data array and g_utf8_next_char macro
Status: RESOLVED NOTABUG
Product: glib
Classification: Platform
Component: general
Version: 2.8.x
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks:
 
 
Reported: 2005-11-02 15:36 UTC by Benjamin Dauvergne
Modified: 2005-11-08 18:42 UTC
See Also:
GNOME target: ---
GNOME version: 2.13/2.14


Attachments
A corrected patch. (1003 bytes, patch)
2005-11-02 17:05 UTC, Benjamin Dauvergne

Description Benjamin Dauvergne 2005-11-02 15:36:16 UTC
Following a blog post by Federico, I implemented this function without a lookup
table:

inline char *g_utf8_next_char (char *p) {
#if __GNUC__ >= 3
# define likely(x)  __builtin_expect (!!(x), 1)
#else
# define likely(x) (x)
#endif
#define u ((unsigned char)*p)
        if (likely(u < 0xc0))
                return p+1;
        if (likely(! (u & 0x20)))
                return p+2;
        if (likely(! (u & 0x10)))
                return p+3;
        if (likely(! (u & 0x8)))
                return p+4;
        if (likely(! (u & 0x4)))
                return p+5;
        if (likely(! (u & 0x2)))
                return p+6;
        return p+1;
#undef u
}
I get a 40% speedup on a simple traversal of an 800 MB UTF-8 file.
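
For comparison, the lookup-table approach that the function above is meant to replace can be sketched roughly as follows. This is an illustration only, not GLib's source: skip_table, init_skip_table and next_char_table are hypothetical names, and the table is filled at run time here instead of being a static array like the utf8_skip_data mentioned in the title.

```c
#include <stddef.h>

/* Hypothetical 256-entry table: sequence length indexed by lead byte. */
static unsigned char skip_table[256];

/* Fill the table with the same lengths the branch-based version returns. */
static void init_skip_table(void)
{
    for (int b = 0; b < 256; b++) {
        if (b < 0xc0)
            skip_table[b] = 1;  /* ASCII and stray continuation bytes */
        else if (b < 0xe0)
            skip_table[b] = 2;
        else if (b < 0xf0)
            skip_table[b] = 3;
        else if (b < 0xf8)
            skip_table[b] = 4;
        else if (b < 0xfc)
            skip_table[b] = 5;
        else if (b < 0xfe)
            skip_table[b] = 6;
        else
            skip_table[b] = 1;  /* 0xfe and 0xff never start a sequence */
    }
}

/* One memory load and one add per character, no branches. */
static const char *next_char_table(const char *p)
{
    return p + skip_table[(unsigned char)*p];
}
```

The table version trades the chain of compares for a memory access, so which one wins depends on cache behaviour and on how well the branches predict for the text at hand.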
Comment 1 Benjamin Dauvergne 2005-11-02 15:38:15 UTC
I should have added a #undef likely at the end.
Comment 2 Matthias Clasen 2005-11-02 15:49:16 UTC
Even better, use G_LIKELY(). 
Also, the u macro doesn't really add anything. 
If the repeated casts bother you, you can use a local variable.
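
The suggestions in this comment might look roughly like the sketch below. It is not the attached patch, which is not reproduced in this report. SKETCH_LIKELY is a stand-in so the block builds without GLib (in GLib code it would be G_LIKELY), and utf8_next_char_sketch is a hypothetical name.

```c
/* Stand-in for G_LIKELY so the sketch compiles without <glib.h>. */
#if defined(__GNUC__) && __GNUC__ >= 3
# define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#else
# define SKETCH_LIKELY(x) (x)
#endif

static inline char *utf8_next_char_sketch(char *p)
{
    /* A local variable replaces the 'u' macro and its repeated casts. */
    const unsigned char u = (unsigned char)*p;

    if (SKETCH_LIKELY(u < 0xc0))
        return p + 1;           /* ASCII or continuation byte */
    if (SKETCH_LIKELY(!(u & 0x20)))
        return p + 2;           /* 110xxxxx */
    if (SKETCH_LIKELY(!(u & 0x10)))
        return p + 3;           /* 1110xxxx */
    if (SKETCH_LIKELY(!(u & 0x08)))
        return p + 4;           /* 11110xxx */
    if (SKETCH_LIKELY(!(u & 0x04)))
        return p + 5;           /* 111110xx */
    if (SKETCH_LIKELY(!(u & 0x02)))
        return p + 6;           /* 1111110x */
    return p + 1;               /* 0xfe or 0xff: invalid lead byte */
}
```

Because the hint macro is defined at file scope with its own name, nothing needs to be undefined inside the function body.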
Comment 3 Benjamin Dauvergne 2005-11-02 17:05:24 UTC
Created attachment 54237
A corrected patch.

This is a patch against the current CVS version with the modifications suggested
by Matthias Clasen.
Comment 4 Matthias Clasen 2005-11-02 17:18:23 UTC
What did said 800 MB file contain?
Was it just ASCII?
What was the distribution of 1- vs 2- vs more-byte chars?

Might be worthwhile to use Federico's po-data collection as a test case for this.
Comment 5 Benjamin Dauvergne 2005-11-03 21:58:41 UTC
The 800 MB file contained the contents of
/usr/share/dasher/training/training_*.txt concatenated many times; the
distribution was 87% 1-byte, 10% 2-byte and 3% longer chars.
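
A distribution like the one quoted above could be measured with a tally along these lines. This is a sketch that assumes well-formed UTF-8 input. count_sequence_lengths is a hypothetical helper, not the tool actually used for these numbers.

```c
#include <stddef.h>

/*
 * Tally characters by encoded length:
 *   counts[0] = 1-byte, counts[1] = 2-byte, counts[2] = 3 bytes or more.
 * Assumes the buffer holds well-formed UTF-8.
 */
static void count_sequence_lengths(const unsigned char *buf, size_t len,
                                   size_t counts[3])
{
    counts[0] = counts[1] = counts[2] = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char u = buf[i];

        if (u < 0x80)
            counts[0]++;        /* ASCII */
        else if (u < 0xc0)
            continue;           /* continuation byte, already counted */
        else if (u < 0xe0)
            counts[1]++;        /* 2-byte lead */
        else
            counts[2]++;        /* 3-byte or longer lead */
    }
}
```

Dividing each count by their sum gives the percentages for the three classes.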
Comment 6 Matthias Clasen 2005-11-04 19:41:27 UTC
Of course, the assumption behind likely(u < 0xc0) is only correct for Western
languages.

It would be interesting to see how this code performs on actual CJK text like
the script samples that can be found in the pango-profile module.
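
To make the concern concrete, here is a rough count of how many tests the branch chain evaluates per character. It is a sketch under stated assumptions: both function names are hypothetical, the 3% from comment 5 is treated as all 3-byte sequences (which the report does not say), and an all-3-byte mix is only an illustrative stand-in for CJK text, not a measurement.

```c
/* Number of conditions the branch chain evaluates for a given lead byte. */
static int tests_for_lead_byte(unsigned char u)
{
    if (u < 0xc0)
        return 1;
    if (!(u & 0x20))
        return 2;
    if (!(u & 0x10))
        return 3;
    if (!(u & 0x08))
        return 4;
    if (!(u & 0x04))
        return 5;
    return 6;   /* 6-byte leads and invalid 0xfe/0xff reach the last test */
}

/*
 * Average tests per character, in hundredths, for a mix given as
 * percentages of 1-, 2- and 3-byte characters (summing to 100).
 */
static int avg_tests_x100(int pct1, int pct2, int pct3)
{
    return pct1 * 1 + pct2 * 2 + pct3 * 3;
}
```

With the 87/10/3 mix this gives 116, i.e. about 1.16 tests per character, while a pure 3-byte mix gives 300, i.e. 3 tests per character, with the first two predicted the wrong way by the hints.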
Comment 7 Behdad Esfahbod 2005-11-08 18:42:29 UTC
Hmm, the consensus, after several people tried out several implementations on
several machines, was not to change the current code.  There's just no solution
that beats it on all architectures for all text.  And that function should not
be called too many times anyway.  Closing.