Bug 320519 – Replacement for the utf8_skip_data array and g_utf8_next_char macro

Summary:	Replacement for the utf8_skip_data array and g_utf8_next_char macro


Status:	RESOLVED NOTABUG

Product:	glib
Classification:	Platform
Component:	general
Version:	2.8.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

Reported:	2005-11-02 15:36 UTC by Benjamin Dauvergne
Modified:	2005-11-08 18:42 UTC

GNOME target:	---
GNOME version:	2.13/2.14

Attachments
A corrected patch. (1003 bytes, patch) 2005-11-02 17:05 UTC, Benjamin Dauvergne

Description Benjamin Dauvergne 2005-11-02 15:36:16 UTC

Following a blog post by Federico, I implemented this function without a lookup
table:

inline char *g_utf8_next_char (char *p) {
#if __GNUC__ >= 3
# define likely(x)  __builtin_expect (!!(x), 1)
#else
# define likely(x) (x)
#endif
#define u ((unsigned char)*p)
        if (likely(u < 0xc0))
                return p+1;
        if (likely(! (u & 0x20)))
                return p+2;
        if (likely(! (u & 0x10)))
                return p+3;
        if (likely(! (u & 0x8)))
                return p+4;
        if (likely(! (u & 0x4)))
                return p+5;
        if (likely(! (u & 0x2)))
                return p+6;
        return p+1;
#undef u
}
I get a 40% speedup on a simple traversal of an 800 MB UTF-8 file.

Comment 1 Benjamin Dauvergne 2005-11-02 15:38:15 UTC

I should have added a #undef likely at the end.

Comment 2 Matthias Clasen 2005-11-02 15:49:16 UTC

Even better, use G_LIKELY().
Also, the u macro doesn't really add anything.
If the repeated casts bother you, you can use a local variable.
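
The two suggestions above could be combined roughly as follows. This is only a sketch, not the attached patch: the function name utf8_next_char_branchy and the LIKELY macro are hypothetical stand-ins (inside glib the macro would be G_LIKELY from <glib.h>), chosen so that the snippet compiles without glib.

```c
/* Stand-in for glib's G_LIKELY so the sketch builds without <glib.h>. */
#if defined(__GNUC__) && __GNUC__ >= 3
# define LIKELY(x) __builtin_expect (!!(x), 1)
#else
# define LIKELY(x) (x)
#endif

/* Hypothetical name; same branch order as the function in the description. */
static inline char *
utf8_next_char_branchy (char *p)
{
  /* A local variable replaces the u macro and its repeated casts. */
  unsigned char c = (unsigned char) *p;

  if (LIKELY (c < 0xc0))        /* ASCII or stray continuation byte */
    return p + 1;
  if (LIKELY (!(c & 0x20)))     /* 0xc0..0xdf: 2-byte sequence */
    return p + 2;
  if (LIKELY (!(c & 0x10)))     /* 0xe0..0xef: 3-byte sequence */
    return p + 3;
  if (LIKELY (!(c & 0x08)))     /* 0xf0..0xf7: 4-byte sequence */
    return p + 4;
  if (LIKELY (!(c & 0x04)))     /* 0xf8..0xfb: 5-byte sequence */
    return p + 5;
  if (LIKELY (!(c & 0x02)))     /* 0xfc..0xfd: 6-byte sequence */
    return p + 6;
  return p + 1;                 /* 0xfe, 0xff: not a valid lead byte */
}
```

Because the macro is no longer defined inside the function body, no #undef is needed at the end.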

Comment 3 Benjamin Dauvergne 2005-11-02 17:05:24 UTC

Created attachment 54237
A corrected patch.

This is a patch against the current CVS version with the modifications suggested
by Matthias Clasen.

Comment 4 Matthias Clasen 2005-11-02 17:18:23 UTC

What did said 800 MB file contain?
Was it just ASCII?
What was the distribution of 1- vs 2- vs more-byte chars?

It might be worthwhile to use Federico's po-data collection as a test case for this.

Comment 5 Benjamin Dauvergne 2005-11-03 21:58:41 UTC

The 800 MB file contained the contents of
/usr/share/dasher/training/training_*.txt concatenated many times; the
distribution was 87% 1-byte, 10% 2-byte, and 3% longer characters.
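
A distribution like this can be measured by classifying lead bytes. The sketch below is one possible way to do it, not the method actually used for the figures above; the helper name utf8_length_histogram is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper. On return:
 *   counts[0] = number of 1-byte characters,
 *   counts[1] = number of 2-byte characters,
 *   counts[2] = number of characters of 3 or more bytes.
 * Assumes buf holds valid UTF-8; continuation bytes are skipped. */
static void
utf8_length_histogram (const unsigned char *buf, size_t len, size_t counts[3])
{
  size_t i;

  counts[0] = counts[1] = counts[2] = 0;
  for (i = 0; i < len; i++)
    {
      unsigned char c = buf[i];

      if (c < 0x80)
        counts[0]++;            /* ASCII */
      else if (c < 0xc0)
        continue;               /* continuation byte, not a character start */
      else if (c < 0xe0)
        counts[1]++;            /* lead byte of a 2-byte sequence */
      else
        counts[2]++;            /* lead byte of a longer sequence */
    }
}
```

Dividing each count by the sum of the three gives the percentages.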

Comment 6 Matthias Clasen 2005-11-04 19:41:27 UTC

Of course, the assumption behind likely(u < 0xc0) is only correct for Western
languages.

It would be interesting to see how this code performs on actual CJK text like
the script samples that can be found in the pango-profile module.

Comment 7 Behdad Esfahbod 2005-11-08 18:42:29 UTC

Hmm, the consensus, after several people tried out several implementations on
several machines, was not to change the current code.  There's just no solution
that beats it on all architectures for all text.  And that function should not
be called too many times anyway.  Closing.
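
For comparison, the table-driven approach named in the bug title can be sketched as below. This is an illustrative reconstruction, not glib's source: the names skip_table, skip_table_init and utf8_next_char_table are hypothetical, and the table is filled at run time here only to keep the example short.

```c
/* Hypothetical 256-entry table: sequence length indexed by lead byte. */
static unsigned char skip_table[256];

/* Fill the table once before use. */
static void
skip_table_init (void)
{
  int b;

  for (b = 0; b < 256; b++)
    {
      if (b < 0xc0)
        skip_table[b] = 1;      /* ASCII or continuation byte */
      else if (b < 0xe0)
        skip_table[b] = 2;
      else if (b < 0xf0)
        skip_table[b] = 3;
      else if (b < 0xf8)
        skip_table[b] = 4;
      else if (b < 0xfc)
        skip_table[b] = 5;
      else if (b < 0xfe)
        skip_table[b] = 6;
      else
        skip_table[b] = 1;      /* 0xfe, 0xff: invalid lead byte */
    }
}

/* One load and one add per character, with no branches. */
static const char *
utf8_next_char_table (const char *p)
{
  return p + skip_table[(unsigned char) *p];
}
```

The trade-off discussed in this bug is between the memory access of the table lookup and the branch prediction behaviour of the chained tests, which depends on both the architecture and the text.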