Bug 159131 – Optimize g_utf8_validate()

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 159131 - Optimize g_utf8_validate()


Summary:	Optimize g_utf8_validate()


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-11-22 22:59 UTC by Owen Taylor
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Test program and new implementations (6.89 KB, text/plain) 2004-11-22 23:14 UTC, Owen Taylor		Details
the patch (5.04 KB, patch) 2004-11-23 03:47 UTC, Matthias Clasen	none	Details \| Review
new patch (6.36 KB, patch) 2004-11-23 16:35 UTC, Matthias Clasen	none	Details \| Review
Fast version (1.45 KB, text/plain) 2004-11-23 19:53 UTC, Owen Taylor		Details
integrate max_len and end checking (1.76 KB, text/plain) 2004-11-23 20:54 UTC, Matthias Clasen		Details
the latest version (4.04 KB, text/plain) 2004-11-24 15:42 UTC, Matthias Clasen		Details
unit tests for g_utf8_validate() (9.36 KB, text/plain) 2004-11-24 15:43 UTC, Matthias Clasen		Details

Description Owen Taylor 2004-11-22 22:59:58 UTC

Continuation of discussion in bug 159001

Spent some time working on alternate implementations of UTF-8 validations.
I'll attach a C file that has the benchmark and various implementations.

$ make time-validate
cc `pkg-config --libs gtk+-2.0`  time-validate.o   -o time-validate

Versions:
 strlen: baseline "fast access memory"
 g_utf8_validate(): current glib version
 validate: g_utf8_validate() rewritten as a state machine using
           a state variable
 validate_no_length: Same thing, special casing max_length == -1 
 validate_goto: State machine with gotos, special case max_length == -1,
                special case and optimize length == 2.
 validate_fake: while (*p) p++;

Test input 1: gtkwidget.c (all ASCII)

$ ./time-validate ~/cvs/gnome/gtk+/gtk/gtkwidget.c
strlen: 0.135053 seconds; 635.677 Mbytes / second
g_utf8_validate: 2.81853 seconds; 30.4592 Mbytes / second
validate: 0.771455 seconds; 111.283 Mbytes / second
validate_no_length: 0.533782 seconds; 160.834 Mbytes / second
validate_goto: 0.283989 seconds; 302.301 Mbytes / second
validate_fake: 0.199927 seconds; 429.407 Mbytes / second

Test input 2: Big mixed language GCOnf schemas file

$ ./time-validate /etc/gconf/schemas/desktop_gnome_interface.schemas
make: `time-validate' is up to date.
strlen: 0.14618 seconds; 637.875 Mbytes / second
g_utf8_validate: 2.72126 seconds; 34.2652 Mbytes / second
validate: 1.8901 seconds; 49.333 Mbytes / second
validate_no_length: 1.59125 seconds; 58.5982 Mbytes / second
validate_goto: 0.505775 seconds; 184.36 Mbytes / second
validate_fake: 0.215646 seconds; 432.396 Mbytes / second

Test input 3, Japanese input file

$ iconv -t utf-8 -f euc-jp /usr/share/doc/FreeWnn-1.10pl020/manual/intro  >
testinput
$ ./time-validate ./testinput

strlen: 0.043275 seconds; 631.552 Mbytes / second
g_utf8_validate: 0.78626 seconds; 34.76 Mbytes / second
validate: 0.754884 seconds; 36.2048 Mbytes / second
validate_no_length: 0.559099 seconds; 48.8829 Mbytes / second
validate_goto: 0.322513 seconds; 84.742 Mbytes / second
validate_fake: 0.090102 seconds; 303.327 Mbytes / second

What should be clear here is that quite a bit of speedup is possible - 
the ratio of validate_goto() vs. g_utf8_validate() for these three
tests is 9.9, 5.4, 2.4 (!)

I suspect it's probably possible to rewrite validate_goto() as a 
loop-over-characters rather than nest-of-gotos, and still keep much of the 
speedup.

The caveat emptor here is that I haven't tested any of these for
accuracy, so it's conceivable that they are doing well simply
because they aren't correct... Note the '-mcpu=i686' setting ... it
made a huge difference for the strlen() number, apparently because
GCC used a better inlined version than what glibc has, and a fairly large 
difference for some of the other tests. (I recompiled glib, so
the g_utf8_number() uses -mcpu=i686 as well)

Comment 1 Owen Taylor 2004-11-22 23:14:11 UTC

Created attachment 34047 [details]
Test program and new implementations

Comment 2 Matthias Clasen 2004-11-23 03:34:53 UTC

Here are some timings from my Athlon system at home (compiled
using -O2, run on a file containing XML with non-ASCII text)

strlen: 31.289 seconds; 363.317 Mbytes / second
g_utf8_validate: 177.57 seconds; 64.0192 Mbytes / second
validate: 159.668 seconds; 71.1968 Mbytes / second
validate_no_length: 142.883 seconds; 79.5606 Mbytes / second
validate_goto: 35.7656 seconds; 317.843 Mbytes / second

Here is what I can get out of g_utf8_validate() by
a) adding a quick if (len == 1) { p++; continue; }
b) splitting it into two functions for max_len < 0 and max_len >= 0
c) marking all conditions which indicate invalid utf-8 as G_UNLIKELY

g_utf8_validate: 39.2344 seconds; 289.742 Mbytes / second

It is not quite as fast as validate_goto, but it comes reasonably close.
Since the file I used contains non-ASCII, it is possible that adding
another quick continue for len == 2 would speed up this test some more.

Comment 3 Matthias Clasen 2004-11-23 03:47:50 UTC

Created attachment 34056 [details] [review]
the patch

Comment 4 Matthias Clasen 2004-11-23 16:35:13 UTC

Here is a patch which brings validate() performance close to the performance of 
the generic strlen() implementation in glibc (there is no hope to match the
performance of the -mcpu=i686 tuned version). 

In addition to the previous patch, I have removed some more branches from the
loop body, by folding them into the UTF8_COMPUTE/UTF8_GET macros. Additionally,
I removed most of the length 5 and 6 support, since Unicode will not go beyond 
plane 16, which is still covered by length 4.

Comment 5 Matthias Clasen 2004-11-23 16:35:59 UTC

Created attachment 34071 [details] [review]
new patch

Comment 6 Owen Taylor 2004-11-23 19:53:03 UTC

* I don't like using G_LIKELY() for the < 128 case. On some input
  texts, that branch could be taken very rarely. And on mostly
  ASCII texts, the processors branch prediction should do very
  well.

* I'd like to leave UTF-8 for non-valid Unicode working for 
  g_utf8_get_char, g_utf8_get_char_extended, g_ucs4_to_utf8, 
  this means we can't actually kill the length 5/6 cases from
  the standard UTF8_COMPUTE, etc, macros.

I'll attach my fastest version, which does a bit better than
the versions in the patch above; compared to your utf8_validate()
for length == -1 it is 24% faster on the schemas file (195 vs. 156 M/s)
and 13% faster on the Japanese (99 vs. 87 M/s). 

On mostly length-2 UTF-8:

grep msgstr ~/cvs/gnome/gtk+/po-properties/ru.po  | grep -v \"\" | sed
s'/msgstr//' > testinput2

it's 70% faster (127 vs 74 m/s).

The two optimizations it has compared to your version are a) len == 2 
is special cased, and b) the overlength check is simplified by the
observation that you can simply check for length == 3, 
val < (1 << 11) and for length == 4, val < (1 < 16) to catch
overlong UTF-8.

Comment 7 Owen Taylor 2004-11-23 19:53:55 UTC

Created attachment 34077 [details]
Fast version

Comment 8 Matthias Clasen 2004-11-23 20:54:00 UTC

Created attachment 34078 [details]
integrate max_len and end checking

Does this version look correct, as far as max_len and end are concerned ?
It will have to be split up again into a no_max and a max version for maximal
performance in the no_max case...

Comment 9 Matthias Clasen 2004-11-24 15:42:51 UTC

Created attachment 34092 [details]
the latest version

Comment 10 Matthias Clasen 2004-11-24 15:43:30 UTC

Created attachment 34093 [details]
unit tests for g_utf8_validate()

Comment 11 Matthias Clasen 2004-11-24 15:45:34 UTC

I have attached the latest version, which is again split up into two variants
depending on max_len < 0 or max_len >= 0.

I have also attached unit tests for g_utf8_validate() which allowed me to ensure
that the old and new implementation behave identical. The test cases are taken
from Markus Kuhns Unicode pages, and I added max_len related tests to make sure
that all possible exits in the old implementation are tested.

Comment 12 Matthias Clasen 2004-11-24 18:03:36 UTC

2004-11-24  Matthias Clasen  <mclasen@redhat.com>

	* glib/gutf8.c: Replace g_utf8_validate() with an
	optimized version, and clarify the docs a bit.  (#159131,
	Owen Taylor)