GNOME Bugzilla – Bug 159131
Optimize g_utf8_validate()
Last modified: 2004-12-22 21:47:04 UTC
Continuation of discussion in bug 159001 Spent some time working on alternate implementations of UTF-8 validations. I'll attach a C file that has the benchmark and various implementations. $ make time-validate cc `pkg-config --libs gtk+-2.0` time-validate.o -o time-validate Versions: strlen: baseline "fast access memory" g_utf8_validate(): current glib version validate: g_utf8_validate() rewritten as a state machine using a state variable validate_no_length: Same thing, special casing max_length == -1 validate_goto: State machine with gotos, special case max_length == -1, special case and optimize length == 2. validate_fake: while (*p) p++; Test input 1: gtkwidget.c (all ASCII) $ ./time-validate ~/cvs/gnome/gtk+/gtk/gtkwidget.c strlen: 0.135053 seconds; 635.677 Mbytes / second g_utf8_validate: 2.81853 seconds; 30.4592 Mbytes / second validate: 0.771455 seconds; 111.283 Mbytes / second validate_no_length: 0.533782 seconds; 160.834 Mbytes / second validate_goto: 0.283989 seconds; 302.301 Mbytes / second validate_fake: 0.199927 seconds; 429.407 Mbytes / second Test input 2: Big mixed language GCOnf schemas file $ ./time-validate /etc/gconf/schemas/desktop_gnome_interface.schemas make: `time-validate' is up to date. strlen: 0.14618 seconds; 637.875 Mbytes / second g_utf8_validate: 2.72126 seconds; 34.2652 Mbytes / second validate: 1.8901 seconds; 49.333 Mbytes / second validate_no_length: 1.59125 seconds; 58.5982 Mbytes / second validate_goto: 0.505775 seconds; 184.36 Mbytes / second validate_fake: 0.215646 seconds; 432.396 Mbytes / second Test input 3, Japanese input file $ iconv -t utf-8 -f euc-jp /usr/share/doc/FreeWnn-1.10pl020/manual/intro > testinput $ ./time-validate ./testinput strlen: 0.043275 seconds; 631.552 Mbytes / second g_utf8_validate: 0.78626 seconds; 34.76 Mbytes / second validate: 0.754884 seconds; 36.2048 Mbytes / second validate_no_length: 0.559099 seconds; 48.8829 Mbytes / second validate_goto: 0.322513 seconds; 84.742 Mbytes / second validate_fake: 0.090102 seconds; 303.327 Mbytes / second What should be clear here is that quite a bit of speedup is possible - the ratio of validate_goto() vs. g_utf8_validate() for these three tests is 9.9, 5.4, 2.4 (!) I suspect it's probably possible to rewrite validate_goto() as a loop-over-characters rather than nest-of-gotos, and still keep much of the speedup. The caveat emptor here is that I haven't tested any of these for accuracy, so it's conceivable that they are doing well simply because they aren't correct... Note the '-mcpu=i686' setting ... it made a huge difference for the strlen() number, apparently because GCC used a better inlined version than what glibc has, and a fairly large difference for some of the other tests. (I recompiled glib, so the g_utf8_number() uses -mcpu=i686 as well)
Created attachment 34047 [details] Test program and new implementations
Here are some timings from my Athlon system at home (compiled using -O2, run on a file containing XML with non-ASCII text) strlen: 31.289 seconds; 363.317 Mbytes / second g_utf8_validate: 177.57 seconds; 64.0192 Mbytes / second validate: 159.668 seconds; 71.1968 Mbytes / second validate_no_length: 142.883 seconds; 79.5606 Mbytes / second validate_goto: 35.7656 seconds; 317.843 Mbytes / second Here is what I can get out of g_utf8_validate() by a) adding a quick if (len == 1) { p++; continue; } b) splitting it into two functions for max_len < 0 and max_len >= 0 c) marking all conditions which indicate invalid utf-8 as G_UNLIKELY g_utf8_validate: 39.2344 seconds; 289.742 Mbytes / second It is not quite as fast as validate_goto, but it comes reasonably close. Since the file I used contains non-ASCII, it is possible that adding another quick continue for len == 2 would speed up this test some more.
Created attachment 34056 [details] [review] the patch
Here is a patch which brings validate() performance close to the performance of the generic strlen() implementation in glibc (there is no hope to match the performance of the -mcpu=i686 tuned version). In addition to the previous patch, I have removed some more branches from the loop body, by folding them into the UTF8_COMPUTE/UTF8_GET macros. Additionally, I removed most of the length 5 and 6 support, since Unicode will not go beyond plane 16, which is still covered by length 4.
Created attachment 34071 [details] [review] new patch
* I don't like using G_LIKELY() for the < 128 case. On some input texts, that branch could be taken very rarely. And on mostly ASCII texts, the processors branch prediction should do very well. * I'd like to leave UTF-8 for non-valid Unicode working for g_utf8_get_char, g_utf8_get_char_extended, g_ucs4_to_utf8, this means we can't actually kill the length 5/6 cases from the standard UTF8_COMPUTE, etc, macros. I'll attach my fastest version, which does a bit better than the versions in the patch above; compared to your utf8_validate() for length == -1 it is 24% faster on the schemas file (195 vs. 156 M/s) and 13% faster on the Japanese (99 vs. 87 M/s). On mostly length-2 UTF-8: grep msgstr ~/cvs/gnome/gtk+/po-properties/ru.po | grep -v \"\" | sed s'/msgstr//' > testinput2 it's 70% faster (127 vs 74 m/s). The two optimizations it has compared to your version are a) len == 2 is special cased, and b) the overlength check is simplified by the observation that you can simply check for length == 3, val < (1 << 11) and for length == 4, val < (1 < 16) to catch overlong UTF-8.
Created attachment 34077 [details] Fast version
Created attachment 34078 [details] integrate max_len and end checking Does this version look correct, as far as max_len and end are concerned ? It will have to be split up again into a no_max and a max version for maximal performance in the no_max case...
Created attachment 34092 [details] the latest version
Created attachment 34093 [details] unit tests for g_utf8_validate()
I have attached the latest version, which is again split up into two variants depending on max_len < 0 or max_len >= 0. I have also attached unit tests for g_utf8_validate() which allowed me to ensure that the old and new implementation behave identical. The test cases are taken from Markus Kuhns Unicode pages, and I added max_len related tests to make sure that all possible exits in the old implementation are tested.
2004-11-24 Matthias Clasen <mclasen@redhat.com> * glib/gutf8.c: Replace g_utf8_validate() with an optimized version, and clarify the docs a bit. (#159131, Owen Taylor)