GNOME Bugzilla – Bug 107138
g_ascii_strcasecmp performance
Last modified: 2011-02-18 15:57:04 UTC
g_ascii_strcasecmp calls g_ascii_tolower entirely too often. (To the tune of 1% of gnumeric's running time in my tree.) Also, one might consider inlining g_ascii_tolower. Patch coming up.
Created attachment 14639 [details] [review] Also do case conversion when needed
I suspect that inlining g_ascii_tolower() would be the necessary first step _before_ considering this.
I considered that, but on closer inspection that idea is not right. With my patch, we spend the time for one comparison of two characters. That should be extremely cheap given that the memory needs to be referenced anyway. When it fails, that is extra work; when it succeeds, it is much less work than two g_ascii_tolower calls (expanded or not). When used in a hash equal function (common, I would expect), there are barely more than one failure per call unless your hash function sucks.
Created attachment 16831 [details] [review] Benchmrk
Attached a benchmark. Corpus: words in /usr/dict/words, munged to be 1/3 lowercase, 1/3 UPPERCASE, 1/3 Mixedcase Test1: Insert all the words in a hash table, then look them all up. Test2: Sort the words with qsort. Results: $ rm -f strcmp-bench && make OPTIMIZE="-O2" strcmp-bench && ./strcmp-bench cc `pkg-config --cflags gtk+-2.0` -g -Wall -O2 `pkg-config --libs gtk+-2.0` strcmp-bench.c -o strcmp-bench 45427 words, 10 iterations Original: Hash, wall: 1.09s user: 1.10s Original: Sort, wall: 1.78s user: 1.78s MortenW : Hash, wall: 0.87s user: 0.90s MortenW : Sort, wall: 1.41s user: 1.40s Inlined : Hash, wall: 0.84s user: 0.83s Inlined : Sort, wall: 0.78s user: 0.77s Both : Hash, wall: 0.81s user: 0.80s Both : Sort, wall: 0.74s user: 0.73s Conclusions: - Both inlining and your optimization are significantly faster than the current code. - As you suggested, your optimization is a bigger win for hash table usage than sorting. - Inlining is a bigger win than your optimization for both tests - Even with inlining, your optimization seems to help a little bit, but the effect is much smaller. What if we decrease the branch penalty by enabling CMOV? $ rm -f strcmp-bench && make OPTIMIZE="-O2 -march=pentiumpro -mcpu=pentiumpro" strcmp-bench && ./strcmp-bench cc `pkg-config --cflags gtk+-2.0` -g -Wall -O2 -march=pentiumpro -mcpu=pentiumpro `pkg-config --libs gtk+-2.0` strcmp-bench.c -o strcmp-bench 45427 words, 10 iterations Original: Hash, wall: 1.04s user: 1.02s Original: Sort, wall: 1.70s user: 1.69s MortenW : Hash, wall: 0.83s user: 0.82s MortenW : Sort, wall: 1.38s user: 1.37s Inlined : Hash, wall: 0.78s user: 0.76s Inlined : Sort, wall: 0.69s user: 0.69s Both : Hash, wall: 0.76s user: 0.73s Both : Sort, wall: 0.71s user: 0.71s With that, your optimization actually slows down the sort test a bit (the numbers above are repeatable, it's not just noise, but the difference is certainly small.) More tests: On a PentiumIV, OPTIMIZE="-O2 -march=pentiumpro -mpcu=pentium4" $ ./strcmp-bench 45427 words, 10 iterations Original: Hash, wall: 0.97s user: 0.96s Original: Sort, wall: 0.87s user: 0.88s MortenW : Hash, wall: 0.91s user: 0.91s MortenW : Sort, wall: 0.75s user: 0.73s Inlined : Hash, wall: 0.87s user: 0.90s Inlined : Sort, wall: 0.54s user: 0.52s Both : Hash, wall: 0.86s user: 0.85s Both : Sort, wall: 0.50s user: 0.50s x86_64, OPTIMIZE="-O2" (disclaimer: this is in no way a benchmark of the processor, the numbers below should be compared only with themselves.) $ ./strcmp-bench 45427 words, 10 iterations Original: Hash, wall: 0.60s user: 0.60s Original: Sort, wall: 0.64s user: 0.63s MortenW : Hash, wall: 0.56s user: 0.56s MortenW : Sort, wall: 0.57s user: 0.56s Inlined : Hash, wall: 0.57s user: 0.60s Inlined : Sort, wall: 0.54s user: 0.54s Both : Hash, wall: 0.53s user: 0.51s Both : Sort, wall: 0.45s user: 0.46s PPC64, OPTIMIZE="-O2" ./strcmp-bench 45427 words, 10 iterations Original: Hash, wall: 0.77s user: 0.76s Original: Sort, wall: 1.18s user: 1.20s MortenW : Hash, wall: 0.68s user: 0.69s MortenW : Sort, wall: 1.04s user: 1.02s Inlined : Hash, wall: 0.65s user: 0.65s Inlined : Sort, wall: 0.79s user: 0.78s Both : Hash, wall: 0.65s user: 0.65s Both : Sort, wall: 0.79s user: 0.79s OK, I'm getting bored now :-) My conclusion here is that we might as well put in the "both" version ... its usually at least as fast as inlined-only version, and occasionally a fair bit faster. Of course, all the above numbers are with gcc-3.2.x - a different compiler might give considerably different results.
I bet I could come up with a benchmark to show whatever I wanted... However, it looks like the trivial inline will catch most benefits that there are, so why not just do that as a smallest-change approach?
The benchmarks here are meant to be representative of typical uses of g_ascii_strcmp() ... clearly they don't cover all possibilities, but I would content that they are *vastly* better than random guessing. (See your comment that inlining should be irrelevant compared to checking for character equality first, for example...)
Went with inlining only; in absence of clear benchmarks, might as well keep things simple. Fri May 30 19:23:47 2003 Owen Taylor <otaylor@redhat.com> * glib/gstrfuncs.c (g_ascii_strncasecmp) * glib/gstrfuncs.c (g_ascii_strcasecmp): Use TOLOWER() macro instead of g_ascii_tolower() (#107138)