After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 415323 - GScanner performance improvements
GScanner performance improvements
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: general
unspecified
Other All
: Normal enhancement
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2007-03-06 14:47 UTC by Charlie Brej
Modified: 2018-05-24 10:59 UTC
See Also:
GNOME target: ---
GNOME version: Unversioned Enhancement


Attachments
Replaces strchr calls searching for chars in cset strings with table lookups (4.87 KB, patch)
2007-03-06 15:32 UTC, Charlie Brej
none Details | Review
Preallocates a larger number of characters to each new gstring (2.11 KB, patch)
2007-03-06 15:40 UTC, Charlie Brej
none Details | Review
Replaces strchr calls searching for chars in cset strings with table lookups (4.93 KB, patch)
2007-03-06 21:07 UTC, Charlie Brej
none Details | Review
A simple benchmark program to measure the performance of GScanner (387 bytes, text/plain)
2007-03-21 14:24 UTC, Charlie Brej
  Details

Description Charlie Brej 2007-03-06 14:47:59 UTC
GScanner is very slow. Here are two patches to make it a lot faster. The cset_table patch doubles the speed and the gstring_preallocate patch insreases the speed by a further 10%.
Comment 1 Charlie Brej 2007-03-06 15:32:17 UTC
Created attachment 84076 [details] [review]
Replaces strchr calls searching for chars in cset strings with table lookups

This patch replaces calls to "strchr" which searches for a character in the cset_* strings, with lookups into arrays. This roughly halves the CPU usage by GScanner. During the g_scanner_new call, cset_* strings are used to fill tables of 256 boolean values stating which characters are allowed. g_scanner_cset_2_table creates new tables from strings and g_scanner_cset_table_lookup replies the presence of a character (encuded as a guint).
Comment 2 Charlie Brej 2007-03-06 15:40:09 UTC
Created attachment 84078 [details] [review]
Preallocates a larger number of characters to each new gstring

This patch replaces the contruction of new strings from g_string_new(NULL) to g_string_sized_new(STRING_NEW_SIZE) where STRING_NEW_SIZE is defined to a reasonable start number (currently 16). This removes majority of the calls to gstring realloc due to gscanner appending strings. This reduces the CPU demand by about 10% (according to valgrind).
Comment 3 Matthias Clasen 2007-03-06 18:34:13 UTC
For the table lookups, I'd probably use a single table and store per-character flags. Then you can combine the double lookups that are done in some cases.

Did you arrive experimentally at the initial size of 16 ? 
There are some instances where the 16 looks very good, like for parsing floats,
but some where a larger value might be better, e.g. for parsing comments.
Comment 4 Charlie Brej 2007-03-06 21:07:31 UTC
Created attachment 84120 [details] [review]
Replaces strchr calls searching for chars in cset strings with table lookups

Uses a single table instead of three. Now using a mask in the table_lookup function (This did shave a few more instructions off). Also moved the mallocing out of the table generation function.
Comment 5 Charlie Brej 2007-03-06 21:25:03 UTC
About the preallocation patch, I did a few runs across some files I had around, and 16 was large enough to remove nearly all calls to gstring realloc calls, and 64 was a bit slower. Unfortunately I don't scan comments (line or block) so I didn't even notice them. It does make sense to scrap STRING_NEW_SIZE and have custom numbers for each type. If I would do some benchmarks they would be very application specific, so a logical stab in the dark should be fine:
COMMENT_MULTI: 512
STRING:32
INT:16
COMMENT_SINGLE:128
IDENTIFIER:32
Comment 6 Matthias Clasen 2007-03-07 04:21:53 UTC
512 for multiline commments looks excessive, 128 is probably sufficient there too. The rest looks fine.
Comment 7 Matthias Clasen 2007-03-15 04:57:31 UTC
2007-03-15  Matthias Clasen  <mclasen@redhat.com>

        * glib/gscanner.[hc]: Some optimizations, use a lookup
        table for character classes, pre-allocate GStrings with
        reasonable sizes.  (#415323, Charlie Brej)

Comment 8 Richard Hult 2007-03-15 18:44:33 UTC
Hm, could this patch be the cause of this error message that has start to appear very recently?

gtkrc-2.0:1: error: invalid keyword `gtk', expected valid keyword - e.g. `style'
-:28: error: unexpected character `-', expected character `='

The file has always worked before and contains:

gtk-icon-theme-name = "gnome"
Comment 9 Christian Persch 2007-03-15 18:58:53 UTC
This might also be responsible for breaking glib-genmarshal (bug 418679).
Comment 10 Matthias Clasen 2007-03-15 19:57:37 UTC
I'll investigate
Comment 11 Matthias Clasen 2007-03-15 20:46:53 UTC
Hmm, I wonder why I'm not seeing this.
Anyway, I think I found the typo that caused this. 
Please try again now.

2007-03-15  Matthias Clasen  <mclasen@redhat.com>

        * glib/gscanner.c (g_scanner_get_token_ll): Fix a typo
        in the last commit.  (#415323, Richard Hult)

Comment 12 Richard Hult 2007-03-15 21:16:17 UTC
Fixes the problem I was seeing, thanks!
Comment 13 Michael Natterer 2007-03-16 09:24:08 UTC
That patch changes the layout of the GScanner struct, it needs
to be redone or reverted. GIMP crashes instantly with this
glib patch applied.
Comment 14 Tim Janik 2007-03-16 09:49:35 UTC
the patch needs to be reverted because it a) breaks ABI by changing GScanner fields. but also, it changes API, with the tables in place the char sets can not anymore be modified during the scanning phase, which is a supported feature of GScanner.
also, the original claim lacks backing:

(In reply to comment #0)
> GScanner is very slow. Here are two patches to make it a lot faster.

what exactly are the test scenarios you're using to make these claims?
and, can you provide a simple benchmark program for glib to backup these statements?
Comment 15 Matthias Clasen 2007-03-16 13:19:19 UTC
Sorry, I had overlooked that field addition.

...the char sets can not anymore be modified during the scanning phase,
which is a supported feature of GScanner...

Hmm, the support for this feature did not seem to go far enough to document it.

Anyway, I'll revert it.
Comment 16 Charlie Brej 2007-03-21 14:24:59 UTC
Created attachment 85044 [details]
A simple benchmark program to measure the performance of GScanner

Compile with something like:
> gcc gscannertest.c  -I/usr/include/glib-2.0/ -I/usr/lib/glib-2.0/include/ -lglib-2.0 -o gscannertest

Takes one command line papamiter (the file to scan). I used a 1Mb SDF file (I could supply it if desired) as a benchmark but any reasonable file will do. To measure the performance I used callgrind 
> valgrind --tool=callgrind ./glibtest router.sdf

Callgrind gives the number of instruction fetches and "time" gives simmilar (but less accurate) results.
Unpatched: 642,689,297
Lookups:   319,427,360
Prealloc:  291,243,375
Comment 17 Charlie Brej 2007-03-21 15:20:34 UTC
First of all sorry for making a bit of a pig's ear of this. Both the patches have typos which I messed up on (confusion due to keeping several patches on the same checkout). 

g_string_append_c should be g_string_append in the preallocate patch:

>@@ -1477,7 +1478,8 @@
> 	  if (token == G_TOKEN_NONE)
> 	    token = G_TOKEN_INT;
> 	  
>-	  gstring = g_string_new (dotted_float ? "0." : "");
>+	  gstring = g_string_sized_new (STRING_NEW_SIZE);
>+	  if (dotted_float) gstring = g_string_append (gstring, "0.");
> 	  gstring = g_string_append_c (gstring, ch);
> 	  
> 	  do /* while (in_number) */

CSET_IDENT_FIRST_FLAG should be CSET_IDENT_NTH_FLAG in the table lookup patch:

>@@ -1658,7 +1686,7 @@
> 		      gstring = g_string_append_c (gstring, ch);
> 		      ch = g_scanner_peek_next_char (scanner);
> 		    }
>-		  while (ch && strchr (config->cset_identifier_nth, ch));
>+		  while (ch && g_scanner_cset_table_lookup (scanner->cset_table, ch, CSET_IDENT_NTH_FLAG));
> 		  ch = 0;
> 		}
> 	      else if (config->scan_identifier_1char)

Secondly. There is an issue of editing the cset_* values after the scanner has been initialised. I didn't realise that was allowed as it seems strange to change the scanner in the middle of a parse. Gimp and nhtsclient seem to change the scanner->config directly but only at creation time (which I personally think should have been done through the g_scanner_new passed template). Gtkrc reader does this in runtime (for parsing some parts) which is impossible to get around. 

So the options are:
1) Change the other projects to not edit the config structure and make it private. (This is probably impossible at this stage)
2) Add a user function which allows them to create tables after they have edited a cset_ (This might add unnecesary coinfusion from the user side). If the function is not called the old strchr will be used during the scan.
3) Throw the table patch away.

The gstring patch should be fine though.
Comment 18 Matthias Clasen 2007-04-11 13:28:00 UTC
Tim, did you have any objections to the prealloc part of this ? 
Comment 19 Tim Janik 2007-04-11 13:48:43 UTC
(In reply to comment #18)
> Tim, did you have any objections to the prealloc part of this ? 

not real objections. i just think they are pointless, afair the GString implementation.
if i see actual benchmark numbers that reproducably benefit from the changes being applied, i'll shut up and apprechiate the patch though ;)
Comment 20 GNOME Infrastructure Team 2018-05-24 10:59:39 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/84.