GNOME Bugzilla – Bug 710320
g-ir-scanner: too much time is spent comparing path names
Last modified: 2015-02-07 16:53:16 UTC
first stab as speeding g-ir-scanner up following shortly...
Created attachment 257445 [details] [review] giscanner: speed up SourceScanner().parse_files() Was looking around a bit and noticed about 2/3 of g-ir-scanner time is spend in SourceScanner().parse_files(). Some profiling quickly most of the time spent in SourceScanner().parse_files() is used by comparing paths in gi_source_scanner_add_symbol(). Fix this by maintaining a hash table of GFile's instead of a list of file names. This makes "g-ir-scanner <whole_bunch_of_options> --output Gtk-3.0.gir" complete in about 10 seconds on my box instead of about 30 seconds (both best of 3 runs).
Review of attachment 257445 [details] [review]: ::: giscanner/giscannermodule.c @@ +363,3 @@ + file = g_file_new_for_path (filename); + g_hash_table_replace (self->scanner->files, file, file); g_hash_table_add() for sets. @@ -516,3 @@ return NULL; - self->scanner->current_filename = g_realpath (filename); We're losing realpath...we need to consider whether or not it's actually necessary. If you don't believe it is, say why in the commit message. See https://bugzilla.gnome.org/show_bug.cgi?id=704864 My inclination is to keep it out of perhaps an excess of conservativsm, but if you want to delete it, it needs a rationale. @@ +525,3 @@ } + file = g_file_new_for_path (filename); + g_hash_table_replace (self->scanner->files, file, file); g_hash_table_add() ::: giscanner/scannerlexer.l @@ +212,3 @@ "L\""{stringtext}*"\"" { return STRING; } +. { if (yytext[0]) fprintf(stderr, "%s:%d: unexpected character `%c'\n", g_file_get_parse_name (scanner->current_file), lineno, yytext[0]); } This leaks the path. @@ +239,3 @@ * starts with one '/' followed by exactly two '*' and not followed by a '/' */ + if (!g_hash_table_lookup (scanner->files, scanner->current_file)) { Prefer g_hash_table_contains() for sets. @@ -324,3 @@ - - compress = g_strcompress (filename); - real = g_realpath (filename); Losing realpath here. ::: giscanner/sourcescanner.c @@ +246,3 @@ g_slist_free (scanner->symbols); + g_hash_table_destroy (scanner->files); I tend to use _unref() in new code. While in practice it's the same here, sometimes it makes sense to hand out references to containers, and _unref supports that. @@ +277,3 @@ + if (g_hash_table_lookup (scanner->files, scanner->current_file)) { + scanner->symbols = g_slist_prepend (scanner->symbols, + gi_source_symbol_ref (symbol)); Er, why not just: if (scanner->macro_scan || g_hash_table_contains (scanner->files, scanner->current_file)) ?
Created attachment 257491 [details] [review] Revert "sourcescanner: Avoid scanning files when doing a macro scan" Will be not needed in about 3 patches...
Created attachment 257492 [details] [review] Revert "sourcescanner: Do some fast path checks on the filename" The "fast" path doesn't seem to work at all. Testing shows creating Gtk-3.0.gir still takes about 30 seconds with this and even on a Linux box...
Created attachment 257494 [details] [review] giscanner: restore linemark filename unescaping Not really relevant to speeding things up, just a small thing I noticed while working on process_linemarks() for the patch that follows...
Created attachment 257495 [details] [review] giscanner: speed up SourceScanner().parse_files() Addresses everything from comment #2 except this. Thought appreciated :) > ::: giscanner/sourcescanner.c > @@ +246,3 @@ > g_slist_free (scanner->symbols); > > + g_hash_table_destroy (scanner->files); > > I tend to use _unref() in new code. While in practice it's the same here, > sometimes it makes sense to hand out references to containers, and _unref > supports that. Hmm, scanner->typedef_table and scanner->struct_or_union_or_enum_table are g_hash_table_destroy()-ed only a couple of lines above, so I thought it would be more consistent to also _destroy() ?
Created attachment 257496 [details] [review] giscanner: remove g_realpath As said in the patch description: needs testing to see if GetFullPathName is really needed on win32, especially in MinGW/MSYS situations. Don't know off-hand if MSYS conversions between forward vs backward slash path separators, /c/ vs c:\ drives, etc are going to break things or not... For all I know GFile simply does the right thing and I'm worrying about nothing here :)
Created attachment 257501 [details] [review] Revert "sourcescanner: Avoid scanning files when doing a macro scan"
Created attachment 257504 [details] [review] Revert "sourcescanner: Do some fast path checks on the filename"
Created attachment 257505 [details] [review] giscanner: restore linemark filename unescaping
Created attachment 257506 [details] [review] giscanner: speed up SourceScanner().parse_files()
Created attachment 257507 [details] [review] giscanner: remove g_realpath
(In reply to comment #6) > Hmm, scanner->typedef_table and scanner->struct_or_union_or_enum_table > are g_hash_table_destroy()-ed only a couple of lines above, so I thought > it would be more consistent to also _destroy() ? g-i is pretty old code; I couldn't see that there was another call to _destroy() from the patch context. I've run into this before with e.g. g_clear_object() too. It's up to you then; I would lean a bit towards using _unref anyways personally.
Review of attachment 257501 [details] [review]: Ok, but it'd be nice if the commit message had the comment you added.
Review of attachment 257504 [details] [review]: Same here, would be nice if the rationale was in the git commit as well.
Review of attachment 257505 [details] [review]: Looks good.
Review of attachment 257506 [details] [review]: Just comment. ::: giscanner/giscannermodule.c @@ +362,3 @@ return NULL; + file = g_file_new_for_path ( g_realpath (filename)); The extra space here is not GNU style (which g-i uses). Though now that I look, this code is fairly inconsistent =(
Review of attachment 257506 [details] [review]: (Just one comment)
Review of attachment 257507 [details] [review]: Mmm...realpath is different from the "canonicalization" that GFile does. realpath(), in particular, resolves symbolic links. GFile does not. The original commit message is: +2008-02-18 Rob Taylor <rob.taylor@codethink.co.uk> + + * tools/grealpath.h: Added: + * tools/scanner.c: (main): + * tools/scannerlexer.l: + * tools/Makefile.am: + Always use absolute paths with symbolic links resolved when + comparing filenames. Sadly though, it doesn't say *why*. This kind of thing is why modern git commit message usage is so much better than the old GNU changelog approach. Let me see if I can ask Rob if he remembers why this was added.
<robtaylor> walters: I think i probably added that for users of symlinked /usr/local <robtaylor> (i used to use stow when developing ) <walters> robtaylor: ok hm, but...we aren't actually scanning files in /usr/local/stow/libfoo-1.0/libfoo.h right, just the files in srcdir <robtaylor> walters: ah, hmm, i guess you could be messed up if your source dir is in a symlinked directory <robtaylor> walters: pass - there's really been too much churn from when i wrote it to see if it's still relevent to the way it's used now =) <walters> robtaylor: yeah understood =) i'll just pass this info on to dieterv and let him decide what to do; if it breaks we can always readd it So...up to you =)
Created attachment 257575 [details] [review] Revert "sourcescanner: Do some fast path checks on the filename" This reverts commit 8c0ca4717d834a6c578579656683c55ea22a06f4. The "fast" path doesn't seem to work at all. Testing shows creating Gtk-3.0.gir still takes about 30 seconds with this (even on a Linux box)...
Created attachment 257576 [details] [review] Revert "sourcescanner: Do some fast path checks on the filename" Please ignore comment #22, git bz stopped doing whatever it should have done again, so here goes with manual patch attaching... Add comment as mentioned in comment #15 and comment #16
Created attachment 257577 [details] [review] Revert "sourcescanner: Avoid scanning files when doing a macro scan" Add comment to commit message as mentioned in comment #14
Created attachment 257578 [details] [review] giscanner: restore linemark filename unescaping
Created attachment 257579 [details] [review] giscanner: speed up SourceScanner().parse_files() (In reply to comment #18) > Review of attachment 257506 [details] [review]: > > Just comment. > > ::: giscanner/giscannermodule.c > @@ +362,3 @@ > return NULL; > > + file = g_file_new_for_path ( g_realpath (filename)); > > The extra space here is not GNU style (which g-i uses). Though now that I > look, this code is fairly inconsistent =( That space in there is a typo, now fixed, thanks for noticing :)
Created attachment 257582 [details] [review] giscanner: remove g_realpath (In reply to comment #21) > <robtaylor> walters: I think i probably added that for users of symlinked > /usr/local > <robtaylor> (i used to use stow when developing ) > <walters> robtaylor: ok hm, but...we aren't actually scanning files in > /usr/local/stow/libfoo-1.0/libfoo.h right, just the files in srcdir > <robtaylor> walters: ah, hmm, i guess you could be messed up if your source dir > is in a symlinked directory > <robtaylor> walters: pass - there's really been too much churn from when i > wrote it to see if it's still relevent to the way it's used now =) > <walters> robtaylor: yeah understood =) i'll just pass this info on to dieterv > and let him decide what to do; if it breaks we can always readd it > > So...up to you =) How about this: calculate real paths only where they enter the system in SourceScanner().parse_files() and SourceScanner().parse_macros() using os.path.realpath() and remove the absolute and real path calculations elsewhere ?
Review of attachment 257582 [details] [review]: Ah, so we just use the Python version on the boundary? Makes sense.
Attachment 257576 [details] pushed as 298e34c - Revert "sourcescanner: Do some fast path checks on the filename" Attachment 257577 [details] pushed as ce1e2a2 - Revert "sourcescanner: Avoid scanning files when doing a macro scan" Attachment 257578 [details] pushed as 69d2fe3 - giscanner: restore linemark filename unescaping Attachment 257579 [details] pushed as 65a0fa4 - giscanner: speed up SourceScanner().parse_files() Attachment 257582 [details] pushed as ebb8050 - giscanner: remove g_realpath
This broke gobject-instrospection when using symlinks. In my case my jhbuild prefix is /home/carlos/gnome which is a symlink to /home/carlos/gnome-<version>. Now my gir files for glib, gio and gobject contain only constants (because iin that case macro_scan is TRUE in gi_source_scanner_add_symbol()). One of the problems is that in SourceScanner.parse_files we are passing real paths to the scanner.append_filename but not to scanner.lex_filename. But that's not enough to fix it, because we are still getting non real paths from the preprocessor, and process_linemarks is setting the non real path as the current file of the scanner.
Created attachment 259726 [details] [review] giscanner: Make sure we use real paths in more places This patch seems to fix the problem, I hope this covers all places were we need to convert paths.
Forgot to reopen it.
hmm, I've just noticed this bug was about reducing the time comparing filenames. I'll use a new bug instead, sorry for the noise.
[Mass-moving gobject-introspection tickets to its own Bugzilla product - see bug 708029. Mass-filter your bugmail for this message: introspection20150207 ]