GNOME Bugzilla – Bug 2742
plug-ins do not load due to signal problems
Last modified: 2009-08-15 18:40:50 UTC
Package: gimp
Version: 1.1.10

Name........: angel li
Email.......: angel@miami.edu
Platform....: Compaq Alpha, Digital Unix
GIMP Version: 1.1.10
GTK Version.: 1.2.6

-- Other system notes:
--
-- Problem description:
None of the plug-ins work. When gimp tries to initialize them, they get a wire_read error.
--
-- How to repeat:
--
-- Other comments:
--

------- Additional Comments From gosgood@idt.net 2000-03-12 08:23:50 ----

Subject: Earlier Manifestation of #6050
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38CB9A66.6B0DB270@idt.net>
Date: Sun, 12 Mar 2000 08:23:50 -0500

Not much detail in this report. Perhaps angel@miami.edu would care to confirm that the Digital Tru64 4.0F C compiler (cc -V returns DEC C V5.9-008 on Digital UNIX V4.0 (Rev. 1229) or some such) was used in the compile, as I suspect may have been the case.

G. R. Osgood.

------- Additional Comments From gosgood@idt.net 2000-03-13 19:16:03 ----

Subject: [Fwd: Re: Still having trouble building gimp plug-ins?]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38CD84C3.77E245FC@idt.net>
Date: Mon, 13 Mar 2000 19:16:03 -0500

All,

Angel Li <angel@rrsl.rsmas.miami.edu>, who originated #2742 (which I merged with #6050), performed builds with DEC 4.0D and the newest 5.0 with success, but continued to fail with 4.0F; Angel is going to try a patched version of 4.0F to see if that succeeds. Methinks this can be classed as a compiler problem and not a Gimp problem, but let's keep these open for a little while yet.

Be good, be well
Garry Osgood

-------- Original Message --------
Received: by u1.farm.idt.net for gosgood (with pop daemon (v1.21 1997/08/10) Mon Mar 13 18:29:25 2000)
X-From_: angel@rrsl.rsmas.miami.edu Mon Mar 13 17:15:00 2000
Received: from mail-relay3.idt.net (MAIL-RELAY3.IDT.NET [169.132.8.27]) by u3.farm.idt.net (8.9.3/8.9.3) with ESMTP id RAA19319 for <gosgood@idt.net>; Mon, 13 Mar 2000 17:14:59 -0500 (EST)
Received: from avocado.rrsl.rsmas.miami.edu (avocado.rsmas.miami.edu [129.171.98.122]) by mail-relay3.idt.net (8.9.3/8.9.3) with ESMTP id RAA06505 for <gosgood@idt.net>; Mon, 13 Mar 2000 17:14:59 -0500 (EST)
Received: (from mailer@localhost) by avocado.rrsl.rsmas.miami.edu (8.8.8/8.7.3) id RAA04887 for <gosgood@idt.net>; Mon, 13 Mar 2000 17:14:58 -0500 (EST)
Received: from mombin.rrsl.rsmas.miami.edu (192.168.1.30) by avocado.rrsl.rsmas.miami.edu via smap (V2.0+anti-relay+anti-spam) id xma000933; Mon, 13 Mar 00 17:14:55 -0500
Received: from flipper-a.rrsl.rsmas.miami.edu by mombin.rrsl.rsmas.miami.edu (8.8.8/1.1.10.5/10Jan97-1049AM) id RAA14170; Mon, 13 Mar 2000 17:14:54 -0500 (EST)
Date: Mon, 13 Mar 2000 17:14:54 -0500 (EST)
From: Angel Li <angel@rrsl.rsmas.miami.edu>
To: "Garry R. Osgood" <gosgood@idt.net>
Subject: Re: Still having trouble building gimp plug-ins?
In-Reply-To: <38CBDC06.D34C9FC@idt.net>
Message-ID: <Pine.OSF.4.21.0003131712210.18585-100000@flipper.rrsl.rsmas.miami.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Mozilla-Status: 8011
X-Mozilla-Status2: 00000000
X-UIDL: e891d71989090000

On Sun, 12 Mar 2000, Garry R. Osgood wrote:

> Hi
>
> This is about a gimp bug report
> you sent in some time ago. People
> using DEC compilers on Alphas have
> had similar problems building
> plug-ins. Could you check bug #6050
> (see below) and confirm if the
> compiler matches your build
> environment?
>
> Are you still having trouble
> building plug-ins?
>
> If so, the workaround may be
> to use an earlier compiler
> version (search on SHIRASAKI Yasuhiro
> below).
>
> Thanks in advance for your feedback.
>

Hi,

I did a build with a previous version of the compiler and plugins work! I also did a build with the compiler that's bundled with the newest version of the OS and it also worked. To summarize,

Digital Unix version 4.0D is OK
Digital Unix version 4.0F is not OK
Digital Unix version 5.0 is OK

Some patches just came out for 4.0F. I'll report back if they fix the compiler.

Angel

------- Additional Comments From gosgood@idt.net 2000-04-09 21:25:26 ----

Subject: [Fwd: Re: wire read: error: found it!]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38F12D86.8A5C7049@idt.net>
Date: Sun, 09 Apr 2000 21:25:26 -0400

FYI
Tim Mooney's test of Austin Donnelly's patch -- GRO

-------- Original Message --------
Received: by u3.farm.idt.net for gosgood (with pop daemon (v1.21 1997/08/10) Sun Apr 9 20:35:21 2000)
X-From_: mooney@dogbert.cc.ndsu.nodak.edu Sun Apr 9 19:26:21 2000
Received: from mail-relay4.idt.net (MAIL-RELAY4.IDT.NET [169.132.8.88]) by u1.farm.idt.net (8.9.3/8.9.3) with ESMTP id TAA20318 for <gosgood@idt.net>; Sun, 9 Apr 2000 19:26:20 -0400 (EDT)
Received: from dogbert.cc.ndsu.nodak.edu (dogbert.cc.ndsu.NoDak.edu [134.129.106.23]) by mail-relay4.idt.net (8.9.3/8.9.3) with ESMTP id TAA23070 for <gosgood@idt.net>; Sun, 9 Apr 2000 19:26:19 -0400 (EDT)
Received: from localhost (mooney@localhost) by dogbert.cc.ndsu.nodak.edu (8.9.3/8.9.1) with ESMTP id SAA16996; Sun, 9 Apr 2000 18:26:19 -0500 (CDT)
Date: Sun, 9 Apr 2000 18:26:18 -0500 (CDT)
From: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>
To: "Garry R. Osgood" <gosgood@idt.net>
cc: Austin Donnelly <Austin.Donnelly@cl.cam.ac.uk>
Subject: Re: wire read: error: found it!
In-Reply-To: <38EE9C73.3D858DA1@idt.net> Message-ID: <Pine.OSF.4.21.0004082223440.881-100000@dogbert.cc.ndsu.nodak.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Mozilla-Status: 8011 X-Mozilla-Status2: 00000000 X-UIDL: 66733fe761130000 In regard to: Re: wire read: error: found it!, Garry R. Osgood said (at...: >> Looks like a g_io_channel_read() is returning EINTR from a SIGCHLD. >> The SIGCHLD is probably because the plugin died. >> >> On OSF/1, it looks like signal() doesn't install restarting signal >> handlers. We should _really_ be using sigaction(2) since this solves >> the problem in a portable manner. I've been thinking about this some, and doing some reading. I'm frankly surprised that this problem hasn't been reported for more platforms -- Solaris and HP-UX both have `signal' functions that are SysV-based, so they're even more dangerous than the signal() on systems with a signal() that is BSD-like (like OSF/1 / Tru64). I also spent some time looking at the man page for signal(2) on Tru64, and although the wording is a little murky, Austin is definitely correct that the default signal(2) semantics are the BSD-like signal() *without* restarting system calls that were interrupted. So far I have tested Austin's patch on: alpha-dec-osf4.0d (+ patch kit #6) alpha-dec-osf4.0f (+ patch kit #2) alpha-dec-osf4.0f (+ patch kit #3) alphaev56-dec-osf4.0f (+ patch kit #3) alpha-dec-osf5.0 (+ patch kit #1) In all cases, the patch greatly improves the situation. Where before on Tru64 Unix there would be anywhere from a few to all of the plug-ins erroring out when initially queried, now *none* of them do. On my work desktop machine, which is the alphaev56-dec-osf4.0f listed above, I do still get a long hang followed by a segv from extension_script_fu: /local/gnu/lib/X11/gimp/1.1/plug-ins/script-fu: Segmentation fault caught /local/gnu/lib/X11/gimp/1.1/plug-ins/script-fu (pid:1415): [E]xit, [H]alt, show [S]tack trace or [P]roceed: S
+ Trace 3240
This doesn't happen on any of the other machines I tested on, only my workstation. I don't think it's related to the issue Austin's fixed. The segv is happening in the script_fu_find_scripts() (I think it's happening in the repl_c_string() routine, but I'm not sure yet) procedure called from script-fu's run(). I planned to test 1.1.19 with the patch on powerpc-ibm-aix4.3.2.0, sparc-sun-solaris2.6, sparc-sun-solaris2.7, and hppa1.1-hp-hpux10.20, but so far 1.1.19 has had compile problems on aix and hpux, so I don't know what effect the patch has, if any. I skipped testing on IRIX, being you said you were going to do that Garry. At this point, I can definitely say that the patch fixes the problem on Tru64 Unix. Should other instances of signal() in the gimp source base be stamped out? Should a configure test be written (or stolen, possibly from bash) that checks to make sure that the system has the necessary sigaction support? Every place I've checked has the SA_RESTART flag and the sa_handler member, but neither are specified by POSIX so there may be some system out there that doesn't have them. I would be happy to help with the configure test if people think it should be implemented. Should hook_signal be placed in its own file, and named something like gimp_os_signal(), so that it can be linked into both the app and libgimp, and provided for plug-in writers to use as an abstraction to the OS's signal mechanism? Just trying to make sure we've covered all the bases... Thanks again for you work, Tim -- Tim Mooney mooney@dogbert.cc.ndsu.NoDak.edu Information Technology Services (701) 231-1076 (Voice) Room 242-J1, IACC Building (701) 231-8541 (Fax) North Dakota State University, Fargo, ND 58105-5164 ------- Additional Comments From gosgood@idt.net 2000-04-09 21:25:30 ---- Subject: [Fwd: wire read: error: found it!] From: "Garry R. 
Osgood" <gosgood@idt.net> To: 2742@bugs.gnome.org Message-Id: <38F12D8A.42593CEB@idt.net> Date: Sun, 09 Apr 2000 21:25:30 -0400 Austin Donnelly isolated a probable cause and proposes a patch. Tim Mooney to test - GRO -------- Original Message -------- Received: by u3.farm.idt.net for gosgood(with pop daemon (v1.21 1997/08/10) Fri Apr 7 21:42:08 2000) X-From_: austin.donnelly@cl.cam.ac.uk Fri Apr 7 13:15:37 2000 Received: from mail-relay4.idt.net (MAIL-RELAY4.IDT.NET [169.132.8.88])by u2.farm.idt.net (8.9.3/8.9.3) with ESMTP id NAA06347for <gosgood@idt.net>; Fri, 7 Apr 2000 13:15:13 -0400 (EDT) Received: from wisbech.cl.cam.ac.uk (exim@mta1.cl.cam.ac.uk [128.232.0.15])by mail-relay4.idt.net (8.9.3/8.9.3) with ESMTP id NAA24508for <gosgood@idt.net>; Fri, 7 Apr 2000 13:15:12 -0400 (EDT) Received: from hornet.cl.cam.ac.uk ([128.232.8.3] ident=exim)by wisbech.cl.cam.ac.uk with esmtp (Exim 3.092 #1)id 12dcLU-0002GA-00for gosgood@idt.net; Fri, 07 Apr 2000 18:15:00 +0100 Received: from and1000 by hornet.cl.cam.ac.uk with local (Exim 3.01 #1)id 12dcLT-00089R-00for gosgood@idt.net; Fri, 07 Apr 2000 18:14:59 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14574.6033.473500.26028@hornet.cl.cam.ac.uk> Date: Fri, 7 Apr 2000 18:14:57 +0100 (BST) From: Austin Donnelly <Austin.Donnelly@cl.cam.ac.uk> To: "Garry R. Osgood" <gosgood@idt.net> Subject: wire read: error: found it! 
In-Reply-To: <38ED34C2.5A05EA7A@idt.net>
References: <Pine.OSF.4.21.0004041728180.32587-100000@dogbert.cc.ndsu.nodak.edu> <38EA9379.E58DF582@idt.net> <14571.7219.908672.263324@hornet.cl.cam.ac.uk> <14572.42492.89013.72562@hornet.cl.cam.ac.uk> <14572.47055.189071.417615@hornet.cl.cam.ac.uk> <38ED34C2.5A05EA7A@idt.net>
X-Mailer: VM 6.75 under Emacs 20.6.1
Sender: Austin Donnelly <Austin.Donnelly@cl.cam.ac.uk>
X-Mozilla-Status: 8003
X-Mozilla-Status2: 00000000
X-UIDL: 9f14909777150000

Race is between:

static void
plug_in_query (gchar     *filename,
               PlugInDef *plug_in_def)
{
  PlugIn *plug_in;
  WireMessage msg;

  plug_in = plug_in_new (filename);
  if (plug_in)
    {
      plug_in->query = TRUE;
      plug_in->synchronous = TRUE;
      plug_in->user_data = plug_in_def;

      if (plug_in_open (plug_in))      <<<--- this
        {
          plug_in_push (plug_in);

          while (plug_in->open)
            {
              if (!wire_read_msg (current_readchannel, &msg))   <<<-- this
                plug_in_close (current_plug_in, TRUE);
              else
                {
                  plug_in_handle_message (&msg);
                  wire_destroy (&msg);
                }
            }

          plug_in_pop ();
          plug_in_destroy (plug_in);
        }
    }
}

Looks like a g_io_channel_read() is returning EINTR from a SIGCHLD. The SIGCHLD is probably because the plugin died.

On OSF/1, it looks like signal() doesn't install restarting signal handlers. We should _really_ be using sigaction(2) since this solves the problem in a portable manner.

I've got a patch against 1.1.18 that fixes this: please try it out and check it in if it looks ok. I've lightly tested it on Linux ix86 Red Hat 6.1.92, and OSF1 3.2D.

I'll leave you close the relevant bug reports.

Thanks,
Austin

------------------------------------------------------------
--- main.c~ Wed Feb 23 20:25:23 2000
+++ main.c Fri Apr 7 18:05:29 2000
@@ -86,6 +86,35 @@
 static gint gimp_argc = 0;
 static gchar **gimp_argv = NULL;
 
+
+/* hook_signal: Cause handler to be run when signum is delivered. We
+ * use sigaction(2) rather than signal(2) so that we can control the
+ * signal hander's environment completely: some signal(2)
+ * implementations differ in their sematics, so we need to nail down
+ * exactly what we want. */
+static void
+hook_signal (int signum, RETSIGTYPE (*handler)(int))
+{
+  int ret;
+  struct sigaction sa;
+
+  sa.sa_handler = handler;
+  sa.sa_sigaction = NULL;
+
+  /* Mask all signals while handler runs to avoid re-entrancy
+   * problems. */
+  sigfillset (&sa.sa_mask);
+
+  /* Must restart syscalls else get EINTR on g_io_channel_read()
+   * occasionally. */
+  sa.sa_flags = SA_RESTART;
+
+  ret = sigaction (signum, &sa, NULL);
+  if (ret < 0)
+    gimp_fatal_error ("unable to hook signal %d\n", signum);
+}
+
+
 /*
  * argv processing:
  *   Arguments are either switches, their associated
@@ -104,8 +133,7 @@
  *
  *   The exception is the batch switch. When this is
  *   encountered, all remaining args are treated as batch
- *   commands.
- */
+ *   commands. */
 
 int
 main (int argc,
@@ -325,36 +353,36 @@
   /* Handle some signals */
 #ifdef SIGHUP
-  signal (SIGHUP, on_signal);
+  hook_signal (SIGHUP, on_signal);
 #endif
 #ifdef SIGINT
-  signal (SIGINT, on_signal);
+  hook_signal (SIGINT, on_signal);
 #endif
 #ifdef SIGQUIT
-  signal (SIGQUIT, on_signal);
+  hook_signal (SIGQUIT, on_signal);
 #endif
 #ifdef SIGABRT
-  signal (SIGABRT, on_signal);
+  hook_signal (SIGABRT, on_signal);
 #endif
 #ifdef SIGBUS
-  signal (SIGBUS, on_signal);
+  hook_signal (SIGBUS, on_signal);
 #endif
 #ifdef SIGSEGV
-  signal (SIGSEGV, on_signal);
+  hook_signal (SIGSEGV, on_signal);
 #endif
 #ifdef SIGPIPE
-  signal (SIGPIPE, on_signal);
+  hook_signal (SIGPIPE, on_signal);
 #endif
 #ifdef SIGTERM
-  signal (SIGTERM, on_signal);
+  hook_signal (SIGTERM, on_signal);
 #endif
 #ifdef SIGFPE
-  signal (SIGFPE, on_signal);
+  hook_signal (SIGFPE, on_signal);
 #endif
 #ifdef SIGCHLD
   /* Handle child exits */
-  signal (SIGCHLD, on_sig_child);
+  hook_signal (SIGCHLD, on_sig_child);
 #endif
 #endif
------------------------------------------------------------ ------- Additional Comments From gosgood@idt.net 2000-04-15 21:41:32 ---- Subject: [Fwd: Re: wire read: error: found it!] From: "Garry R. Osgood" <gosgood@idt.net> To: 2742@bugs.gnome.org Message-Id: <38F91A4C.5BBA35C0@idt.net> Date: Sat, 15 Apr 2000 21:41:32 -0400 FYI to bug report #2742 GR Osgood. -------- Original Message -------- X-Mozilla-Status: 0001 X-Mozilla-Status2: 00000000 Message-ID: <38F9199B.B7C82043@idt.net> Date: Sat, 15 Apr 2000 21:38:35 -0400 From: "Garry R. Osgood" <gosgood@idt.net> X-Mailer: Mozilla 4.51C-SGI [en] (X11; I; IRIX 6.5 IP20) X-Accept-Language: en, zh-TW MIME-Version: 1.0 To: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>,Sven Neumann <neumanns@uni-duesseldorf.de>,Michael Natterer <mitch@gimp.org>, Tor Lillqvist <tml@iki.fi> CC: Austin Donnelly <austin@gimp.org> Subject: Re: wire read: error: found it! References: <Pine.OSF.4.21.0004082223440.881-100000@dogbert.cc.ndsu.nodak.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sven, Mitch, I'm cc-ing you on this for comment; this concerns Bug # 2742 and a patch Austin Donnelly wrote that addresses the issue. The patch itself was posted to #2742@bugs.gimp.org; it modifies the signal environment of Spec 1170 compliant POSIX boxes so that if a signal arrives at a gimp process while it is in a system routine, that routine will "restart" following signal handling (if the process survives signal handling, that is), and not return -1 (error) and setting errno = EINTR (default POSIX behavior). In particular with #2742, when an "out-of-box" gimp is probing plug-in query() methods, it is launching many child processes which may persist somewhat as defuncts/zombies before the O/S reaps them and signals gimp with SIGCLD. Concurrently, gimp makes a lot of calls to the system read() function to pull bytes from the pipe connecting gimp to the currently active child. 
It appears that on OSF/1 the coincidence of gimp being in read() when SIGCLD arrives is quite high, giving rise to the behaviour reported in #2742. Austin's patch makes use of a hook_signal() to set up the signal environment; Following Tim Mooney's observations after he tested the patch on DEC OSF/1 boxes, methinks he is right in suggesting that it be promoted to a libgimp function so that both core and plug-ins have the same abstraction of the signal mechanism. Comments? Tor, I hope you can review this as well; I am laboring under the happy illusion that if all of these code modifications are wrapped in #ifndef G_OS_WIN32 /* UNIX-SOLARIS-LINUX-IRIX-OSF/1 signal action stuff */ #endif then the Windows versions will not be affected. Am I right? Tim, Austin (who's wandering around in Welsh hills, but will read this eventually), Appropriately, on the anniversary of the sinking of the Titanic. I've found time to step through the unpatched gimp investigating #2742 and Austin's patched version. I've got a comfortable idea what's going on and how Austin's patch fixes it. It mostly concurs with and confirms Austin's reasoning. [Austin: pondering plug_in_query() on or about April 6] > > >> Looks like a g_io_channel_read() is returning EINTR from a SIGCHLD. > >> The SIGCHLD is probably because the plugin died. > >> I've reproduced the condition - or a condition remarkably like it - on the SGI. (0) Context is line 2316, plug_in _query(), plug_in.c-CVS-1.98 [April 11, 2000] wire_read_msg() has been called and that bottoms out to g_io_channel_read(). plug_in_query() is getting bytes from the n-th plug-in that had been fork()-ed in plug_in_open(). (1) g_io_channel_read() enters the system call read(2) to get the child process bytes on the pipe (if any). (2) Concurrently, and asynchronously, the n-th plug-in is executing it's query() method at some juncture it may deposit bytes in the pipe. (3) Also concurrently, the processes for plug-ins n-1, n-2, ... 
1 (no telling how many) have invoked exit() and are in various stages of cleanup or are defunct and awaiting reaping by the O/S, which may take its own sweet time. They are in the process table. They are zombies. (4) At some juncture while the gimp process is in any number of read() system calls, the O/S gets a time quantum and uses it to clean up some defunct plug-in child process. It sends the gimp process a SIGCLD (signal 18). (5) What happens when a process is delivered a signal and it is in a system call? According to [Robbins, Robbins], on a POSIX.1 compliant O/S, a "slow" system call such as read() is to follow this policy: "Fail. Return -1. set errno to EINTR." This concurs with the observation Austin Donnelly made at the exit of g_io_channel_read(). I remark here that the SIGCLD need not necessarily map to the n-th plug-in on the other side of the wire; that may be functioning in an orderly manner. Any one of (or more than one of) the plug-ins launched prior to the n-th one may still be around and exiting in an orderly way; any number of these may be waiting to be reaped by the O/S at its leisure. When the O/S does so, and the gimp process gets SIGCLD in the system read() call, that call fails and returns -1, setting errno to EINTR. (6) When read() returns, g_io_channel_read() returns G_IO_ERROR_UNKOWN, and bytes_read value of 0. (See g_io_unix_read(), line 159, glib-1.2.7/giounix.c and the caller, wire_read() maps this to a g_warning() "wire_read: error()" which observers of bug #2742 have been reporting. (7) This sequence of events can be reproduced on the SGI by artfully manipulating plug-in processes with a debugger. I surmise that the prerequisite timing arises more naturally under OSF/1 and for a number of plausible cases: there are dozens of invocations of read() for every invocation of plug_in_query(). A little bit of latency in a read() implementation can create opportunities for the gimp process to be in system code. 
Likewise, zombie plug-ins may linger on OSF/1 longer. Austin also said on or around April 6. > >> On OSF/1, it looks like signal() doesn't install restarting signal > >> handlers. We should _really_ be using sigaction(2) since this solves > >> the problem in a portable manner. This is the case with SGI as well. Austin's solution was to use the POSIX sigaction(2) call and pass in struct sigaction::sa_flags |= SA_RESTART and that works here as well; then calls to read(), if interrupted, "rewind" and are attempted again. This is transparent to gimp code: read() simply works "through" the SIGCLD signal. But as Tim observed, SA_RESTART is not guaranteed to be supported on a POSIX platform. According to [Robbins,Robbins] this flag is a SPEC 1170 extension. > Tim Mooney observed: > IShould other instances of signal() in the gimp source base be stamped out? > Should a configure test be written (or stolen, possibly from bash) that > checks to make sure that the system has the necessary sigaction support? > Every place I've checked has the SA_RESTART flag and the sa_handler member, > but neither are specified by POSIX so there may be some system out there > that doesn't have them. I would be happy to help with the configure test > if people think it should be implemented. These are correct things to do in principle, IMHO. The configure test would have to do some sort of 'POSIX capability probe' to determine if the box's POSIX implementation supports SPEC 1170. I'm not real smart about configure issues; I'm not sure how to write it in GNU configure-ese. If you could help here, that would be useful. But - in practice - I don't think we have to worry much. ,The condition seems to be very timing dependent, observed consistently so far just in OSF/1, and, happily, that O/S implements Spec 1170. Up to now, the set of O/S'es that exhibit the behavior and are not Spec 1170 seems to be null. 
Closure does call for an appropriate configure test, but conditions aren't insisting on Closure Right At This Instant!!!! > Should hook_signal be placed in its own file, and named something > like gimp_os_signal(), so that it can be > linked into both the app and libgimp, and provided for plug-in writers to use > as an abstraction to the OS's signal mechanism? I agree on this. plug-ins fork processes as well (gz comes to mind) and can face this issue in principle as well. They should have the exact same abstraction of establishing signal handlers and managing the signal environment. Maybe gimp_sigaction(), because that what it is extruding (POSIX sigaction()) but its implementation would be essentially Austin's hook_signal() I'm at the code relocation and retest phase, with a check-in around midweek unless somebody (everybody) tells me I'm crazy. Tim, I trust you will test the commit sometimes soon after. Be good, all, be well Garry Osgood Ref [Robbins, Robbins] Robbins, Kay A; Robbins, Steven "Practical UNIX Programming" A Guide to Concurrency, Communication, and Multithreading" 1996 Prentice-Hall Inc ISBN 0-13-443706-3 See pp 188: "5.6 System Calls and Signals" ------- Additional Comments From mitschel@cs.tu-berlin.de 2000-04-16 08:02:49 ---- Subject: [Fwd: Re: wire read: error: found it!] From: Michael Natterer <mitschel@cs.tu-berlin.de> To: 2742@bugs.gnome.org Message-Id: <38F9ABE9.5AD900A2@cs.tu-berlin.de> Date: Sun, 16 Apr 2000 14:02:49 +0200 3rd try to forward it to bugs.gnome.org :) -------- Original Message -------- Subject: Re: wire read: error: found it! Date: Sun, 16 Apr 2000 13:42:11 +0200 From: Michael Natterer <mitschel@cs.tu-berlin.de> To: "Garry R. 
Osgood" <gosgood@idt.net> CC: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>,Sven Neumann <neumanns@uni-duesseldorf.de>,Michael Natterer <mitch@gimp.org>, Tor Lillqvist <tml@iki.fi>,Austin Donnelly <austin@gimp.org> References: <Pine.OSF.4.21.0004082223440.881-100000@dogbert.cc.ndsu.nodak.edu> <38F9199B.B7C82043@idt.net> "Garry R. Osgood" wrote: > > I'm cc-ing you on this for comment; this concerns Bug # 2742 > and a patch Austin Donnelly wrote that addresses the issue. The > patch itself was posted to #2742@bugs.gimp.org; it modifies > the signal environment of Spec 1170 compliant POSIX boxes > so that if a signal arrives at a gimp process while it is in a system > routine, that routine will "restart" following signal handling > (if the process survives signal handling, that is), and not return > -1 (error) and setting errno = EINTR (default POSIX behavior). Yes, this is how I understand POSIX signals to work. > In particular with #2742, when an "out-of-box" gimp is > probing plug-in query() methods, it is launching many child > processes which may persist somewhat as defuncts/zombies > before the O/S reaps them and signals gimp with SIGCLD. > Concurrently, gimp makes a lot of calls to the system read() > function to pull bytes from the pipe connecting gimp to > the currently active child. It appears that on OSF/1 the > coincidence of gimp being in read() when SIGCLD arrives > is quite high, giving rise to the behaviour reported in > #2742. I'm quite qure that this is not only an OSF/1 issue but can occur with all UNIX variants out there. The reason why most people don't get these errors might be that esp. Linux behaves _very_ programmer friendly in regard to signals. (Well, basically it should do the same, but my theory is that it does magic things to minimize signals interrupting system calls. And yes, this is just a theory :) ) (...) Your analysis seems to reflect exactly what happens... > Austin also said on or around April 6. 
> > > >> On OSF/1, it looks like signal() doesn't install restarting signal > > >> handlers. We should _really_ be using sigaction(2) since this solves > > >> the problem in a portable manner. > > This is the case with SGI as well. Austin's solution was to use the POSIX sigaction(2) > call and pass in struct sigaction::sa_flags |= SA_RESTART and that works here as > well; then calls to read(), if interrupted, "rewind" and are attempted again. This is > transparent to gimp code: read() simply works "through" the SIGCLD signal. > But as Tim observed, SA_RESTART is not guaranteed to be supported on a POSIX > platform. According to [Robbins,Robbins] this flag is a SPEC 1170 extension. Oh yes (YES!!), strange enough, I'm teaching UNIX to students for 3 years now and the primary goal when it comes to signals is teaching them: "use sigaction() instead of signal()" -- I should have noticed this before :-) > > Tim Mooney observed: > > > IShould other instances of signal() in the gimp source base be stamped out? > > Should a configure test be written (or stolen, possibly from bash) that > > checks to make sure that the system has the necessary sigaction support? > > Every place I've checked has the SA_RESTART flag and the sa_handler member, > > but neither are specified by POSIX so there may be some system out there > > that doesn't have them. I would be happy to help with the configure test > > if people think it should be implemented. > > These are correct things to do in principle, IMHO. The configure test would have > to do some sort of 'POSIX capability probe' to determine if the box's POSIX > implementation supports SPEC 1170. I'm not real smart about configure issues; > I'm not sure how to write it in GNU configure-ese. If you could help here, that > would be useful. I 100% agree here. 
We should replace _all_ calls to signal() with our own wrapper but I'm afraid I have too few configure knowledge to hack it (it took me a whole day to hack yosh's proposed gtkxmhtml configure test to work on solaris before yosh applied it...) > But - in practice - I don't think we have to worry much. ,The condition seems to > be very timing dependent, observed consistently so far just in OSF/1, and, happily, > that O/S implements Spec 1170. Up to now, the set of O/S'es that exhibit > the behavior and are not Spec 1170 seems to be null. Closure does call for > an appropriate configure test, but conditions aren't insisting on Closure > Right At This Instant!!!! > > > Should hook_signal be placed in its own file, and named something > > like gimp_os_signal(), so that it can be > > linked into both the app and libgimp, and provided for plug-in writers to use > > as an abstraction to the OS's signal mechanism? > > I agree on this. plug-ins fork processes as well (gz comes to mind) and can > face this issue in principle as well. They should have the exact same abstraction > of establishing signal handlers and managing the signal environment. Me too, a libgimp function (with an included SIGCHLD handler) is imho the way to go here. This is also a way to get rid of the signal handling stuff in app/main.c. > Maybe gimp_sigaction(), because that what it is extruding (POSIX sigaction()) > but its implementation would be essentially Austin's hook_signal() > > I'm at the code relocation and retest phase, with a check-in around midweek > unless somebody (everybody) tells me I'm crazy. Tim, I trust you will test > the commit sometimes soon after. You're not crazy :) Please go ahead, this is a big issue. BTW, to really get rid of strange child exits and to deliver messages about their death correctly, we could also wrap fork() with out own function and keep a list of started processes there. 
The GNU info pages section libc --> "Signal Handling" --> "Defining Handlers" --> "Merged Handlers" has an excellent example of a SIGCHLD handler which is safe against race condition and stuff. We could then traverse our list of children in a periodically called idle function (the shell does it before displaying the prompt) and pop up real error messages instead of spitting out stuff on the console. Or is this overkill?? Thanks for the debugging to all of you!! bye, --Mitch ------- Additional Comments From gosgood@idt.net 2000-06-12 21:04:10 ---- Subject: [Fwd: Re: Closing #2742] From: "Garry R. Osgood" <gosgood@idt.net> To: 2742@bugs.gnome.org Message-Id: <3945888A.B45A7FD1@idt.net> Date: Mon, 12 Jun 2000 21:04:10 -0400 FYI Austin Donnelly wrote a test which isolated a problem with Compaq (DEC OSF/1) not handling all sa_flags in the struct sigaction object. In particular, SA_RESTART did not function in a Posix compliant way when the stream was a pipe. In that case, behaviour was as if the SA_RESTART flag was never requested and system calls interrupted by action handlers returned EINTR. This affected, in particular, the reading from pipes that gimp does with plugins. Tim Mooney is pursuing a incident report with Compaq/DEC (See attached) But, in the course of writing a workaround to permit current Compaq OSF/1 releases to function with Gimp and plugins, Mitch Natterer uncovered weakness in how glib g_io_channels percolate EINTR up to applications like Gimp, forcing Mitch to write quite a bit of awkwardness at the Gimp level. According to Austin, Tim Janik is planning on g_io_channels to be more discriptive in percolating error conditions upward. So, #2742 has been temporarily averted by temporary patchwork that an improved glib will retire. To keep ourselves reminded of this matter, I propose keeping this bug open. It closes when what amounts to a g_io_channels workaround can be safely retired. 
Be good, be well -------- Original Message -------- X-Mozilla-Status: 0001 X-Mozilla-Status2: 00000000 Message-ID: <39458273.25E60831@idt.net> Date: Mon, 12 Jun 2000 20:38:12 -0400 From: "Garry R. Osgood" <gosgood@idt.net> X-Mailer: Mozilla 4.51C-SGI [en] (X11; I; IRIX 6.5 IP20) X-Accept-Language: en, zh-TW MIME-Version: 1.0 To: Austin Donnelly <austin@gimp.org> CC: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>,Michael Natterer <mitch@gimp.org>, Tim Janik <timj@gtk.org> Subject: Re: Closing #2742 References: <39455469.9F24F12C@idt.net><Pine.OSF.4.21.0006121622300.13913-100000@dogbert.cc.ndsu.nodak.edu> <14661.22824.931934.547816@hornet.cl.cam.ac.uk> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Austin Donnelly wrote: > On Monday, 12 Jun 2000, Tim Mooney wrote: > > > In regard to: Closing #2742, Garry R. Osgood said (at 5:21pm on Jun 12, 2000): > > > > >Since Austin isolated #2742 to an OS issue, and since > > >Mitch's workaround seems contained in light of that, > > >would #2742 be now a closed item? > > I spoke to Tim Janik at GimpCon, and we agree the current hack in the > gimp is too ugly to live much longer. > > He suggested I file a bug report on glib to add the necessary > g_io_channel() error returns, then gimp should use that. Apparently, > gimp is one of the few programs that actually uses g_io_channels. > > Tim promised me a stable 1.2.x glib release with such a fix could be > made reasonably quickly. > > Austin Well, since active bug list is - in some respects - the most well-kept TODO list, for 1.2 ;) and since Mitch's workaround is something we prefer not to be in 1.2, I propose keeping #2742 open with the glib fix as a necessary action item. So I'll forward this brief flurry of email to 2742@bugs.gnome.org with the suggestion that it closes when (1) g_io_channel() error returns are expanded in the next glib release (2) Mitch has the opportunity of backing out his workaround. 
and (3) the DEC OSF/1 platforms can still talk with plugins. Thanks, all Garry ==================================================================================== In regard to: Closing #2742, Garry R. Osgood said (at 5:21pm on Jun 12, 2000): >Since Austin isolated #2742 to an OS issue, and since >Mitch's workaround seems contained in light of that, >would #2742 be now a closed item? I think so. I still have a call open with Compaq regarding the issue. The people I've spoken with agree with me that it's a problem and needs to be fixed, but since things are working the way they're (poorly) documented to work via the man page for pipe(2) on Tru64, it gets a lower priority than things that aren't working the way they're documented to work. Garry, my suggestion is that you add a comment to #2742 indicating that it's a known issue with signal handling on OSF1/Tru64, and is not a problem with the GIMP. You might also wish to mention that if Tru64 users want this fixed and they have software support, they should call and open a support call regarding the issue. They can reference my support call #, C000515-1805 The more people that open calls regarding the issue, the faster something will happen. Thanks! Tim -- Tim Mooney mooney@dogbert.cc.ndsu.NoDak.edu Information Technology Services (701) 231-1076 (Voice) Room 242-J1, IACC Building (701) 231-8541 (Fax) North Dakota State University, Fargo, ND 58105-5164 ------- Bug moved to this database by debbugs-export@bugzilla.gnome.org 2001-01-28 10:47 ------- This bug was previously known as bug 2742 at http://bugs.gnome.org/ http://bugs.gnome.org/show_bug.cgi?id=2742 Originally filed under the gimp product and general component. The original reporter (angel@miami.edu) of this bug does not have an account here. Reassigning to the exporter, debbugs-export@bugzilla.gnome.org. Reassigning to the default owner of the component, egger@suse.de.
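For readers following the thread, the approach discussed above (install handlers with sigaction(2) and request SA_RESTART, instead of calling signal()) can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions, not the code that was committed to GIMP: install_restarting_handler(), handler_restarts() and note_signal() are hypothetical names. Unlike the posted patch, the sketch zeroes the struct and does not assign sa_sigaction, because POSIX allows that member to share storage with sa_handler.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose sigaction() and SA_RESTART in strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical flag set by the example handler. */
volatile sig_atomic_t signal_seen = 0;

/* Hypothetical handler: only records that a signal arrived. */
void note_signal(int signum)
{
    (void) signum;
    signal_seen = 1;
}

/* Install handler for signum with sigaction(2), asking the system to
 * restart interrupted system calls (SA_RESTART).
 * Returns 0 on success, -1 on failure (errno is set by sigaction). */
int install_restarting_handler(int signum, void (*handler)(int))
{
    struct sigaction sa;

    assert(handler != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigfillset(&sa.sa_mask);    /* block all signals while the handler runs */
    sa.sa_flags = SA_RESTART;

    return sigaction(signum, &sa, NULL);
}

/* Report whether the current disposition of signum has SA_RESTART set. */
int handler_restarts(int signum)
{
    struct sigaction current;

    if (sigaction(signum, NULL, &current) < 0)
        return 0;
    return (current.sa_flags & SA_RESTART) != 0;
}
```

A caller would typically use install_restarting_handler(SIGCHLD, some_handler) once at startup. As the later comments in this report note, SA_RESTART is not available on every POSIX system and did not take effect for pipes on the affected Tru64 releases, so a flag check like handler_restarts() shows only what was requested, not what the OS will actually do.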
Re-assigning all Gimp bugs to default component owner (Gimp bugs list)
Can anyone check the status of this bug report? This is the oldest open GIMP bug in our database and the last update occurred almost two years ago (June 2000).
I'll close this report now since no one has complained about this problem for more than two years.
The fix is part of the stable release 1.2.4 (or earlier, hopefully). Closing this bug.
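The interim workaround mentioned in this report, coping with EINTR at the application level when SA_RESTART cannot be relied on, usually amounts to retrying the interrupted call. The fragment below is a generic sketch of that idea, together with a non-blocking loop for reaping exited children of the kind a SIGCHLD handler might run. It is not the code shipped in GIMP 1.2.x; read_retry() and reap_children() are hypothetical names chosen for illustration.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose POSIX declarations in strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: read(2) that retries when a signal interrupts
 * the call before any data was transferred (errno == EINTR).
 * Returns what read() returns once it is no longer interrupted. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;
}

/* Hypothetical helper: reap every child that has already exited,
 * without blocking. Returns the number of children reaped.
 * errno is saved and restored so the function can be called from a
 * SIGCHLD handler without disturbing the interrupted code. */
int reap_children(void)
{
    int saved_errno = errno;
    int reaped = 0;
    int status;

    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;

    errno = saved_errno;
    return reaped;
}
```

With a wrapper like read_retry() on the pipe to a plug-in, a SIGCHLD arriving mid-read would no longer surface as a read error, whether or not the OS honours SA_RESTART.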