Bug 2742 - plug-ins do not load due to signal problems
Status: VERIFIED FIXED
Product: GIMP
Classification: Other
Component: General
Version: 1.x
Hardware: Other OSF/1
Importance: Normal major
Target Milestone: ---
Assigned To: GIMP Bugs
QA Contact: GIMP Bugs
Depends on:
Blocks: 6050
 
 
Reported: 1999-10-13 12:10 UTC by angel
Modified: 2009-08-15 18:40 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description angel 2001-01-28 15:47:42 UTC
Package: gimp
Version: 1.1.10

Name........: angel li
Email.......: angel@miami.edu
Platform....: Compaq Alpha, Digital Unix
GIMP Version: 1.1.10
GTK Version.: 1.2.6


-- Other system notes:

--

-- Problem description:
None of the plug-ins work. When gimp tries to initialize
them, they get a wire_read error.
--


-- How to repeat:

--


-- Other comments:

--




------- Additional Comments From gosgood@idt.net 2000-03-12 08:23:50 ----

Subject: Earlier Manifestation of #6050
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38CB9A66.6B0DB270@idt.net>
Date: Sun, 12 Mar 2000 08:23:50 -0500

Not much detail in this report. Perhaps angel@miami.edu would
care to confirm that the build used Digital Tru64 4.0f and its C compiler
(cc -V returns   DEC C V5.9-008 on Digital UNIX V4.0 (Rev. 1229) or
some such), as I suspect may have been the case.

G. R. Osgood.




------- Additional Comments From gosgood@idt.net 2000-03-13 19:16:03 ----

Subject: [Fwd: Re: Still having trouble building gimp plug-ins?]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38CD84C3.77E245FC@idt.net>
Date: Mon, 13 Mar 2000 19:16:03 -0500

All

Angel Li <angel@rrsl.rsmas.miami.edu>, who
originated #2742 (which I merged with #6050)
performed builds with DEC 4.0D and the newest
5.0 with success, but continued to fail with
4.0F; is going to try a patched version of 4.0F
to see if that succeeds.

Methinks this can be classed as a compiler
and not a Gimp problem, but let's keep these
open for a little while yet.

Be good, be well

Garry Osgood


-------- Original Message --------
Date: Mon, 13 Mar 2000 17:14:54 -0500 (EST)
From: Angel Li <angel@rrsl.rsmas.miami.edu>
To: "Garry R. Osgood" <gosgood@idt.net>
Subject: Re: Still having trouble building gimp plug-ins?
In-Reply-To: <38CBDC06.D34C9FC@idt.net>
Message-ID: <Pine.OSF.4.21.0003131712210.18585-100000@flipper.rrsl.rsmas.miami.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Sun, 12 Mar 2000, Garry R. Osgood wrote:

> Hi
> 
> This is about a gimp bug report
> you sent in some time ago. People
> using DEC compilers on Alphas have
> had similar problems building
> plug-ins. Could you check bug #6050
> (see below) and confirm if the
> compiler matches your build
> environment?
> 
> Are you still having trouble
> building plug-ins?
> 
> If so, the workaround may be
> to use an earlier compiler
> version (search on SHIRASAKI Yasuhiro
> below).
> 
> Thanks in advance for your feedback.
> 
Hi,

I did a build with a previous version of the compiler and plugins
work! I also did a build with the compiler that's bundled with
the newest version of the OS and it also worked.

To summarize,

	Digital Unix version 4.0D is OK
	Digital Unix version 4.0F is not OK
	Digital Unix version 5.0 is OK

Some patches just came out for 4.0F. I'll report back if they
fix the compiler.

Angel




------- Additional Comments From gosgood@idt.net 2000-04-09 21:25:26 ----

Subject: [Fwd: Re: wire read: error: found it!]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38F12D86.8A5C7049@idt.net>
Date: Sun, 09 Apr 2000 21:25:26 -0400

FYI Tim Mooney's test of Austin Donnelly's patch -- GRO

-------- Original Message --------
Date: Sun, 9 Apr 2000 18:26:18 -0500 (CDT)
From: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>
To: "Garry R. Osgood" <gosgood@idt.net>
cc: Austin Donnelly <Austin.Donnelly@cl.cam.ac.uk>
Subject: Re: wire read: error: found it!
In-Reply-To: <38EE9C73.3D858DA1@idt.net>
Message-ID: <Pine.OSF.4.21.0004082223440.881-100000@dogbert.cc.ndsu.nodak.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

In regard to: Re: wire read: error: found it!, Garry R. Osgood said (at...:

>> Looks like a g_io_channel_read() is returning EINTR from a SIGCHLD.
>> The SIGCHLD is probably because the plugin died.
>>
>> On OSF/1, it looks like signal() doesn't install restarting signal
>> handlers.  We should _really_ be using sigaction(2) since this solves
>> the problem in a portable manner.

I've been thinking about this some, and doing some reading.  I'm frankly
surprised that this problem hasn't been reported for more platforms --
Solaris and HP-UX both have `signal' functions that are SysV-based, so
they're even more dangerous than the signal() on systems with a signal()
that is BSD-like (like OSF/1 / Tru64).

I also spent some time looking at the man page for signal(2) on Tru64, and
although the wording is a little murky, Austin is definitely correct that
the default signal(2) semantics are the BSD-like signal() *without* restarting
system calls that were interrupted.

So far I have tested Austin's patch on:

	alpha-dec-osf4.0d (+ patch kit #6)
	alpha-dec-osf4.0f (+ patch kit #2)
	alpha-dec-osf4.0f (+ patch kit #3)
	alphaev56-dec-osf4.0f (+ patch kit #3)
	alpha-dec-osf5.0 (+ patch kit #1)

In all cases, the patch greatly improves the situation.  Where before on
Tru64 Unix there would be anywhere from a few to all of the plug-ins
erroring out when initially queried, now *none* of them do.

On my work desktop machine, which is the alphaev56-dec-osf4.0f listed above,
I do still get a long hang followed by a segv from extension_script_fu:

/local/gnu/lib/X11/gimp/1.1/plug-ins/script-fu: Segmentation fault caught
/local/gnu/lib/X11/gimp/1.1/plug-ins/script-fu (pid:1415): [E]xit, [H]alt, show
[S]tack trace or [P]roceed: S
  #0 g_on_error_stack_trace
  #1 g_on_error_query
  #2 gimp_request_wakeups
  #3 <signal handler called>




This doesn't happen on any of the other machines I tested on, only my
workstation.  I don't think it's related to the issue Austin's fixed.  The
segv is happening in the script_fu_find_scripts() (I think it's happening
in the repl_c_string() routine, but I'm not sure yet) procedure called from
script-fu's run().

I planned to test 1.1.19 with the patch on powerpc-ibm-aix4.3.2.0,
sparc-sun-solaris2.6, sparc-sun-solaris2.7, and hppa1.1-hp-hpux10.20, but
so far 1.1.19 has had compile problems on aix and hpux, so I don't know what
effect the patch has, if any.  I skipped testing on IRIX, since you said
you were going to do that, Garry.

At this point, I can definitely say that the patch fixes the problem on
Tru64 Unix. 

Should other instances of signal() in the gimp source base be stamped out?
Should a configure test be written (or stolen, possibly from bash) that
checks to make sure that the system has the necessary sigaction support?
Every place I've checked has the SA_RESTART flag and the sa_handler member,
but neither are specified by POSIX so there may be some system out there
that doesn't have them.  I would be happy to help with the configure test
if people think it should be implemented.  Should hook_signal be placed in
its own file, and named something like gimp_os_signal(), so that it can be
linked into both the app and libgimp, and provided for plug-in writers to use
as an abstraction to the OS's signal mechanism?

Just trying to make sure we've covered all the bases...

Thanks again for your work,

Tim
-- 
Tim Mooney                              mooney@dogbert.cc.ndsu.NoDak.edu
Information Technology Services         (701) 231-1076 (Voice)
Room 242-J1, IACC Building              (701) 231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164




------- Additional Comments From gosgood@idt.net 2000-04-09 21:25:30 ----

Subject: [Fwd: wire read: error: found it!]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38F12D8A.42593CEB@idt.net>
Date: Sun, 09 Apr 2000 21:25:30 -0400

Austin Donnelly isolated a probable cause and proposes a patch. Tim Mooney to test - GRO

-------- Original Message --------
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <14574.6033.473500.26028@hornet.cl.cam.ac.uk>
Date: Fri, 7 Apr 2000 18:14:57 +0100 (BST)
From: Austin Donnelly <Austin.Donnelly@cl.cam.ac.uk>
To: "Garry R. Osgood" <gosgood@idt.net>
Subject: wire read: error: found it!
In-Reply-To: <38ED34C2.5A05EA7A@idt.net>
References:
<Pine.OSF.4.21.0004041728180.32587-100000@dogbert.cc.ndsu.nodak.edu><38EA9379.E58DF582@idt.net><14571.7219.908672.263324@hornet.cl.cam.ac.uk><14572.42492.89013.72562@hornet.cl.cam.ac.uk><14572.47055.189071.417615@hornet.cl.cam.ac.uk><38ED34C2.5A05EA7A@idt.net>
X-Mailer: VM 6.75 under Emacs 20.6.1
Sender: Austin Donnelly <Austin.Donnelly@cl.cam.ac.uk>

Race is between:

static void
plug_in_query (gchar     *filename,
	       PlugInDef *plug_in_def)
{
  PlugIn *plug_in;
  WireMessage msg;

  plug_in = plug_in_new (filename);
  if (plug_in)
    {
      plug_in->query = TRUE;
      plug_in->synchronous = TRUE;
      plug_in->user_data = plug_in_def;

      if (plug_in_open (plug_in))  <<<--- this
	{
	  plug_in_push (plug_in);

	  while (plug_in->open)
	    {
	      if (!wire_read_msg (current_readchannel, &msg))  <<<-- this
		plug_in_close (current_plug_in, TRUE);
	      else 
		{
		  plug_in_handle_message (&msg);
		  wire_destroy (&msg);
		}
	    }

	  plug_in_pop ();
	  plug_in_destroy (plug_in);
	}
    }
}

Looks like a g_io_channel_read() is returning EINTR from a SIGCHLD.
The SIGCHLD is probably because the plugin died.

On OSF/1, it looks like signal() doesn't install restarting signal
handlers.  We should _really_ be using sigaction(2) since this solves
the problem in a portable manner.

I've got a patch against 1.1.18 that fixes this: please try it out and
check it in if it looks ok.  I've lightly tested it on Linux ix86
Red Hat 6.1.92, and OSF1 3.2D.

I'll leave you to close the relevant bug reports.

Thanks,
Austin
------------------------------------------------------------
--- main.c~     Wed Feb 23 20:25:23 2000
+++ main.c      Fri Apr  7 18:05:29 2000
@@ -86,6 +86,35 @@
 static gint    gimp_argc = 0;
 static gchar **gimp_argv = NULL;
 
+
+/* hook_signal: Cause handler to be run when signum is delivered.  We
+ * use sigaction(2) rather than signal(2) so that we can control the
+ * signal handler's environment completely: some signal(2)
+ * implementations differ in their semantics, so we need to nail down
+ * exactly what we want.  */
+static void
+hook_signal (int signum, RETSIGTYPE (*handler)(int))
+{
+    int ret;
+    struct sigaction sa;
+
+    sa.sa_handler = handler;
+    sa.sa_sigaction = NULL;
+
+    /* Mask all signals while handler runs to avoid re-entrancy
+     * problems. */
+    sigfillset (&sa.sa_mask);
+
+    /* Must restart syscalls else get EINTR on g_io_channel_read()
+     * occasionally. */
+    sa.sa_flags = SA_RESTART;
+
+    ret = sigaction (signum, &sa, NULL);
+    if (ret < 0)
+       gimp_fatal_error ("unable to hook signal %d\n", signum);
+}
+
+
 /*
  *  argv processing: 
  *      Arguments are either switches, their associated
@@ -104,8 +133,7 @@
  *
  *      The exception is the batch switch.  When this is
  *      encountered, all remaining args are treated as batch
- *      commands.
- */
+ * commands.  */
 
 int
 main (int    argc,
@@ -325,36 +353,36 @@
 
   /* Handle some signals */
 #ifdef SIGHUP
-  signal (SIGHUP, on_signal);
+  hook_signal (SIGHUP, on_signal);
 #endif
 #ifdef SIGINT
-  signal (SIGINT, on_signal);
+  hook_signal (SIGINT, on_signal);
 #endif
 #ifdef SIGQUIT
-  signal (SIGQUIT, on_signal);
+  hook_signal (SIGQUIT, on_signal);
 #endif
 #ifdef SIGABRT
-  signal (SIGABRT, on_signal);
+  hook_signal (SIGABRT, on_signal);
 #endif
 #ifdef SIGBUS
-  signal (SIGBUS, on_signal);
+  hook_signal (SIGBUS, on_signal);
 #endif
 #ifdef SIGSEGV
-  signal (SIGSEGV, on_signal);
+  hook_signal (SIGSEGV, on_signal);
 #endif
 #ifdef SIGPIPE
-  signal (SIGPIPE, on_signal);
+  hook_signal (SIGPIPE, on_signal);
 #endif
 #ifdef SIGTERM
-  signal (SIGTERM, on_signal);
+  hook_signal (SIGTERM, on_signal);
 #endif
 #ifdef SIGFPE
-  signal (SIGFPE, on_signal);
+  hook_signal (SIGFPE, on_signal);
 #endif
 
 #ifdef SIGCHLD
   /* Handle child exits */
-  signal (SIGCHLD, on_sig_child);
+  hook_signal (SIGCHLD, on_sig_child);
 #endif
 
 #endif
------------------------------------------------------------
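
One portability detail in hook_signal() is worth a second look: POSIX allows sa_handler and sa_sigaction to share storage, so on systems where they overlap, assigning sa_sigaction = NULL after setting sa_handler can wipe out the handler. The following is a minimal standalone sketch of a variant that zeroes the structure and sets only sa_handler. It is not the code that was committed; hook_signal_sketch, record_signal and hook_signal_demo are hypothetical names, and plain void stands in for RETSIGTYPE.

```c
#define _XOPEN_SOURCE 700   /* ask for sigaction() and SA_RESTART */
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t last_signal = 0;

/* Trivial handler used by the demo: remembers which signal arrived. */
static void record_signal(int signum) { last_signal = signum; }

/* Same idea as hook_signal() in the patch, with two changes: the struct
 * is zeroed first, and only sa_handler is assigned, since sa_handler and
 * sa_sigaction may occupy the same storage.
 * Returns 0 on success and -1 on failure, as sigaction() does. */
int hook_signal_sketch(int signum, void (*handler)(int))
{
    struct sigaction sa;

    assert(handler != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigfillset(&sa.sa_mask);    /* block all signals while the handler runs */
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(signum, &sa, NULL);
}

/* Delivers SIGUSR1 to this process and reports whether the handler ran:
 * 1 if it did, 0 if it did not, -1 if the handler could not be installed. */
int hook_signal_demo(void)
{
    last_signal = 0;
    if (hook_signal_sketch(SIGUSR1, record_signal) < 0)
        return -1;
    raise(SIGUSR1);
    return last_signal == SIGUSR1;
}
```

If the handler were lost to the overlap, the default action for SIGUSR1 would terminate the process, so hook_signal_demo() returning at all is already a useful check.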




------- Additional Comments From gosgood@idt.net 2000-04-15 21:41:32 ----

Subject: [Fwd: Re: wire read: error: found it!]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <38F91A4C.5BBA35C0@idt.net>
Date: Sat, 15 Apr 2000 21:41:32 -0400

FYI to bug report #2742 GR Osgood.

-------- Original Message --------
Message-ID: <38F9199B.B7C82043@idt.net>
Date: Sat, 15 Apr 2000 21:38:35 -0400
From: "Garry R. Osgood" <gosgood@idt.net>
X-Mailer: Mozilla 4.51C-SGI [en] (X11; I; IRIX 6.5 IP20)
X-Accept-Language: en, zh-TW
MIME-Version: 1.0
To: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>,Sven Neumann <neumanns@uni-duesseldorf.de>,Michael Natterer <mitch@gimp.org>, Tor Lillqvist <tml@iki.fi>
CC: Austin Donnelly <austin@gimp.org>
Subject: Re: wire read: error: found it!
References: <Pine.OSF.4.21.0004082223440.881-100000@dogbert.cc.ndsu.nodak.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Sven, Mitch,

I'm cc-ing you on this for comment; this concerns Bug # 2742
and a patch Austin Donnelly wrote that addresses the issue. The
patch itself was posted to #2742@bugs.gimp.org; it modifies
the signal environment of Spec 1170 compliant POSIX boxes
so that if a signal arrives at a gimp process while it is in a system
routine, that routine will "restart" following signal handling
(if the process survives signal handling, that is), rather than returning
-1 (error) and setting errno = EINTR (default POSIX behavior).

In particular with #2742, when an "out-of-box" gimp is
probing plug-in query() methods, it is launching many child
processes which may persist somewhat as defuncts/zombies
before the O/S reaps them and signals gimp with SIGCLD.
Concurrently, gimp makes a lot of calls to the system read()
function to pull bytes from the pipe connecting gimp to
the currently active child. It appears that on OSF/1 the
coincidence of gimp being in read() when SIGCLD arrives
is quite high, giving rise to the behaviour reported in
#2742.

Austin's patch makes use of a hook_signal() to set up the
signal environment; Following Tim Mooney's observations
after he tested the patch on DEC OSF/1 boxes, methinks he
is right in suggesting that it be promoted to a libgimp
function so that both core and plug-ins have the same
abstraction of the signal mechanism. Comments?

Tor,

I hope you can review this as well; I am laboring under
the happy illusion that if all of these code modifications
are wrapped in

#ifndef G_OS_WIN32

/* UNIX-SOLARIS-LINUX-IRIX-OSF/1 signal
   action stuff */

#endif

then the Windows versions will not be affected. Am I
right?

Tim, Austin (who's wandering around in Welsh hills, but will read this eventually),

Appropriately, on the anniversary of the sinking of the Titanic. I've
found time to step through the unpatched gimp investigating #2742
and Austin's patched version. I've got a comfortable idea what's going
on and how Austin's patch fixes it. It mostly concurs with and confirms
Austin's reasoning.

[Austin: pondering plug_in_query() on or about April 6]

>
> >> Looks like a g_io_channel_read() is returning EINTR from a SIGCHLD.
> >> The SIGCHLD is probably because the plugin died.
> >>

I've reproduced the condition - or a condition remarkably like it - on the SGI.

(0) Context is line 2316, plug_in_query(), plug_in.c-CVS-1.98 [April 11, 2000];
wire_read_msg() has been called and that bottoms out to g_io_channel_read().
plug_in_query() is getting bytes from the n-th plug-in that had been fork()-ed
in plug_in_open().

(1) g_io_channel_read() enters the system call read(2) to get the child process bytes
on the pipe (if any).

(2) Concurrently, and asynchronously, the n-th plug-in is executing its query() method;
at some juncture it may deposit bytes in the pipe.

(3) Also concurrently, the processes for plug-ins n-1, n-2, ... 1 (no telling how many)
have invoked exit() and are in various stages of cleanup or are defunct and awaiting
reaping by the O/S, which may take its own sweet time. They are in the process table.
They are zombies.

(4) At some juncture while the gimp process is in any number of  read() system calls,
the O/S gets a time quantum and uses it to clean up some defunct plug-in child process.
It sends the gimp process a SIGCLD (signal 18).

(5) What happens when a process is delivered a signal and it is in a system call? According
to [Robbins, Robbins], on a POSIX.1 compliant O/S, a "slow" system call such as read()
is to follow this policy:  "Fail. Return -1. set errno to EINTR." This concurs with the
observation Austin Donnelly made at the exit of g_io_channel_read().  I remark here
that the SIGCLD need not necessarily map to the n-th plug-in on the other side of the
wire; that may be functioning in an orderly manner. Any one of (or more than one of) the
plug-ins launched prior to the n-th one may still be around and exiting in an orderly
way; any number of these may be waiting to be reaped by the O/S at its leisure. When
the O/S does so, and the gimp process gets SIGCLD in the system read() call, that
call fails and returns -1, setting errno to EINTR.

(6) When read() returns, g_io_channel_read() returns G_IO_ERROR_UNKNOWN and a
bytes_read value of 0 (see g_io_unix_read(), line 159, glib-1.2.7/giounix.c).
The caller, wire_read(), maps this to the g_warning() "wire_read: error()" which observers
of bug #2742 have been reporting.

(7) This sequence of events can be reproduced on the SGI by artfully manipulating plug-in
processes with a debugger. I surmise that the prerequisite timing arises more naturally under
OSF/1 and for a number of plausible cases: there are dozens of invocations of read() for
every invocation of plug_in_query(). A little bit of latency in a read() implementation
can create opportunities for the gimp process to be in system code. Likewise, zombie
plug-ins may linger on OSF/1 longer.
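
The sequence in steps (0) to (7) can be reproduced without gimp. The following is a minimal sketch under stated assumptions, not code from the thread: a parent blocks in read() on a pipe, one child (standing in for an already-queried plug-in) exits after about 100 ms, and a second child (standing in for the plug-in being queried) writes one byte after about 400 ms. All names and timings are illustrative, and the outcome assumes the parent reaches read() before the first child exits.

```c
#define _XOPEN_SOURCE 700   /* ask for sigaction(), SA_RESTART, nanosleep() */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Does nothing; having any handler at all lets SIGCHLD interrupt read(). */
static void on_child_exit(int signum) { (void) signum; }

static void sleep_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* waitpid() that tolerates being interrupted by the SIGCHLD handler. */
static void reap(pid_t pid)
{
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        ;
}

/* sa_flags is 0 or SA_RESTART.  Returns 0 if read() delivered the byte,
 * the errno value (for example EINTR) if read() failed, and -1 if the
 * experiment could not be set up. */
int read_during_sigchld(int sa_flags)
{
    struct sigaction sa, old;
    int fds[2], result;
    pid_t early, writer;
    char byte;
    ssize_t n;

    assert(sa_flags == 0 || sa_flags == SA_RESTART);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_child_exit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = sa_flags;
    if (sigaction(SIGCHLD, &sa, &old) < 0)
        return -1;
    if (pipe(fds) < 0) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }

    early = fork();                     /* exits early, raising SIGCHLD */
    if (early == 0) { sleep_ms(100); _exit(0); }
    writer = fork();                    /* answers on the pipe later */
    if (writer == 0) {
        sleep_ms(400);
        _exit(write(fds[1], "x", 1) == 1 ? 0 : 1);
    }
    close(fds[1]);                      /* parent only reads */

    if (early < 0 || writer < 0) {
        result = -1;
    } else {
        n = read(fds[0], &byte, 1);     /* SIGCHLD arrives while blocked here */
        result = (n == 1) ? 0 : (n < 0 ? errno : -1);
    }
    if (early > 0) reap(early);
    if (writer > 0) reap(writer);
    close(fds[0]);
    sigaction(SIGCHLD, &old, NULL);
    return result;
}
```

On a system where SA_RESTART behaves as specified, read_during_sigchld(0) should report EINTR and read_during_sigchld(SA_RESTART) should report 0. The failing case is the one that surfaces in gimp as the wire_read error.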

Austin also said on or around April 6.

> >> On OSF/1, it looks like signal() doesn't install restarting signal
> >> handlers.  We should _really_ be using sigaction(2) since this solves
> >> the problem in a portable manner.

This is the case with SGI as well. Austin's solution was to use the POSIX sigaction(2)
call and pass in struct sigaction::sa_flags |= SA_RESTART and that works here as
well; then calls to read(), if interrupted, "rewind" and are attempted again. This is
transparent to gimp code: read() simply works "through" the SIGCLD signal.
But as Tim observed, SA_RESTART is not guaranteed to be supported on a POSIX
platform. According to [Robbins,Robbins] this flag is a SPEC 1170  extension.

> Tim Mooney observed:

> Should other instances of signal() in the gimp source base be stamped out?
> Should a configure test be written (or stolen, possibly from bash) that
> checks to make sure that the system has the necessary sigaction support?
> Every place I've checked has the SA_RESTART flag and the sa_handler member,
> but neither are specified by POSIX so there may be some system out there
> that doesn't have them.  I would be happy to help with the configure test
> if people think it should be implemented.

These are correct things to do in principle, IMHO. The configure test would have
to do some sort of 'POSIX capability probe' to determine if the box's POSIX
implementation supports SPEC 1170. I'm not real smart about configure issues;
I'm not sure how to write it in GNU configure-ese. If you could help here, that
would be useful.
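
A configure test of this kind usually boils down to compiling and running a small probe program. The following is a minimal sketch of such a probe in C, with a hypothetical function name; the autoconf wiring around it is not shown, and nothing in this thread says this is the test that was eventually written. It checks at compile time that SA_RESTART is defined and at run time that sigaction() accepts the flag and reports it back.

```c
#define _XOPEN_SOURCE 700   /* ask for sigaction() and SA_RESTART */
#include <signal.h>
#include <string.h>

static void probe_handler(int signum) { (void) signum; }

/* Returns 1 if a restarting handler can be installed with sigaction(),
 * 0 otherwise.  SIGUSR1 is used as a scratch signal and its previous
 * disposition is restored before returning. */
int have_restarting_sigaction(void)
{
#ifdef SA_RESTART
    struct sigaction sa, old, readback;
    int ok;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGUSR1, &sa, &old) < 0)
        return 0;
    ok = sigaction(SIGUSR1, NULL, &readback) == 0
         && (readback.sa_flags & SA_RESTART) != 0;
    sigaction(SIGUSR1, &old, NULL);
    return ok;
#else
    return 0;   /* flag not offered by this platform's headers */
#endif
}
```

A probe like this only shows that the flag is accepted; it does not prove that every system call is actually restarted, so a stricter test would also have to exercise an interrupted read().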

But - in practice - I don't think we have to worry much. The condition seems to
be very timing dependent, observed consistently so far just in OSF/1, and, happily,
that O/S implements Spec 1170.  Up to now, the set of O/S'es that exhibit
the behavior and are not Spec 1170 seems to be  null. Closure does call for
an appropriate configure test, but conditions aren't insisting on Closure
Right At This Instant!!!!

> Should hook_signal be placed in its own file, and named something
> like gimp_os_signal(), so that it can be
> linked into both the app and libgimp, and provided for plug-in writers to use
> as an abstraction to the OS's signal mechanism?

I agree on this. plug-ins fork processes as well (gz comes to mind) and can
face this issue in principle as well. They should have the exact same abstraction
of establishing signal handlers and managing the signal environment.

 Maybe gimp_sigaction(), because that's what it is extruding (POSIX sigaction()),
but its implementation would be essentially Austin's hook_signal().

I'm at the code relocation and  retest phase, with a check-in around midweek
unless somebody (everybody) tells me I'm crazy. Tim, I trust you will test
the commit sometime soon after.

Be good, all, be well


Garry Osgood

Ref [Robbins, Robbins] Robbins, Kay A; Robbins, Steven "Practical UNIX Programming:
    A Guide to Concurrency, Communication, and Multithreading" 1996 Prentice-Hall Inc
    ISBN 0-13-443706-3 See pp 188: "5.6 System Calls and Signals"




------- Additional Comments From mitschel@cs.tu-berlin.de 2000-04-16 08:02:49 ----

Subject: [Fwd: Re: wire read: error: found it!]
From: Michael Natterer <mitschel@cs.tu-berlin.de>
To: 2742@bugs.gnome.org
Message-Id: <38F9ABE9.5AD900A2@cs.tu-berlin.de>
Date: Sun, 16 Apr 2000 14:02:49 +0200

3rd try to forward it to bugs.gnome.org :)

-------- Original Message --------
Subject: Re: wire read: error: found it!
Date: Sun, 16 Apr 2000 13:42:11 +0200
From: Michael Natterer <mitschel@cs.tu-berlin.de>
To: "Garry R. Osgood" <gosgood@idt.net>
CC: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>,Sven Neumann
<neumanns@uni-duesseldorf.de>,Michael Natterer <mitch@gimp.org>, Tor Lillqvist
<tml@iki.fi>,Austin Donnelly <austin@gimp.org>
References: <Pine.OSF.4.21.0004082223440.881-100000@dogbert.cc.ndsu.nodak.edu>
<38F9199B.B7C82043@idt.net>

"Garry R. Osgood" wrote:
> 
> I'm cc-ing you on this for comment; this concerns Bug # 2742
> and a patch Austin Donnelly wrote that addresses the issue. The
> patch itself was posted to #2742@bugs.gimp.org; it modifies
> the signal environment of Spec 1170 compliant POSIX boxes
> so that if a signal arrives at a gimp process while it is in a system
> routine, that routine will "restart" following signal handling
> (if the process survives signal handling, that is), and not return
> -1 (error) and setting errno = EINTR (default POSIX behavior).

Yes, this is how I understand POSIX signals to work.

> In particular with #2742, when an "out-of-box" gimp is
> probing plug-in query() methods, it is launching many child
> processes which may persist somewhat as defuncts/zombies
> before the O/S reaps them and signals gimp with SIGCLD.
> Concurrently, gimp makes a lot of calls to the system read()
> function to pull bytes from the pipe connecting gimp to
> the currently active child. It appears that on OSF/1 the
> coincidence of gimp being in read() when SIGCLD arrives
> is quite high, giving rise to the behaviour reported in
> #2742.

I'm quite sure that this is not only an OSF/1 issue but can occur
with all UNIX variants out there. The reason why most people don't
get these errors might be that esp. Linux behaves _very_ programmer
friendly in regard to signals. (Well, basically it should do the same,
but my theory is that it does magic things to minimize signals
interrupting system calls. And yes, this is just a theory :) )

(...)

Your analysis seems to reflect exactly what happens...

> Austin also said on or around April 6.
> 
> > >> On OSF/1, it looks like signal() doesn't install restarting signal
> > >> handlers.  We should _really_ be using sigaction(2) since this solves
> > >> the problem in a portable manner.
> 
> This is the case with SGI as well. Austin's solution was to use the POSIX sigaction(2)
> call and pass in struct sigaction::sa_flags |= SA_RESTART and that works here as
> well; then calls to read(), if interrupted, "rewind" and are attempted again. This is
> transparent to gimp code: read() simply works "through" the SIGCLD signal.
> But as Tim observed, SA_RESTART is not guaranteed to be supported on a POSIX
> platform. According to [Robbins,Robbins] this flag is a SPEC 1170  extension.

Oh yes (YES!!), strangely enough, I've been teaching UNIX to students for 3 years now
and the primary goal when it comes to signals is teaching them: "use sigaction()
instead of signal()" -- I should have noticed this before :-)

> > Tim Mooney observed:
> 
> > Should other instances of signal() in the gimp source base be stamped out?
> > Should a configure test be written (or stolen, possibly from bash) that
> > checks to make sure that the system has the necessary sigaction support?
> > Every place I've checked has the SA_RESTART flag and the sa_handler member,
> > but neither are specified by POSIX so there may be some system out there
> > that doesn't have them.  I would be happy to help with the configure test
> > if people think it should be implemented.
> 
> These are correct things to do in principle, IMHO. The configure test would have
> to do some sort of 'POSIX capability probe' to determine if the box's POSIX
> implementation supports SPEC 1170. I'm not real smart about configure issues;
> I'm not sure how to write it in GNU configure-ese. If you could help here, that
> would be useful.

I 100% agree here. We should replace _all_ calls to signal() with our own
wrapper but I'm afraid I have too little configure knowledge to hack it
(it took me a whole day to hack yosh's proposed gtkxmhtml configure test
to work on solaris before yosh applied it...)

> But - in practice - I don't think we have to worry much. The condition seems to
> be very timing dependent, observed consistently so far just in OSF/1, and, happily,
> that O/S implements Spec 1170.  Up to now, the set of O/S'es that exhibit
> the behavior and are not Spec 1170 seems to be  null. Closure does call for
> an appropriate configure test, but conditions aren't insisting on Closure
> Right At This Instant!!!!
> 
> > Should hook_signal be placed in its own file, and named something
> > like gimp_os_signal(), so that it can be
> > linked into both the app and libgimp, and provided for plug-in writers to use
> > as an abstraction to the OS's signal mechanism?
> 
> I agree on this. plug-ins fork processes as well (gz comes to mind) and can
> face this issue in principle as well. They should have the exact same abstraction
> of establishing signal handlers and managing the signal environment.

Me too, a libgimp function (with an included SIGCHLD handler)  is imho the way
to go here. This is also a way to get rid of the signal handling stuff in
app/main.c.

>  Maybe gimp_sigaction(), because that what it is extruding (POSIX sigaction())
> but its implementation  would be essentially Austin's hook_signal()
> 
> I'm at the code relocation and  retest phase, with a check-in around midweek
> unless somebody (everybody) tells me I'm crazy. Tim, I trust you will test
> the commit sometimes soon after.

You're not crazy :) Please go ahead, this is a big issue.

BTW, to really get rid of strange child exits and to deliver messages about their
death correctly, we could also wrap fork() with our own function and keep
a list of started processes there. The GNU info pages section

libc --> "Signal Handling" --> "Defining Handlers" --> "Merged Handlers"

has an excellent example of a SIGCHLD handler which is safe against race
condition and stuff.

We could then traverse our list of children in a periodically called idle function
(the shell does it before displaying the prompt) and pop up real error messages
instead of spitting out stuff on the console.

Or is this overkill??
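
The fork() wrapper and idle-time traversal proposed above could look roughly like this. It is a minimal sketch, not the glibc manual's example and not anything from the gimp tree: every name here (tracked_fork, reap_exited_children and so on) is hypothetical, and the fixed-size table stands in for whatever list the application would really keep. The handler only sets a flag; all bookkeeping happens later in ordinary program context, which avoids races between the handler and the table.

```c
#define _XOPEN_SOURCE 700   /* ask for sigaction(), SA_RESTART, nanosleep() */
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_CHILDREN 64

struct child_record {
    pid_t pid;
    int   running;   /* 1 until the child has been reaped */
    int   status;    /* raw waitpid() status once reaped */
};

static struct child_record children[MAX_CHILDREN];
static int n_children = 0;
static volatile sig_atomic_t child_exited = 0;

/* Async-signal-safe: records that some child exited, nothing more. */
static void note_child_exit(int signum) { (void) signum; child_exited = 1; }

int child_tracking_init(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_child_exit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* fork() wrapper: returns what fork() returns, or -1 if the table is full. */
pid_t tracked_fork(void)
{
    pid_t pid;

    if (n_children >= MAX_CHILDREN)
        return -1;
    pid = fork();
    if (pid > 0) {
        children[n_children].pid = pid;
        children[n_children].running = 1;
        children[n_children].status = 0;
        n_children++;
    }
    return pid;
}

/* Meant to be called from an idle function.  Returns the number of
 * tracked children reaped by this call. */
int reap_exited_children(void)
{
    int reaped = 0, status, i;
    pid_t pid;

    if (!child_exited)
        return 0;
    child_exited = 0;   /* clear first: a later SIGCHLD sets it again */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        for (i = 0; i < n_children; i++)
            if (children[i].running && children[i].pid == pid) {
                children[i].running = 0;
                children[i].status = status;
                reaped++;
            }
    return reaped;
}

/* Demo: start a child that exits with `code`, poll the idle function
 * until it is reaped, and return the exit code that was recorded. */
int demo_tracked_exit(int code)
{
    struct timespec tick;
    pid_t pid;
    int slot;

    tick.tv_sec = 0;
    tick.tv_nsec = 10 * 1000000L;   /* 10 ms between polls */
    if (child_tracking_init() < 0)
        return -1;
    pid = tracked_fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);
    slot = n_children - 1;
    while (children[slot].running) {
        reap_exited_children();
        if (children[slot].running)
            nanosleep(&tick, NULL);
    }
    return WIFEXITED(children[slot].status)
           ? WEXITSTATUS(children[slot].status) : -1;
}
```

With the statuses stored per child, the idle function is the natural place to turn an abnormal exit into a proper error dialog instead of console output.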

Thanks for the debugging to all of you!!

bye,
--Mitch




------- Additional Comments From gosgood@idt.net 2000-06-12 21:04:10 ----

Subject: [Fwd: Re: Closing #2742]
From: "Garry R. Osgood" <gosgood@idt.net>
To: 2742@bugs.gnome.org
Message-Id: <3945888A.B45A7FD1@idt.net>
Date: Mon, 12 Jun 2000 21:04:10 -0400

FYI

Austin Donnelly wrote a test which isolated a problem with Compaq (DEC OSF/1) not handling all sa_flags
in the struct sigaction object. In particular, SA_RESTART did not function in a POSIX-compliant way when the stream
was a pipe. In that case, behaviour was as if the SA_RESTART flag was never requested, and system calls
interrupted by signal handlers returned EINTR. This affected, in particular, the reading from pipes that
gimp does with plugins. Tim Mooney is pursuing an incident report with Compaq/DEC (see attached).
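
The usual way to cope with an interrupted read(), independent of whether
SA_RESTART is honoured, is to retry on EINTR. The following is a minimal
sketch of that technique only; read_retry() is a hypothetical name, and this
is not the workaround that went into Gimp.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read up to `count` bytes from `fd`, restarting
 * read() by hand whenever a signal handler interrupts it with EINTR.
 * Returns the number of bytes read (less than `count` only at end of
 * file), or -1 on any error other than EINTR. */
ssize_t
read_retry (int fd, void *buf, size_t count)
{
  char   *p    = buf;
  size_t  done = 0;

  assert (buf != NULL);

  while (done < count)
    {
      ssize_t n = read (fd, p + done, count - done);

      if (n < 0)
        {
          if (errno == EINTR)
            continue;          /* interrupted by a signal: try again */
          return -1;           /* real error */
        }
      if (n == 0)
        break;                 /* end of file: writer closed the pipe */

      done += (size_t) n;
    }

  return (ssize_t) done;
}
```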

But, in the course of writing a workaround to permit current Compaq OSF/1 releases to function with Gimp and
plugins, Mitch Natterer uncovered a weakness in how glib g_io_channels percolate EINTR up to applications
like Gimp, forcing Mitch to write quite a bit of awkwardness at the
Gimp level. According to Austin, Tim Janik is planning to make g_io_channels more descriptive in percolating
error conditions upward.

So, #2742 has been temporarily averted by temporary patchwork that an improved glib will retire. To keep
ourselves reminded of this matter, I propose keeping this bug open. It closes when what amounts to a g_io_channels
workaround can be safely retired. 

Be good, be well
 

-------- Original Message --------
Message-ID: <39458273.25E60831@idt.net>
Date: Mon, 12 Jun 2000 20:38:12 -0400
From: "Garry R. Osgood" <gosgood@idt.net>
X-Mailer: Mozilla 4.51C-SGI [en] (X11; I; IRIX 6.5 IP20)
X-Accept-Language: en, zh-TW
MIME-Version: 1.0
To: Austin Donnelly <austin@gimp.org>
CC: Tim Mooney <mooney@dogbert.cc.ndsu.nodak.edu>,Michael Natterer <mitch@gimp.org>, Tim Janik <timj@gtk.org>
Subject: Re: Closing #2742
References: <39455469.9F24F12C@idt.net><Pine.OSF.4.21.0006121622300.13913-100000@dogbert.cc.ndsu.nodak.edu> <14661.22824.931934.547816@hornet.cl.cam.ac.uk>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Austin Donnelly wrote:

> On Monday, 12 Jun 2000, Tim Mooney wrote:
>
> > In regard to: Closing #2742, Garry R. Osgood said (at 5:21pm on Jun 12, 2000):
> >
> > >Since Austin isolated #2742 to an OS issue, and since
> > >Mitch's workaround seems contained in light of that,
> > >would #2742 be now a closed item?
>
> I spoke to Tim Janik at GimpCon, and we agree the current hack in the
> gimp is too ugly to live much longer.
>
> He suggested I file a bug report on glib to add the necessary
> g_io_channel() error returns, then gimp should use that.  Apparently,
> gimp is one of the few programs that actually uses g_io_channels.
>
> Tim promised me a stable 1.2.x glib release with such a fix could be
> made reasonably quickly.
>
> Austin

Well, since the active bug list is - in some respects - the most well-kept
TODO list for 1.2 ;) and since Mitch's workaround is something we prefer not
to be in 1.2, I propose keeping #2742 open with the glib fix as a necessary
action item.  So I'll forward this brief flurry of email to 2742@bugs.gnome.org
with the suggestion that it closes when (1) g_io_channel() error returns are expanded
in the next glib release, (2) Mitch has the opportunity of backing out his workaround, and
(3) the DEC OSF/1 platforms can still talk with plugins.

Thanks, all

Garry

====================================================================================
In regard to: Closing #2742, Garry R. Osgood said (at 5:21pm on Jun 12, 2000):

>Since Austin isolated #2742 to an OS issue, and since
>Mitch's workaround seems contained in light of that,
>would #2742 be now a closed item?

I think so.  I still have a call open with Compaq regarding the issue.
The people I've spoken with agree with me that it's a problem and needs
to be fixed, but since things are working the way they're (poorly) documented
to work via the man page for pipe(2) on Tru64, it gets a lower priority than
things that aren't working the way they're documented to work.

Garry, my suggestion is that you add a comment to #2742 indicating that
it's a known issue with signal handling on OSF1/Tru64, and is not a
problem with the GIMP.  You might also wish to mention that if Tru64
users want this fixed and they have software support, they should call
and open a support call regarding the issue.  They can reference my
support call #,

        C000515-1805

The more people that open calls regarding the issue, the faster something
will happen.

Thanks!

Tim
-- 
Tim Mooney                              mooney@dogbert.cc.ndsu.NoDak.edu
Information Technology Services         (701) 231-1076 (Voice)
Room 242-J1, IACC Building              (701) 231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164




------- Bug moved to this database by debbugs-export@bugzilla.gnome.org 2001-01-28 10:47 -------
This bug was previously known as bug 2742 at http://bugs.gnome.org/
http://bugs.gnome.org/show_bug.cgi?id=2742
Originally filed under the gimp product and general component.

The original reporter (angel@miami.edu) of this bug does not have an account here.
Reassigning to the exporter, debbugs-export@bugzilla.gnome.org.
Reassigning to the default owner of the component, egger@suse.de.

Comment 1 Raphaël Quinet 2001-04-26 18:12:40 UTC
Re-assigning all Gimp bugs to default component owner (Gimp bugs list)
Comment 2 Raphaël Quinet 2002-04-09 16:20:01 UTC
Can anyone check the status of this bug report?  This is the oldest
open GIMP bug in our database and the last update occurred almost two
years ago (June 2000).
Comment 3 Sven Neumann 2002-08-28 04:28:11 UTC
I'll close this report now since no one has complained about this problem
for more than two years.
Comment 4 Raphaël Quinet 2003-06-20 18:03:44 UTC
The fix is part of the stable release 1.2.4 (or earlier, hopefully). 
Closing this bug.