GNOME Bugzilla – Bug 75511
Possible g_warning in linc
Last modified: 2004-12-22 21:47:04 UTC
After my last attempt to log in to my laptop, whenever a client attempts to contact gconfd, it crashes. I'd had one successful login after my last update of the binary, so I don't /think/ the build itself is bad, but I'm not sure. Here is the stack trace from gconfd after an attempt to contact it from gnome-terminal2. If there is anything else you need, reply to the bug or find me on IRC. Thanks...

---

(gdb) run
Starting program: /usr/bin/gconfd-2
[New Thread 1024 (LWP 2839)]

Program received signal SIGSEGV, Segmentation fault.
[Stack trace attachment: Trace 19486, Thread 1024 (LWP 2839)]
I'd consider it really likely that this is a build issue, since I haven't changed gconf at all lately...
Rebuilt from jacob's 'clean' source rpms; still the same segfault. Is it possible some config file or something got borked on my install? strace seems to show a lot of 'too many open files' errors, but that's about it; no attempt to open any config files or anything /immediately/ before crashing, nor is anything else on the system suffering (AFAICT) from having too many open files.
Hmm, I also have some gconfd crashes in my logs, I see. But I haven't changed anything! And this backtrace is in ORBit before gconfd code is even reached (from an incoming CORBA request). I'm investigating now. Repeated gconfd crashes also seem to be the cause of huge saved_state files (since gconfd never gets to "compress" the log before it crashes again).
hp: yeah, I've got a 600K saved_state file. Is there anything else I can do to help debug? BTW, on a tip from jacob, I remembered to look at /tmp/orbit-louie/, which has 3500+ linc files, 1800+ of which were generated in the hour in which the crash occurred. Don't know quite how abnormal/normal that is, though. I'm cc:ing michael because he described the problem as 'fascinating', not because I expect him to actually solve anything (for once :)
Beautiful, I see the problem. This also explains the "huge saved_state file" issue a couple of people have reported.

gconfd opens each client in the saved_state file on startup. If the saved_state file grows to contain more IORs than the max number of open files, then CORBA_ORB_string_to_object() starts to fail. Relevant strace:

5298 connect(1021, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-70c2a2ccb09bf"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 1022
5298 fcntl64(0x3fe, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x3fe, 0x2, 0x1, 0x2) = 0
5298 connect(1022, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-504a198912334"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 1023
5298 fcntl64(0x3ff, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x3ff, 0x2, 0x1, 0x2) = 0
5298 connect(1023, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-504a198912334"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)

Reading gconfd code, gconfd handles this by getting a nil object back from string_to_object(), kicking a message to the logfile, and "losing" a client. Which is bad, but not catastrophic.

The crash comes as soon as we get a client request: apparently linc fails to handle an error code from accept(), ends up with a NULL GIOChannel, and segfaults (a rough sketch of this failure mode follows at the end of this comment):

5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 gettimeofday({1016580730, 697012}, NULL) = 0
5298 gettimeofday({1016580730, 697158}, NULL) = 0
5298 poll([{fd=5, events=POLLIN|POLLPRI, revents=POLLIN}], 1, 299259) = 1
5298 gettimeofday({1016580734, 601825}, NULL) = 0
5298 accept(5, 0xbffff1a0, [2]) = -1 EMFILE (Too many open files)
5298 --- SIGSEGV (Segmentation fault) ---

The saved_state file grows exponentially as gconfd crashes over and over, and stuff goes to hell big time.

Ugh. This is going to be hard to fix. For a long time I've wanted to make an architectural change where clients save their per-client state and the daemon is basically stateless (on daemon crash, clients resend their state to the daemon). That gets rid of saved_state. But it's a lot of work to rearrange, so I've been putting it off... Even then, it doesn't solve the fundamental problem that we can't ever handle more than 1000 or so clients. I don't know enough about network programming to know what, say, the X server does about that.

One last suspicious thing is that linc may be leaking file descriptors?
5298 connect(79, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-302a53e0438ae"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 80
5298 fcntl64(0x50, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x50, 0x2, 0x1, 0x2) = 0
5298 connect(80, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-4f810a4c15125"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 81
5298 fcntl64(0x51, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x51, 0x2, 0x1, 0x2) = 0
5298 connect(81, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-4f810a4c15125"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 82
5298 fcntl64(0x52, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x52, 0x2, 0x1, 0x2) = 0

There's an ECONNREFUSED on this client, so why is socket() called for each client that gets ECONNREFUSED, and where does the result of the socket() get closed?
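To make the question concrete: the trace above looks like a fresh socket() per attempt with no close() on the connect() failure path, which would leak one descriptor per dead client. A minimal sketch of that pattern and the missing close() (try_connect_unix() is a hypothetical helper for illustration, not linc code):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static int
try_connect_unix (const char *path)
{
        struct sockaddr_un sa;
        int fd;

        fd = socket (PF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        memset (&sa, 0, sizeof (sa));
        sa.sun_family = AF_UNIX;
        strncpy (sa.sun_path, path, sizeof (sa.sun_path) - 1);

        if (connect (fd, (struct sockaddr *) &sa, sizeof (sa)) < 0) {
                /* without this close(), each ECONNREFUSED burns one
                 * descriptor, which matches the strace above */
                close (fd);
                return -1;
        }
        return fd;
}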
Michael, I could use your input on exactly what happens in ORBit/linc here, and whether there's an fd leak.
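And here is the sketch of the accept() failure mode promised above, in plain POSIX/GLib terms (handle_incoming() and server_fd are hypothetical names, not linc's actual API): accept() can fail with EMFILE once the process is out of descriptors, and that error has to be handled before anything is wrapped in a GIOChannel.

#include <errno.h>
#include <sys/socket.h>
#include <glib.h>

static gboolean
handle_incoming (int server_fd)
{
        GIOChannel *channel;
        int fd;

        fd = accept (server_fd, NULL, NULL);
        if (fd < 0) {
                /* EMFILE, ECONNABORTED, etc.: bail out here instead of
                 * carrying an invalid fd forward, which is what produces
                 * the NULL GIOChannel and the SIGSEGV in the trace */
                g_warning ("accept failed: %s", g_strerror (errno));
                return TRUE;    /* keep watching the listen socket */
        }

        channel = g_io_channel_unix_new (fd);
        g_io_channel_set_close_on_unref (channel, TRUE);
        /* ... hand the channel over to the connection machinery ... */
        g_io_channel_unref (channel);   /* also closes fd */
        return TRUE;
}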
Sounds (1) exciting ;) and (2) like you have a grip on it. Michael tells me that removing my saved_state file will fix the problem and I'll be able to move from my desktop back to my laptop; will you need it? If so, I can keep a copy; otherwise, I'll just ditch it.
No, I have a bad saved_state file here; don't worry about it.

Michael, I wonder if the problem is that in linc_connection_initiate() we sometimes end up with (cnx != NULL && fd >= 0) after a failed initiation, leading to an fd leak, since the cleanup only closes the fd when cnx is NULL:

  if (!cnx && fd >= 0) {
          d_printf ("initiation failed\n");
          close (fd);
  }
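In other words, assuming a non-NULL cnx can come back without ever having taken ownership of the fd, something like the following guard would be needed (a sketch; the Connection type and its fd field are stand-ins, not linc's real structures):

#include <unistd.h>

typedef struct { int fd; } Connection;  /* stand-in for the connection object */

static void
cleanup_after_initiate (Connection *cnx, int fd)
{
        if (fd < 0)
                return;

        /* close on every failure path where the connection object
         * did not adopt the descriptor, not just when cnx is NULL */
        if (!cnx || cnx->fd != fd)
                close (fd);
}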
The bug causing the segfault is, I believe, that linc_server_accept_connection() does not initialize the "connection" variable when accept() fails, and then linc_server_handle_io() applies a G_OBJECT() cast to this uninitialized variable.
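If that reading is right, the shape of the fix is to initialize the out-parameter and test it before casting. A sketch with stand-in names (accept_connection() and server_handle_io() are illustrative, not the exact linc functions, and a bare GObject stands in for the connection object):

#include <sys/socket.h>
#include <unistd.h>
#include <glib-object.h>

static void
accept_connection (int server_fd, GObject **connection)
{
        int fd;

        *connection = NULL;          /* never leave the out-param dangling */

        fd = accept (server_fd, NULL, NULL);
        if (fd < 0)
                return;              /* EMFILE etc.: caller sees NULL */

        /* real code would construct a connection object around fd;
         * a bare GObject stands in for it in this sketch */
        *connection = g_object_new (G_TYPE_OBJECT, NULL);
        close (fd);                  /* this sketch does not keep the fd */
}

static gboolean
server_handle_io (int server_fd)
{
        GObject *connection = NULL;  /* was uninitialized in the bug */

        accept_connection (server_fd, &connection);
        if (!connection) {
                g_warning ("failed to accept a connection");
                return TRUE;         /* don't cast or touch it when NULL */
        }

        g_object_unref (G_OBJECT (connection));
        return TRUE;
}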
I just found another bug: the fd leak causes settings to be lost, because when xmlParseFile() fails, gconf assumes the XML document was corrupt and moves it aside. So I need to distinguish a parse error from an I/O error in xmlParseFile(), but I have no idea how I'd do that; the API doesn't seem to allow it. In fact, error reporting in libxml worries me quite a bit. I'll ask Daniel.
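One possible workaround (an assumption about how to approach it, not what gconf actually does): probe the file with open() before parsing, so that descriptor exhaustion and other I/O failures can be told apart from genuinely corrupt XML. load_xml() and LoadResult are hypothetical names:

#include <fcntl.h>
#include <unistd.h>
#include <libxml/parser.h>

typedef enum { LOAD_OK, LOAD_IO_ERROR, LOAD_PARSE_ERROR } LoadResult;

static LoadResult
load_xml (const char *filename, xmlDocPtr *doc_out)
{
        int fd = open (filename, O_RDONLY);

        if (fd < 0)
                /* EMFILE, EACCES, ENOENT, ...: an I/O problem, so do
                 * NOT treat the document as corrupt and move it aside */
                return LOAD_IO_ERROR;
        close (fd);

        *doc_out = xmlParseFile (filename);
        if (*doc_out == NULL)
                /* the file was readable, so this really is bad XML */
                return LOAD_PARSE_ERROR;

        return LOAD_OK;
}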
I'll take a look at the linc issues. As for the XML issues: perhaps there is no better time to switch to a push parser, and preferably the SAX interface, as in librsvg ;-)
Fixed the linc sillies - thanks for the report.
I don't quite see how you could get 3500+ linc files; that's real badness. I've just fixed a couple of places where these might leak, but not at anything like the order needed to create that many, I think.
Looking at linc CVS, I think linc_server_handle_io() will still get a warning, since IIRC we didn't change the G_OBJECT() cast to allow NULL; I could be wrong. I'm moving the bug to linc so you can check that to your satisfaction; it's fine to close the bug afterward.
Just fixed the warning and did a new release for you.
*** Bug 75504 has been marked as a duplicate of this bug. ***