GNOME Bugzilla – Bug 75511
Possible g_warning in linc
Last modified: 2004-12-22 21:47:04 UTC
After my last attempt to log in to my laptop, whenever a client attempts to contact gconfd, it crashes. I'd had one successful login after my last update of the binary, so I don't /think/ the build itself is bad, but I'm not sure. Here is the stack trace from gconfd after an attempt to contact it from gnome-terminal2. If there is anything else you need, reply to the bug or find me on IRC. Thanks...

---

(gdb) run
Starting program: /usr/bin/gconfd-2
[New Thread 1024 (LWP 2839)]

Program received signal SIGSEGV, Segmentation fault.
[Stack trace attachment: Trace 19486, Thread 1024 (LWP 2839)]
I'd consider it really likely that this is a build issue, since I haven't changed gconf at all lately...
Rebuilt from jacob's 'clean' source rpms; still the same segfault. Is it possible some config file or something got borked on my install? strace seems to show a lot of 'too many open files' errors, but that's about it; no attempt to open any config files or anything /immediately/ before crashing, nor is anything else on the system suffering (AFAICT) from having too many open files.
Hmm, I also have some gconfd crashes in my logs, I see. But I haven't changed anything! And this backtrace is in ORBit before gconfd code is even reached (from an incoming CORBA request). I'm investigating now. Repeated gconfd crashes also seem to be the cause of huge saved_state files (since gconfd never gets to "compress" the log before it crashes again).
hp: yeah, I've got a 600K saved_state file. Is there anything else I can do to help debug? BTW, on a tip from jacob, I remembered to look at /tmp/orbit-louie/, which has 3500+ linc files, 1800+ of which were generated in the hour in which the crash occurred. Don't know quite how abnormal/normal that is, though. I'm cc:ing michael because he described the problem as 'fascinating', not because I expect him to actually solve anything (for once :)
Beautiful, I see the problem. This also explains the "huge saved_state file" issue a couple of people have reported.

gconfd opens each client in the saved_state file on startup. If the saved_state file grows to contain more IORs than the max number of open files, then CORBA_ORB_string_to_object() starts to fail. Relevant strace:

5298 connect(1021, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-70c2a2ccb09bf"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 1022
5298 fcntl64(0x3fe, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x3fe, 0x2, 0x1, 0x2) = 0
5298 connect(1022, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-504a198912334"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 1023
5298 fcntl64(0x3ff, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x3ff, 0x2, 0x1, 0x2) = 0
5298 connect(1023, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-504a198912334"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)

Reading gconfd code, gconfd handles this by getting a nil object back from string_to_object(), kicking a message to the logfile, and "losing" a client. Which is bad, but not catastrophic.

The crash comes as soon as we get a client request: apparently linc fails to handle an error code from accept(), ends up with a NULL GIOChannel, and segfaults (a rough sketch of this failure mode follows at the end of this comment):

5298 socket(PF_UNIX, SOCK_STREAM, 0) = -1 EMFILE (Too many open files)
5298 gettimeofday({1016580730, 697012}, NULL) = 0
5298 gettimeofday({1016580730, 697158}, NULL) = 0
5298 poll([{fd=5, events=POLLIN|POLLPRI, revents=POLLIN}], 1, 299259) = 1
5298 gettimeofday({1016580734, 601825}, NULL) = 0
5298 accept(5, 0xbffff1a0, [2]) = -1 EMFILE (Too many open files)
5298 --- SIGSEGV (Segmentation fault) ---

The saved_state file grows exponentially as gconfd crashes over and over, and stuff goes to hell big time.

Ugh. This is going to be hard to fix. For a long time I've wanted to make an architectural change where clients save their per-client state and the daemon is basically stateless (on daemon crash, clients resend their state to the daemon). That gets rid of saved_state. But it's a lot of work to rearrange, so I've been putting it off... Even then, it doesn't solve the fundamental problem that we can't ever handle more than 1000 or so clients. I don't know enough about network programming to know what, say, the X server does about that.

One last suspicious thing is that linc may be leaking file descriptors?
5298 connect(79, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-302a53e0438ae"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 80
5298 fcntl64(0x50, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x50, 0x2, 0x1, 0x2) = 0
5298 connect(80, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-4f810a4c15125"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 81
5298 fcntl64(0x51, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x51, 0x2, 0x1, 0x2) = 0
5298 connect(81, {sin_family=AF_UNIX, path="/tmp/orbit-hp/linc-4f810a4c15125"}, 34) = -1 ECONNREFUSED (Connection refused)
5298 socket(PF_UNIX, SOCK_STREAM, 0) = 82
5298 fcntl64(0x52, 0x4, 0x800, 0x4) = 0
5298 fcntl64(0x52, 0x2, 0x1, 0x2) = 0

There's an ECONNREFUSED on this client, so why is socket() called for each client that gets ECONNREFUSED, and where does the result of the socket() get closed?
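To make the question concrete: the trace above looks like a fresh socket() per attempt with no close() on the connect() failure path, which would leak one descriptor per dead client. A minimal sketch of that pattern and the missing close() (try_connect_unix() is a hypothetical helper for illustration, not linc code):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static int
try_connect_unix (const char *path)
{
        struct sockaddr_un sa;
        int fd;

        fd = socket (PF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        memset (&sa, 0, sizeof (sa));
        sa.sun_family = AF_UNIX;
        strncpy (sa.sun_path, path, sizeof (sa.sun_path) - 1);

        if (connect (fd, (struct sockaddr *) &sa, sizeof (sa)) < 0) {
                /* without this close(), each ECONNREFUSED burns one
                 * descriptor, which matches the strace above */
                close (fd);
                return -1;
        }
        return fd;
}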
Michael, I could use your input on exactly what happens in ORBit/linc here, and whether there's an fd leak.
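And here is the sketch of the accept() failure mode promised above, in plain POSIX/GLib terms (handle_incoming() and server_fd are hypothetical names, not linc's actual API): accept() can fail with EMFILE once the process is out of descriptors, and that error has to be handled before anything is wrapped in a GIOChannel.

#include <errno.h>
#include <sys/socket.h>
#include <glib.h>

static gboolean
handle_incoming (int server_fd)
{
        GIOChannel *channel;
        int fd;

        fd = accept (server_fd, NULL, NULL);
        if (fd < 0) {
                /* EMFILE, ECONNABORTED, etc.: bail out here instead of
                 * carrying an invalid fd forward, which is what produces
                 * the NULL GIOChannel and the SIGSEGV in the trace */
                g_warning ("accept failed: %s", g_strerror (errno));
                return TRUE;    /* keep watching the listen socket */
        }

        channel = g_io_channel_unix_new (fd);
        g_io_channel_set_close_on_unref (channel, TRUE);
        /* ... hand the channel over to the connection machinery ... */
        g_io_channel_unref (channel);   /* also closes fd */
        return TRUE;
}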
Sounds (1) exciting ;) and (2) like you have a grip on it. Michael tells me that removing my saved_state file will fix the problem and I'll be able to move from my desktop back to my laptop; will you need it? If so, I can keep a copy; otherwise, I'll just ditch it.
No, I have a bad saved_state file here; don't worry about it.

Michael, I wonder if the problem is that in linc_connection_initiate() we sometimes end up with (cnx != NULL && fd >= 0) after a failed initiation, leading to an fd leak, since the cleanup only closes the fd when cnx is NULL:

  if (!cnx && fd >= 0) {
          d_printf ("initiation failed\n");
          close (fd);
  }
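In other words, assuming a non-NULL cnx can come back without ever having taken ownership of the fd, something like the following guard would be needed (a sketch; the Connection type and its fd field are stand-ins, not linc's real structures):

#include <unistd.h>

typedef struct { int fd; } Connection;  /* stand-in for the connection object */

static void
cleanup_after_initiate (Connection *cnx, int fd)
{
        if (fd < 0)
                return;

        /* close on every failure path where the connection object
         * did not adopt the descriptor, not just when cnx is NULL */
        if (!cnx || cnx->fd != fd)
                close (fd);
}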
The bug causing the segfault is, I believe, that linc_server_accept_connection() does not initialize the "connection" variable when accept() fails, and then linc_server_handle_io() applies a G_OBJECT() cast to this uninitialized variable.
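If that reading is right, the shape of the fix is to initialize the out-parameter and test it before casting. A sketch with stand-in names (accept_connection() and server_handle_io() are illustrative, not the exact linc functions, and a bare GObject stands in for the connection object):

#include <sys/socket.h>
#include <unistd.h>
#include <glib-object.h>

static void
accept_connection (int server_fd, GObject **connection)
{
        int fd;

        *connection = NULL;          /* never leave the out-param dangling */

        fd = accept (server_fd, NULL, NULL);
        if (fd < 0)
                return;              /* EMFILE etc.: caller sees NULL */

        /* real code would construct a connection object around fd;
         * a bare GObject stands in for it in this sketch */
        *connection = g_object_new (G_TYPE_OBJECT, NULL);
        close (fd);                  /* this sketch does not keep the fd */
}

static gboolean
server_handle_io (int server_fd)
{
        GObject *connection = NULL;  /* was uninitialized in the bug */

        accept_connection (server_fd, &connection);
        if (!connection) {
                g_warning ("failed to accept a connection");
                return TRUE;         /* don't cast or touch it when NULL */
        }

        g_object_unref (G_OBJECT (connection));
        return TRUE;
}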
I just found another bug: the fd leak causes settings to be lost, because when xmlParseFile() fails, gconf assumes the XML document was corrupt and moves it aside. So I need to distinguish a parse error from an I/O error in xmlParseFile(), but I have no idea how I'd do that; the API doesn't seem to allow it. In fact, error reporting in libxml worries me quite a bit. I'll ask Daniel.
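One possible workaround (an assumption about how to approach it, not what gconf actually does): probe the file with open() before parsing, so that descriptor exhaustion and other I/O failures can be told apart from genuinely corrupt XML. load_xml() and LoadResult are hypothetical names:

#include <fcntl.h>
#include <unistd.h>
#include <libxml/parser.h>

typedef enum { LOAD_OK, LOAD_IO_ERROR, LOAD_PARSE_ERROR } LoadResult;

static LoadResult
load_xml (const char *filename, xmlDocPtr *doc_out)
{
        int fd = open (filename, O_RDONLY);

        if (fd < 0)
                /* EMFILE, EACCES, ENOENT, ...: an I/O problem, so do
                 * NOT treat the document as corrupt and move it aside */
                return LOAD_IO_ERROR;
        close (fd);

        *doc_out = xmlParseFile (filename);
        if (*doc_out == NULL)
                /* the file was readable, so this really is bad XML */
                return LOAD_PARSE_ERROR;

        return LOAD_OK;
}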
I'll take a look at the linc issues. As for the XML issues: perhaps there is no better time to switch to a push parser, and preferably the SAX interface, as in librsvg ;-)
Fixed the linc sillies - thanks for the report.
I don't quite see how you could get 3500+ linc files; that's real badness. I've just fixed a couple of places where these might leak, but not at anything like the order needed to create that many, I think.
Looking at linc CVS, I think linc_server_handle_io() will still get a warning, since IIRC we didn't change the G_OBJECT() cast to allow NULL; I could be wrong. I'm moving the bug to linc so you can check that to your satisfaction; it's fine to close the bug afterward.
Just fixed the warning and did a new release for you.
*** Bug 75504 has been marked as a duplicate of this bug. ***