GNOME Bugzilla – Bug 498509
Insufficient unreferencing of async context?
Last modified: 2008-10-01 17:38:56 UTC
Please describe the problem: Maybe this is related to bug 494173. I start my async client session in a dedicated main loop with a dedicated GMainContext. There can be several simultaneous sessions, and some of them have a limited lifespan, so they must be clearly separated. After a while, my app crashes with too many open fds. This turns out to be due to the pipe pair of the GMainContext not being closed, ergo the GMainContext is not destroyed at the end of my session's lifecycle. Further debugging reveals this: 1. src->context = g_main_context_new (); // context refcount = 1 2. src->loop = g_main_loop_new (src->context, TRUE); // context refcount = 2 3. src->session = soup_session_async_new_with_options ( SOUP_SESSION_ASYNC_CONTEXT, src->context, NULL); // context refcount = 3 // session refcount = 1 4. src->msg = soup_message_new (SOUP_METHOD_GET, src->location); 5. g_signal_connect (src->msg, "got_chunk", G_CALLBACK (soup_got_chunk), src); 6. soup_message_set_flags (src->msg, SOUP_MESSAGE_OVERWRITE_CHUNKS); 7. soup_session_queue_message (src->session, src->msg, soup_response, src); // context refcount = 5 // session refcount = 2 8. g_object_unref (src->session); // session refcount = 1 9. g_main_loop_run (src->loop); // data is received. In got_chunk callback: 10. g_main_loop_quit (src->loop); 11. soup_message_io_pause (msg); // back in main thread: 12. soup_message_io_unpause (src->msg); 13. g_main_loop_run (src->loop); // lather, rinse, repeat until clean // main thread detects that Content-Length bytes have been read, // or there is an external interrupt forcing the session to stop. // context refcount = 5 // session refcount = 1 14. soup_session_abort (src->session); /* This unrefs the message. */ // context refcount = 5 // session refcount = 1 15. g_object_unref (src->session); // context refcount = 4 // session refcount = 0 16. g_main_loop_unref (src->loop); // context refcount = 3 17. g_main_context_unref (src->context); // context refcount = 2 If the session is aborted before step 7 (queueing the message) is reached, teardown occurs correctly, i.e. the context refcount drops to 0 and the pipe pair is closed. Steps to reproduce: Actual results: Expected results: Does this happen every time? Other information:
Created attachment 99403 [details] [review] patch I think this fixes it, but none of the test cases currently tests using a non-NULL async context, so I need to write a new test before committing. But let me know if it works for you.
Should be fixed (and regression tested) in svn now. This will go out with libsoup 2.2.104 as part of GNOME 2.20.2 next week. Thanks for the bug report. (btw, the reason valgrind was being unhelpful is because it tries to merge related leaks together based on their backtraces, but the default level [--leak-resolution=low] seems to end up merging *unrelated* leaks together, causing some of them to be hidden. --leak-resolution=med seems to work better.)
Created attachment 99431 [details] [review] Add a new test.
Note on comment 0: step 15 is a bug. I should not be unreffing the session there. I'm not sure this is fixed. I've added a test that isolates my use case from GStreamer. It exposes the refcount of the GMainContext object. Messy and dangerous, but it's for debug purposes only. Run with '-d' for debug info. Run with -'c' to let the server send the Content-Length header. Because of the 'got-body' signal not being emitted (fixed in bug 452280), I had code to abort the session when all the data arrived. Run with '-w' to unref the session and the GMainContext until their refcount is 0; useful for comparing Valgrind results. What I see is the following: - When running with '-c', session_abort() called before 'got-body' arrives. The session refcount is one higher than without -c, and does not fall to 0 after session_abort() and unreffing the session. The context refcount is 3 at the end of the test. - Even when running without '-c', the context refcount is 1 at the end.
Created attachment 99433 [details] [review] Add a new test. Oops; inverted -w logic.
fixed; the problem was that if you finalized a SoupMessage before it emitted "finished", it would leak the I/O state (which includes a reference to the socket, which includes a reference to the async_context).
Created attachment 119639 [details] [review] Extension of context-test to test for dangling ref in SoupAsyncSession
I'd like to reopen this bug. In libsoup-2.4, I'm still observing the problem. I have patched the context-test (attachment #119639 [details]) to also check the ref_count of the session. Note that if the call to soup_session_abort() is removed, the ref_count is correct. In soup_session_async::do_idle_run_queue, the session is reffed. However, if the session is aborted before the idle callback happens, it is never unreffed. I'm working on a patch that registers a DestroyNotify function for the idle callback.
Created attachment 119642 [details] [review] Fix for dangling ref in SoupAsyncSession and extension of context-test to test for it
The fix is even simpler than expected: since 2.12, glib has the g_source_is_destroyed () function, which makes the trick with the reffing and unreffing and weak pointer unnecessary. Patch is against 2.4.1, not trunk. I hope that's fine.
(In reply to comment #8) > I'd like to reopen this bug. In general, it's better to file a new bug than to add more comments to a year-old bug. (Because that lets the developers decide how to handle it; bugzilla lets you merge two bugs together, but doesn't let you easily split two bugs apart.) Anyway (In reply to comment #10) > The fix is even simpler than expected: since 2.12, glib has the > g_source_is_destroyed () function, which makes the trick with the reffing and > unreffing and weak pointer unnecessary. That doesn't seem like it should work... if idle_run_queue() is getting run, then it would do the unref of the session, and so there's no dangling ref. And likewise, if idle_run_queue() is getting run, then the idle source can't have been destroyed, can it? At any rate, the patch on bug 533473 complicates this code a bit more and would break your patch. I can confirm that the refcount of the session is wrong after the abort call though, so I will look in to this.
(In reply to comment #11) > (In reply to comment #10) > > The fix is even simpler than expected: since 2.12, glib has the > > g_source_is_destroyed () function, which makes the trick with the reffing and > > unreffing and weak pointer unnecessary. > > That doesn't seem like it should work... if idle_run_queue() is getting run, > then it would do the unref of the session, and so there's no dangling ref. And > likewise, if idle_run_queue() is getting run, then the idle source can't have > been destroyed, can it? As far as I understand, it is possible that a source still triggers after it has been destroyed. "Destroyed", in that case, does not equal "freed". You're right, though, that the fix doesn't work. It is still possible that the async_context (and therefore the idle source) is kept alive externally even after the session has been aborted, unreffed and freed, so the g_source_is_destroyed() function is no guarantee that the session still exists. > At any rate, the patch on bug 533473 complicates this code a bit more and would > break your patch. ACK. The solution would therefore be to use a DestroyNotify after all. Alternatively, a more general solution could be to use g_object_weak_ref() to register a cancellation callback (g_source_destroy()) for all the situations where weak pointers are used now. But that basically boils down to the same thing. Note that in that case the g_source_is_destroyed() check still needs to be done (I think). By the way, I wonder if attachment 118879 [details] [review] of bug 533473 mightn't create a similar situation. Isn't it possible that the connection is destroyed/aborted without call to got_connection, leaving a dangling ref to the session?
fixed in trunk; I changed soup-session-async to keep track of the idle_run_queue's GSource, and destroy it when finalizing the session, rather than having do_idle_run_queue ref the session. I'm not sure why i didn't do it that way originally. (In reply to comment #12) > By the way, I wonder if attachment 118879 [details] [review] [edit] of bug 533473 mightn't create a > similar situation. Isn't it possible that the connection is destroyed/aborted > without call to got_connection, leaving a dangling ref to the session? No. No one except the session ever sees the connection object, so the only way it could be destroyed is if the session is destroyed, and that can't happen while we're holding an extra ref on it. And aborting the session would cause the soup connection attempt to fail, resulting it it calling the callback. So it's all good. (Or at least, if it's not, I don't want to fix it, because the SoupConnection stuff is all getting rewritten to be less messy anyway :)
Unfortunately, the original problem persists: when the main loop is quit after a connection is initiated but before the connection is established, there is a dangling ref to the session. I think it is caused by the g_object_ref in SoupSessionAsync::run_queue(). The main_loop is stopped before got_connection is called. This is because soup_socket_connect_async() goes into CONTINUE mode, but the connect_watch is never triggered. I'm not entirely sure why not.