GNOME Bugzilla – Bug 329454
Ekiga freezes in config dialog when accessibility is on
Last modified: 2011-08-09 19:38:16 UTC
When I turn accessibility on in Gnome configuration and run ekiga-snapshot (version 1.99.0-20060125-02) for the first time (such that the configuration dialog appears), then on the 5th step of the configuration dialog (after skipping the free registration by saying ``no'') which deals with NAT configuration, the application freezes and I have to kill it. This only happens with accessibility turned on, when the GTK accessibility module gets loaded. Additionally, when I do this while running Gnopernicus and Gnome Speech, the speech halts and festival-synthesis-driver takes 100% CPU. I don't know how closely is this related, but it happens in the same moment. I'm including a backtrace bellow (with some symbols unresolved, unfortunatelly). However, the bug is very well reproducible. It was independently found on two different machines by two different users, one of them being me. It happens every time in my configuration. So I think it could be a good opportunity to fix a bug. Steps to reproduce: 1) Install Ekiga (formerly gnome-meeting) 2) Create a new user, login to Gnome under this user (Important since when running the Ekiga configuration druid from Ekiga menu, this freeze does not happen -- I think because Ekiga already stores some configuration. I stress you must have an unconfigured instance of Ekiga to be able to reproduce this.) 3) Turn on accessibility in preferences, relogin 4) Run Ekiga, it should display the configuration guide 5) Proceed with the configuration, skip the offered registration 6) At step 5 (NAT discovery), the application freezes and you are no longer able to push any button GDB backtrace of Ekiga: 0xb675a1a9 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 (gdb) bt
+ Trace 65814
If you need more information, please indicate how can I get them and I'll quickly send them back to you.
Hi Hanke: <standard stack trace improvement request> Thanks for the bug report. Unfortunately, that stack trace is not very useful in determining the cause of the crash. Can you get us one with debugging symbols? Please see http://live.gnome.org/GettingTraces for more information on how to do so. </standard stack trace improvement request> Actually stack traces don't help much for freeze bugs such as this. Are you using the latest glib and ORBit2? I think this bug needs to be reassigned to Ekiga for more evaluation, there's no indication of where things are hanging. Maybe a truss output would help?
Hmmmm... on that page, ekiga launches a firewall detection in a popup with a progress bar. Could it explain the freeze ?
progress bar.... could Ekiga be calling progress-bar-update APIs unnecessarily (i.e. too often) during this process? This could cause the progress bar to emit an enormous stream of RPC notifications...
> Actually stack traces don't help much for freeze bugs such as this. Are you > using the latest glib and ORBit2? Not CVS. I tried it with Orbit2 version 2.12.4 and glib 2.8.6 from Debian testing. Is this ok? > I think this bug needs to be reassigned to Ekiga for more evaluation, there's > no indication of where things are hanging. Maybe a truss output would help? I'm attaching strace output. > progress bar.... could Ekiga be calling progress-bar-update APIs unnecessarily > (i.e. too often) during this process? This could cause the progress bar to > emit an enormous stream of RPC notifications... This might be possible. If I disable accessibility and look at what would happen normally, I see it is not an ordinary progress bar but rather there is a bar going right and left to indicate activity. However, with accessibility on, I don't even see the progress bar window popup. I still think it should be reproducible. As I said, it happened on two different computers (whose configuration and library versions are likely different).
Created attachment 58546 [details] Stack trace up to the freeze I don't know if it is useful. It shows the same as the GDB trace. It stops in a mutex.
Here is a complete backtrace from GDB including the other threads too. Thread 2 looks somehow suspicous. Unfortunatelly, I didn't get more information about thread one. I might try to install libavahi-glib from sources if necessary. [Switching to thread 1 (Thread -1242863936 (LWP 11994))]#0 0xb683a1a9 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
+ Trace 65834
*** Bug 330647 has been marked as a duplicate of this bug. ***
Reassigning to correct module. It is not an Ekiga bug as Ekiga works without accessibility turned on. The only specificity is that the dialog is displayed from a threads. So I guess it is a threading problem in ATK.
*** Bug 327889 has been marked as a duplicate of this bug. ***
"Reassigning to correct module. It is not an Ekiga bug as Ekiga works without accessibility turned on." That's actually not necessarily a true statement - accessibility exposes application bugs fairly frequently. I suspect however that the problem does have to do with an Ekiga thread exiting, which is why gnome_accessibility_module_shutdown is appearing in the stack trace. If an app has more than one thread that posts gtk+ widgets in a gui, and one of those threads exits, then bad things are indeed likely to happen.
"That's actually not necessarily a true statement - accessibility exposes application bugs fairly frequently." Indeed, but I do not think it is the case here. "I suspect however that the problem does have to do with an Ekiga thread exiting, which is why gnome_accessibility_module_shutdown is appearing in the stack trace." That is absolutely correct. A thread is run, is doing the detection, and then displays the result from the detection before exiting. "If an app has more than one thread that posts gtk+ widgets in a gui, and one of those threads exits, then bad things are indeed likely to happen." So is it a limitation of ATK?
Not so much ATK, probably, as at-spi. While I don't quite see a "smoking gun" here yet, I think at-spi is probably the component in need of modification, if indeed the problem is support for exiting threads in apps which use more than one thread for gtk+ access. Damien, are you seeing this with ORBit2 HEAD, by the way? I recall Michael having put some important ORBit2 fixes in regarding multi-mainloop/thread support this year, and want to rule that area of investigation out for the time being, if possible. Frankly I don't understand what it waiting for who, in the traces above, so I'll need some help figuring out why the ORB is stuck waiting. (Please remember to add the "accessibility" keyword to all a11y bugs, otherwise they have limited visibility - thanks!)
I'm not able to test with ORBIT head (I would have to compile too many components by myself and I'm already too late on many aspects to be able to sync ekiga 2.00 and GNOME 2.13.), but perhaps the original reporter can try?
I am sorry, but I don't have any of the infrastructure setup here, nor the time at the moment. I am only beta-testing ekiga here, because Damien made it soooo incredibly easy with the daily deb packages (;
*** Bug 331296 has been marked as a duplicate of this bug. ***
I've done some test with 2.13 (HEAD) branch of GNOME : it is a little better : ekiga doesn't freeze when the "testing "NAT dialog appears. But freeze occurs when the next dialog "do you want to enable STUN support" popups. Stacktrace : (gdb) thread apply all bt
+ Trace 66263
Thread 2 (Thread -1269670992 (LWP 30694))
It actually freezes much more often than in the config dialog, when accessibility is on. The result is it is unusable with accessibility switched on in Gnome. However, these other crashes are not so specificly reproducible (probably because there are many possible crash situations) as the one in the configuration dialog which always occurs at the same place.
None of these stack traces have enough info to tell anything about at-spi here; the one that looks the oddest, which includes at-spi code, doesn't have line numbers (thread 2 of comment #6) and gnome_accessibility_module_shutdown is appears twice... basically the stack traces don't make sense!
better stacktrace, obtained on 2.14 branch, by pressing "test NAT" in the wizard, and after NAT test is complete and "enable STUN dialog" is displayed, program is freezed :
+ Trace 66657
Thread 11 (Thread -1276626000 (LWP 3839))
*** Bug 333466 has been marked as a duplicate of this bug. ***
I think this should be reopened now with the new info in comment #19 and #20
*** Bug 335838 has been marked as a duplicate of this bug. ***
Apologies for spam... ensuring Sun a11y folks are cc'ed on all current accessibility bugs.
IS this still happenning with at-spi-1.7.6 and later? Micheal's change to at-spi/atk-bridge/bridge.c (version 1.86 of that file) may have fixed this...
still freezing with at-spi 1.7.7 :(
I would like to know why the reporter changed from NEEDINFO to UNCONFIRMED without comment... did the problem magically disappear ? I must say that it annoys me greatly that ekiga doesn't work correctly with a11y : after all, if someone has sight issues for example, what better way to keep in touch with people than with voice !?
(In reply to comment #26) > I would like to know why the reporter changed from NEEDINFO to UNCONFIRMED > without comment... did the problem magically disappear ? > > I must say that it annoys me greatly that ekiga doesn't work correctly with > a11y : after all, if someone has sight issues for example, what better way to > keep in touch with people than with voice !? Hi, I just observed Frederic already provided the information which was needed in comment #25, so I reopened the bug. I thought it was obvious, but probably should have filled in a separate comment anyway, sorry.
Snark said: >I must say that it annoys me greatly that ekiga doesn't work correctly with >a11y : after all, if someone has sight issues for example, what better way to >keep in touch with people than with voice !? NO need to be snarky, snark :-) Unless you want to help us fix this. Ekiga does a bunch of stuff that most other gtk+ programs don't do, for instance it uses threading heavily, etc. This means it exposes new and difficult issues that we haven't had to deal with before. While configuration is important, it's not as bad as freezing during normal operation, since at least there are ugly workarounds available (for instance turning off a11y and having a sighted user configure ekiga). These workarounds are of course not acceptable long-term, but our resources are very limited where debugging these issues are concerned. If you wanted to get more involved in debugging this, that would be much appreciated and would move us forward to a solution faster.
Well, I'm not "snarky" (whatever that means) ; I was just worried to see the state of the bug just change. I don't know exactly what you would need to debug this... what "bunch of stuff" does ekiga do which other programs don't, apart from the threading (which annoys me too) ?
I had a look at this using GNOME 2.16 on Solarix x86. When I start ekiga the first configuration assistant runs. When I reach page 5 NAT Type, a dialog box with Detection in Progress appears. Eventually this stops. At this time the system is hung. Killing at-spi-regiustryd in an ssh session makes the system usable again. I have pasted stack trace for ekiga and at-spi-registryd below. 25110: ekiga ----------------- lwp# 1 / thread# 1 -------------------- fef476cb lwp_park (0, 0, 0) fef40814 slow_lock (fadf2000, 819b0b0, 0) + 40 fef4096b mutex_lock_impl (819b0b0, 0) + 14b fef40a52 mutex_lock (819b0b0) + 1d fee0bd0a gdk_threads_impl_lock (81bb390, fe981b24, 8046a1c, fadf2000, 8046a1c, fe914493) + 36 fee34ed6 gdk_event_check (81bb390) + 22 fe914493 g_main_context_check (81bb3d8, 7fffffff, 8546228, 5) + 1a3 fe914b11 g_main_context_iterate (81bb3d8, 1, 1, 819eb08) + 3bd fe915124 g_main_loop_run (8570758) + 1b8 feb5a2ba gtk_main (8046cf0, 8046b88, feffa7c0, 80bf17d, fe8262e0, 81188db) + b2 080b2391 main (1, 8046bcc, 8046bd4) + 485 080907fa _start (1, 8046d58, 0, 8046d5e, 8046d75, 8046d9c) + 7a ----------------- lwp# 2 / thread# 2 -------------------- fef476cb lwp_park (0, f9c3be38, 0) fef41902 cond_wait_queue (823cbf8, 823cbe0, f9c3be38, 0) + 3e fef41c97 cond_wait_common (823cbf8, 823cbe0, f9c3be38) + 1e1 fef41ebc _cond_timedwait (823cbf8, 823cbe0, f9c3beb8) + 4a fef41f4b cond_timedwait (823cbf8, 823cbe0, f9c3beb8) + 27 fef41f88 pthread_cond_timedwait (823cbf8, 823cbe0, f9c3beb8) + 21 fe768fbc __1cKPSyncPointEWait6MrknNPTimeInterval__i_ (823cbd0, f9c3bf04) + 9c fdb379b7 __1cLOpalManagerLGarbageMain6MrnHPThread_i_v_ (823c6c0, 823ddb0, 0) + 5b fdb38c59 __1cLOpalManagerVGarbageMain_PNotifierECall6kMrnHPObject_i_v_ (823d600, 823ddb0, 0) + 1d fe77f030 __1cNPSimpleThreadEMain6M_v_ (823ddb0, 0, fadf2400, fef8f000, 0, 0) + 40 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (823ddb0) + 85 fef47434 _thr_setup (fadf2400) + 52 fef47690 _lwp_start (fadf2400, 0, 0, 0, 0, 0) ----------------- lwp# 3 / thread# 3 -------------------- fef476cb lwp_park (0, 0, 0) fef40814 slow_lock (f9a20000, 819b0b0, 0) + 40 fef4096b mutex_lock_impl (819b0b0, 0) + 14b fef40a52 mutex_lock (819b0b0) + 1d fee0bd0a gdk_threads_impl_lock (f9b37cbc, 8615208, f9b37cec, f9b37cec, f9b37cec, 80b2740) + 36 fee0bcad gdk_threads_enter (823cd04, 1, f9b37d04, 0, fed90ba0, 3) + 21 080b2740 __1cbAgnomemeeting_threads_enter6F_v_ (823cd04, 1, 823ccf4, f9b37d98, 8167e30, 8615178) + 58 080c8d67 __1cJGMManagerLOnIPChanged6MrnGPTimer_i_v_ (823c6c0, 823ccf4, 1) + 4b 080cdec5 __1cJGMManagerVOnIPChanged_PNotifierECall6kMrnHPObject_i_v_ (823e080, 823ccf4, 1) + 1d fe77c0d7 __1cGPTimerJOnTimeout6M_v_ (823ccf4) + 4f fe77c1ad __1cGPTimerHProcess6MrknNPTimeInterval_r1_v_ (823ccf4, f9b37ec0, f9b37ea8) + cd fe77c43f __1cKPTimerListHProcess6M_nNPTimeInterval__ (f9b37f38, 8163a68) + 17f fe7662b9 __1cTPHouseKeepingThreadEMain6M_v_ (823dfd8, 0, f9a20000, fef8f000, 0, 0) + 39 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (823dfd8) + 85 fef47434 _thr_setup (f9a20000) + 52 fef47690 _lwp_start (f9a20000, 0, 0, 0, 0, 0) ----------------- lwp# 11 / thread# 11 -------------------- fef480b7 pollsys (f59fdae0, 2, 0, 0) fef03fde pselect (2b, 84f02c0, 84f02e8, 84f0310, 0, 0) + 19e fef042ee select (2b, 84f02c0, 84f02e8, 84f0310, 0) + 7e fe755673 __1cHPSocketGSelect6Frn0AKSelectList_22rknNPTimeInterval__nIPChannelGErrors__ (f59fddcc, f59fdd78, f59fdd84, f59fdef0) + fb fe772529 __1cHPSocketGSelect6Frn0AKSelectList_rknNPTimeInterval__nIPChannelGErrors__ (f59fddcc, f59fdef0) + 65 fdb6568d __1cPOpalListenerUDPGAccept6MrknNPTimeInterval__pnNOpalTransport__ (84f1480, f59fdef0) + 125 fdb632ac __1cMOpalListenerUListenForConnections6MrnHPThread_i_v_ (84f1480, 84f1a08, 0) + 194 fdb6c442 __1cMOpalListenerbEListenForConnections_PNotifierECall6kMrnHPObject_i_v_ (84f1f70, 84f1a08, 0) + 1a fe77f030 __1cNPSimpleThreadEMain6M_v_ (84f1a08, 0, f9a21000, fef8f000, fef478ae, 0) + 40 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (84f1a08) + 85 fef47434 _thr_setup (f9a21000) + 52 fef47690 _lwp_start (f9a21000, 0, 0, 0, 0, 0) ----------------- lwp# 16 / thread# 16 -------------------- fef476cb lwp_park (0, 0, 0) fef41902 cond_wait_queue (85db548, 86026c8, 0, 0) + 3e fef41ded _cond_wait (85db548, 86026c8) + 69 fef41e2b cond_wait (85db548, 86026c8) + 24 fef41e65 pthread_cond_wait (85db548, 86026c8) + 1e fe9af67d giop_recv_buffer_get (f56fcaa8) + c5 fe9b2b95 ORBit_small_invoke_stub (8232b60, f9df0480, 0, f56fcb9c, 0, f9cd4fe0) + 139 fe9b2a3d ORBit_small_invoke_stub_n (8232b60, f9df0460, 0, 0, f56fcb9c, 0) + 3d fe9c41ea ORBit_c_stub_invoke (8232b60, f9df0460, 0, 0, f56fcb9c, 0) + 12e f9dba0b8 Accessibility_EventListener_notifyEvent (8232b60, f56fcbc4, f9cd4fe0) + 4c f9cc3519 spi_atk_emit_eventv (8609b60, 0, 0, f56fcc44, f9cc443c, fed0ea50) + 1b1 f9cc41c2 spi_atk_bridge_window_event_listener (f56fccac, 1, f56fcd4c, 8233378) + 8e fedc1b28 signal_emit_unlocked_R (8236300, 0, 8609b60, 0, f56fcd4c) + 5a0 fedc0fe0 g_signal_emit_valist (8609b60, 91, 0, f56fcfb4) + 8c4 fedc1175 g_signal_emit (8609b60, 91, 0) + 25 f9d34591 window_removed (81b5968, 1, 8609b60, 0) + 99 fedc2df2 g_cclosure_marshal_VOID__UINT_POINTER (8233398, 0, 3, f56fd15c, f56fd0bc, 0) + 5e fedae073 g_closure_invoke (8233398, 0, 3, f56fd15c, f56fd0bc) + 107 fedc1cce signal_emit_unlocked_R (822d4c0, 22f, 81b5968, 0, f56fd15c) + 746 fedc0fe0 g_signal_emit_valist (81b5968, 72, 22f, f56fd440) + 8c4 fedc14d7 g_signal_emit_by_name (81b5968, f9d3b234, 1, 8609b60, 0) + 35b f9d2db10 _gail_toplevel_remove_child (81b5968, 84a7b98) + bc f9d2da46 gail_toplevel_hide_event_watcher (f56fd4cc, 1, f56fd56c, 81b5968) + 5e fedc1b28 signal_emit_unlocked_R (82232b8, 0, 84a7b98, 0, f56fd56c) + 5a0 fedc0fe0 g_signal_emit_valist (84a7b98, 1e, 0, f56fd7d4) + 8c4 fedc1175 g_signal_emit (84a7b98, 1e, 0) + 25 fec5caa8 gtk_widget_hide (84a7b98) + 70 fec62adb gtk_widget_dispose (84a7b98) + 3f fec66f0d gtk_window_dispose (84a7b98, 84a7b98, fed45e5c, fec5c804, f56fd868, f56fd86c) + 39 fedb00e5 g_object_run_dispose (84a7b98) + 3d feb7b588 gtk_object_destroy (84a7b98) + 3c fec5c834 gtk_widget_destroy (84a7b98) + 30 080e3aac gnomemeeting_threads_widget_destroy (84a7b98) + 20 080d7438 __1cMGMStunClientEMain6M_v_ (8602cf8, 0, f9a21c00, fef8f000, fef478ae, 0) + 528 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (8602cf8) + 85 fef47434 _thr_setup (f9a21c00) + 52 fef47690 _lwp_start (f9a21c00, 0, 0, 0, 0, 0) ----------------- lwp# 9 / thread# 9 -------------------- fef480b7 pollsys (8493200, c, 0, 0) feeff7c2 poll (8493200, c, ffffffff) + 52 fe914aeb g_main_context_iterate (8490e00, 1, 1, 83c7770) + 397 fe915124 g_main_loop_run (8497dc0) + 1b8 fe9c9b1e link_io_thread_fn (0) + 1e fe92e676 g_thread_create_proxy (83c7770) + 11a fef47434 _thr_setup (f9a20800) + 52 fef47690 _lwp_start (f9a20800, 0, 0, 0, 0, 0) ----------------- lwp# 10 / thread# 10 -------------------- fef480b7 pollsys (f97ada20, 2, 0, 0) fef03fde pselect (28, 84f0248, 84f0270, 84f0298, 0, 0) + 19e fef042ee select (28, 84f0248, 84f0270, 84f0298, 0) + 7e fe7681d7 __1cHPThreadLPXBlockOnIO6MiirknNPTimeInterval__i_ (84f12d8, 26, 2, f97adcc4) + 1af fe75e05f __1cIPChannelMPXSetIOBlock6Mn0ALPXBlockType_rknNPTimeInterval__i_ (84f1060, 2, f97adcc4) + 153 fe7554f5 __1cHPSocketJos_accept6Mr0pnIsockaddr_pi_i_ (84f0e00, 84f1060, f97add00, f97ade10) + 29 fe775432 __1cKPTCPSocketGAccept6MrnHPSocket__i_ (84f0e00, 84f1060) + 72 fdb64b1c __1cPOpalListenerTCPGAccept6MrknNPTimeInterval__pnNOpalTransport__ (84f1028, f97adef0) + 11c fdb632ac __1cMOpalListenerUListenForConnections6MrnHPThread_i_v_ (84f1028, 84f12d8, 0) + 194 fdb6c442 __1cMOpalListenerbEListenForConnections_PNotifierECall6kMrnHPObject_i_v_ (84f1248, 84f12d8, 0) + 1a fe77f030 __1cNPSimpleThreadEMain6M_v_ (84f12d8, 0, f9a20400, fef8f000, fef478ae, 0) + 40 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (84f12d8) + 85 fef47434 _thr_setup (f9a20400) + 52 fef47690 _lwp_start (f9a20400, 0, 0, 0, 0, 0) ----------------- lwp# 8 / thread# 8 -------------------- fef480b7 pollsys (8496580, 1, 0, 0) feeff7c2 poll (8496580, 1, ffffffff) + 52 fe914aeb g_main_context_iterate (8496700, 1, 1, 8496610) + 397 fe915124 g_main_loop_run (8497f10) + 1b8 fb3599fb startup_mainloop (0) + 2b fe92e676 g_thread_create_proxy (8496610) + 11a fef47434 _thr_setup (f9a21400) + 52 fef47690 _lwp_start (f9a21400, 0, 0, 0, 0, 0) bash-3.00$ pstack `pgrep regis` 25112: /usr/lib/at-spi-registryd --oaf-activate-iid=OAFIID:Accessibility_Regi fef480b7 pollsys (80eb038, 7, 0, 0) feeff7c2 poll (80eb038, 7, ffffffff) + 52 feca4aeb g_main_context_iterate (8072450, 1, 1, 806e8d0) + 397 feca4d7b g_main_context_iteration (8072450, 1) + 87 fee099a5 link_main_iteration (1) + 21 fedef617 giop_recv_buffer_get (8046618) + 5f fedf2b95 ORBit_small_invoke_stub (80eaeb8, fec51b80, 0, 0, 0, 804672c) + 139 fedf2a3d ORBit_small_invoke_stub_n (80eaeb8, fec51b5c, 0, 0, 0, 0) + 3d fee041ea ORBit_c_stub_invoke (80eaeb8, fec51b5c, 0, 0, 0, 0) + 12e fec396f5 Bonobo_Unknown_ref (80eaeb8, 804672c) + 41 fc9b66da bonobo_object_dup_ref (80eaeb8, 0) + 5a 080589bb registry_filter_event (8076818, 8046788, 80469bc) + 187 08058b01 impl_registry_notify_event (807682c, 80eddac, 80469bc) + 85 fee6b62a _ORBIT_skel_small_Accessibility_EventListener_notifyEvent (807682c, 0, 8046850, 8046870, 80469bc, 8058a7c) + 16 fee0112a ORBit_POAObject_invoke (8099c88, 0, 8046850, 8046870, 80468f8, 80469bc) + 22 fee05519 ORBit_OAObject_invoke (8099c88, 0, 8046850, 8046870, 80468f8, 80469bc) + 21 fedf2f1b ORBit_small_invoke_adaptor (8099c88, 80ee5d0, fee90480, 80468f8, 80469bc) + 2b3 fee01562 ORBit_POAObject_handle_request (8099c88, 80f3104, 0, 0, 0, 80ee5d0) + 32a fee01960 ORBit_POAObject_invoke_incoming_request (8099c88, 80ee5d0, 80469bc) + 54 fee01de6 ORBit_POA_handle_request (8073070, 80ee5d0, 80ee5e8) + 2ea fee05317 ORBit_handle_request (8072ff8, 80ee5d0) + 4b fedf014a giop_connection_handle_input (8089108) + 2e2 fee0b4e9 link_connection_io_handler (0, 1, 8089108) + 55 fee0d10d link_source_dispatch (80ea050, fee0b494, 8089108) + 41 feca3615 g_main_dispatch (80724d8) + 1d9 feca4705 g_main_context_dispatch (80724d8) + 85 feca4b22 g_main_context_iterate (80724d8, 1, 1, 806e8d0) + 3ce feca5124 g_main_loop_run (808efb8) + 1b8 fc9b201d bonobo_main (8046d8c, 8046c38, feffa7c0, feffa7c0, 8046c38, 8046d8c) + 5d 08057ddf main (3, 8046c7c, 8046c8c) + cb 08054b4a _start (3, 8046df4, 8046e0e, 8046e43, 0, 8046e53) + 7a
I compiled at-spi-registryd without optimization and got the stack trace below. I assume that that at-spi-registryd is attempting to call back into the application. I have just noticed that thread 16 is waiting on a condition variable instead of being in a poll call. I wonder why... 27065: /usr/lib/at-spi-registryd fef480b7 pollsys (80a29a8, 7, 0, 0) feeff7c2 poll (80a29a8, 7, ffffffff) + 52 fed14aeb g_main_context_iterate (807a750, 1, 1, 8076f18) + 397 fed14d7b g_main_context_iteration (807a750, 1) + 87 fec699a5 link_main_iteration (1) + 21 fec4f617 giop_recv_buffer_get (8046548) + 5f fec52b95 ORBit_small_invoke_stub (80f36a8, fecc1b80, 0, 0, 0, 804665c) + 139 fec52a3d ORBit_small_invoke_stub_n (80f36a8, fecc1b5c, 0, 0, 0, 0) + 3d fec641ea ORBit_c_stub_invoke (80f36a8, fecc1b5c, 0, 0, 0, 0) + 12e feca96f5 Bonobo_Unknown_ref (80f36a8, 804665c) + 41 fc9b66da bonobo_object_dup_ref (80f36a8, 0) + 5a 0805f1fb registry_clone_notify_context (80466d8) + 2b 0805f722 registry_queue_event (807e818, 80466d8) + 42 0805f827 registry_filter_event (807e818, 80466d8, 80468ec) + c7 0805f9a9 impl_registry_notify_event (807e82c, 80f6a3c, 80468ec) + 89 fee6b62a _ORBIT_skel_small_Accessibility_EventListener_notifyEvent (807e82c, 0, 8046780, 80467a0, 80468ec, 805f920) + 16 fec6112a ORBit_POAObject_invoke (80a1f50, 0, 8046780, 80467a0, 8046828, 80468ec) + 22 fec65519 ORBit_OAObject_invoke (80a1f50, 0, 8046780, 80467a0, 8046828, 80468ec) + 21 fec52f1b ORBit_small_invoke_adaptor (80a1f50, 80f30e0, fee90480, 8046828, 80468ec) + 2b3 fec61562 ORBit_POAObject_handle_request (80a1f50, 80f5e2c, 0, 0, 0, 80f30e0) + 32a fec61960 ORBit_POAObject_invoke_incoming_request (80a1f50, 80f30e0, 80468ec) + 54 fec61de6 ORBit_POA_handle_request (807b370, 80f30e0, 80f30f8) + 2ea fec65317 ORBit_handle_request (807b2f8, 80f30e0) + 4b fec5014a giop_connection_handle_input (8091108) + 2e2 fec6b4e9 link_connection_io_handler (0, 1, 8091108) + 55 fec6d10d link_source_dispatch (80f3088, fec6b494, 8091108) + 41 fed13615 g_main_dispatch (807a7d8) + 1d9 fed14705 g_main_context_dispatch (807a7d8) + 85 fed14b22 g_main_context_iterate (807a7d8, 1, 1, 8076f18) + 3ce fed15124 g_main_loop_run (80f2b38) + 1b8 fc9b201d bonobo_main (8075e20, 807e818, 807bf08, 807bf0a, 807bf08, 8061484) + 5d 0805e109 main (1, 8046bb0, 8046bb8) + 119 08058d1a _start (1, 8046d3c, 0, 8046d56, 8046d6d, 8046d94) + 7a
OK - this is a quite typical threading related evilness. Ekiga is highly threaded and uses this in the GUI parts AFAIR. The problem is this - when we designed ORBit2 it was not possible to tell whether there was a 'glib main loop' running - so this is a little messy. AFAIR the punch-line is, you need to have a gtk+ mainloop running somewhere so we can do the moral equivalent of g_idle_add (process_message); in the I/O thread as/when a request comes in for a thread-unsafe [ ie. all ATK interfaces ] object. Now from trace #19 [ from Fedrich Crozay ] which is most helpful we see:
+ Trace 87038
Thread 1 (Thread -1247610016 (LWP 3812))
ie. the guy we are waiting for to become idle and process the incoming request on the CORBA wrapper for the underlying AtkObject is blocking on the GDK_THREADS_ENTER/LEAVE lock we hold in thread 11 instead. That is bad. ;-( Unfortunately (of course) the at-spi-registryd needs to take a reference on that object before it can allow the 'notifyEvent' to return, which necessitates calling back into the client; which needs to be handled on the main thread, which needs the same threads lock.
Probably, the simplest (and perhaps) least risky fix is to add:
+ Trace 87041
GDK_THREADS_LEAVE(); Accessibility_Registry_notifyEvent(...); GDK_THREADS_ENTER(); immediately around the long-running CORBA calls here. Of course at the same time it would pay to do an audit of the code in the bridge to ensure we hold references / dup state where we need to, but a quick glance suggests most things are fine there. HTH :-)
Wouldn't my idea : http://bugzilla.gnome.org/show_bug.cgi?id=353533 solve the issue too?
Snark: Your idea is a workaround to the fact that there is a bug in atk. So, yes, after 2 or 3 months of reworking inside Ekiga, we would have hidden the bug.
Michael: That makes sense, and thanks also Padraig. I'll start preparing a patch to add GDK_THREADS guards around the notify emissions in atk-bridge. BTW Damien, despite the "atk" namespace in the code, it's at-spi (the "atk-bridge" component") that seems to have the bug.
oh, and while I was concerned about re-enterancy problems - they have (broadly) always existed at these points, so this should be no issue for us in fact [ but worth checking of course anyway ;-].
*** Bug 380010 has been marked as a duplicate of this bug. ***
*** Bug 369321 has been marked as a duplicate of this bug. ***
Li Yuan will try to make the patch to add GDK_THREADS guards around the notify emissions in atk-bridge and do some testing to see if it works.
One problem is that the bridge currently does not depend on GDK and it is not clear that we should be adding such a dependency to the bridge.
Interesting :-) good question wrt. what to do. You could -try- to push-down the gtk+ locking code into atk: of course you need to be careful, OO.o is tightly bound to gtk+'s custom locking functions. At worst you can do some fugly OO.o style hack where you do a dlsym on yourself to see if you have gdk_threads_enter / leave, hook those functions out and call them yourself. How does that sound ? :->
Outsider view : insane.
Padraig is correct - atk-bridge should not depend on GDK. Why isn't the GDK_THREAD stuff in glib instead, I wonder?
Another possibility - which fits better with the ATK/GAIL/ATK-bridge design I think, but which would require quick action since Gnome 2.17 ABI freeze is very soon: 1) toolkit dependencies should be restricted to the ATK implementation code; thus gdk/gtk+ dependencies should be restricted to gail. 2) specifically we've used AtkUtil API in the past to deal with this sort of issue, when an atk client like atk-bridge needs to access implementation-dependent code. Thus we could add ATK_THREADS_ENTER/ATK_THREADS_LEAVE, which would wrap some AtkUtil function pointers which were assigned in the atk initialization code, for instance something like atk_util_threads_enter (util) gailutil.c could then be modified to provide the appropriate implementation for atk_util_threads_ methods. Apps/toolkits wishing to use ATK without GDK/GTK+ could then provide their own appropriate thread locking code. Because AtkUtil doesn't seem to have any unused vfunc pointers in its AtkUtilClass struct, we might need to do something creative in order to add these virtual methods. But since currently we don't instantiate any AtkUtil objects, I would hope this would be feasible. At worst we could use some kind of versioning/dlsym approach to detect whether the new version of ATK was in use (preserving back-compat). We need to change/replace (via deprecation) the signatures on the AtkUtil prototypes anyway, because the current prototypes (which don't take object instances as their first parameters) are difficult to bind with python. (Ask 'ddrake at brontes3d dot com' for more info about that) Sound OK? Bill
Glib has nothing to do with gui.
Yes - that would be do-able I guess. Of course, there should be no need to break AtkUtil - surely you can add another AtkMisc [ or whatever ] object to contain these new methods, with a great big gpointer expansion[32] or whatever] in the Class ? :-)
Snark: Glib would be a sensible place for a general purpose threads utility. As this bug demonstrates, code that is called from inside a GDK/GTK+ program may need this without wanting to incur a gdk dependency. ATK/atk-bridge is just finding it first, but something like a threaded media lib could easily encounter the same sort of issue.
Bill: glib already has threading utilities. Ekiga is heavily threaded media application, which is why the problem showed up there. But it protects dangerous paths with _threads_enter/_leave calls. Obviously there's something somewhere which doesn't.
At a second thought, there might be a problem in Michael's suggestions in comment #33. If we do sth. like: GDK_THREADS_LEAVE(); Accessibility_Registry_notifyEvent(...); GDK_THREADS_ENTER(); It might work with ekiga's code: GDK_THREADS_ENTER(); gtk_widget_destroy (widget); GDK_THREADS_LEAVE(); However, if the outside code doesn't call GDK_THREADS_ENTER() as in most cases, calling GDK_THREADS_ENTER() in bridge will mark the beginning of a critical section without its end pair. I think this will cause bigger problems. Or did I missing sth?
Harry - I am not sure I understand you, but I think you are pointing out that the in-process code which handles the event from Accessibility_Registry_notifyEvent() needs to call GDK_THREADS_ENTER/LEAVE in order for the above code to work correctly. Is this what you meant? Anyway, we CANNOT use GDK_THREADS_ENTER/LEAVE here, since the atk-bridge must not depend on GDK. If we need to use thread guards here, they must be added to ATK API instead. We have only a few days to do this for gnome-2.17. However I don't think we have a clear solution to the problem yet, as Harry points out. There are a couple of things I am not clear about yet; 1) why doesn't the notifyEvent call complete here - i.e. what code is causing thread 1 to be called into? i.e. what ATK call is being fulfilled in thread 1? Can it be fulfilled safely without holding the GDK lock? In response to Harry's question in comment #50 - should we be putting thread guards in all atk-bridge wrappers, or in all the GAIL implementations? That would ensure that the appropriate GDK lock was held. If we put the thread guards (GDK_THREADS_ENTER/LEAVE) in all the gail implementations, then we could also release the lock in gail just before emitting ATK events and re-claim the lock afterwards (this might avoid any need for new ATK locking API). What do you think of this last suggestion, Harry and Li?
Harry - there is little to ~no risk to dropping the GDK_THREADS lock while we do a remote CORBA call, if other code is not taking the lock correctly - that is broken anyway. Bill - wrt. the things you're not clear about: did you read my analysis in conjunction with trace #19 ? - the reason the notifyEvent call doesn't complete is that it is synchronous - somewhere during the processing of that synchronous event an AT (or something) wants to call -into- this process again. Unfortunately at some stage before that call could enter an X event came along to be processed: thread 11 is holding the GDK mutex that thread 1 needs to acquire in order to execute that incoming method ergo deadlock. Ergo we never get to processing the incoming CORBA call anyway. However, this is not really X specific at all - we -should- be taking the GDK_THREADS mutex on every (idle) incoming CORBA call anyhow [ I guess ] - most likely we forget to do that ATM ;-) but we should be doing it. Ergo - this needs careful thought, analysis of the traces, a little understanding, and a way to move the locking primitives to somewhere where they can be used consistently in the bridge. Having that in 'gail' sounds sensible, ultimately I think we just need 2 new methods 'atk_enter' 'atk_leave' (or somesuch) that are implemented in gail, and call them at suitable places.
Bill, as Michael said in above, in comment #19, thread 1 is ekiga's main thread and is calling gtk_main(). Thread 11 is calling some code like: GDK_THREADS_ENTER(); gtk_widget_destroy (widget); GDK_THREADS_LEAVE(); And from Michael's analysis you can see the deadlock reason. Michael, from your comment, "there is little to ~no risk to dropping the GDK_THREADS lock while we do a remote CORBA call, if other code is not taking the lock correctly - that is broken anyway." Do you mean we can call gdk_thread_leave() in a thread even if it doesn't call gdk_thread_enter() before? From gdk.c, gdk_threads_impl_unlock() is calling g_mutex_unlock (). And in glib/gthread/gthread-impl.c, g_mutex_unlock_errorcheck_impl () will lead to g_error or g_warning if current thread doesn't lock the mutex now. Michael and Bill, are you suggesting we add gdk_thread_enter() and gdk_thread_leave() in gail's functions which are calling CORBA functions? Also from the code above, it seems a thread can not call gdk_thread_enter() again if it has called gdk_thread_enter() before. So if ekiga's code has called GDK_THREADS_ENTER(), we cannot call it again in gail or bridge in the same thread. Is there a way to check whether current thread has called gdk_thread_enter() already (holding the mutex)?
Harry; yes we need to call gdk_thread_leave() in code that didn't -take- that mutex; and of course we need to call gdk_thread_enter() again when we've done whatever call we want to release the guard during. This is normal (though possibly counter-intuitive) cf. eg. gtk/gtkmain.c (gtk_main): { GDK_THREADS_LEAVE (); g_main_loop_run (loop); GDK_THREADS_ENTER (); gdk_flush (); }
Michael, I guess the code above works because the thread is holding the gdk mutex already. Anyway, any threads that dealing with gtk should hold the gdk mutex already so I think this should not be a problem. So do we still need to add thread guards (GDK_THREADS_ENTER/LEAVE) in gail as it should be added by the thread calling gtk functions? Calling GDK_THREADS_LEAVE() to release the lock in gail just before emitting ATK events and calling GDK_THREADS_ENTER() to re-claim the lock afterwards might be a possible solution, as Bill suggested. For this bug, from thread 11 in comment 19, around stack 12, I think adding LEAVE and ENTER pair around g_signal_emit() in window_removed() of gailutil.c should work. But how could we avoide signal nesting problem? I still think we cannot call GDK_THREAD_LEAVE twice in one thread without calling GDK_THREAD_ENTER. So if one ATK signal triggers another ATK signal, there might be a problem.
Hi Harry. I added some GDK_THREADS guards in gail some time ago when we started using gail/atk in OO.o - I -think- I caught almost all places. In summary, the only place you need to add them is where the lock is not already held - eg. g_idle / g_timeout callbacks. And yes - taking it when you don't need to is a problem; it's a non-recursive mutex. I'm confused wrt. your suggestion wrt. window_removed, AFAICS this is gail code called by gtk+ code - and should => be covered by the same mutex. There is no need to do any LEAVE/ENTER stuff there (that I can see (?)). At least - signals themselves should not be a problem. The only place we need to drop / re-take the lock is over long CORBA calls (eg. event emission) that in order to complete may require processing incoming calls on other threads [ ie. mostly isolated to at-spi/bridge surely ? ]
Michael, we will do a check to ensure all g_idle / g_timeout callbacks have the GDK_THREADS guards in gail. As for my suggestion in comment #55, actually it was suggested by Bill in last paragraph of comment #51, "If we put the thread guards (GDK_THREADS_ENTER/LEAVE) in all the gail implementations, then we could also release the lock in gail just before emitting ATK events and re-claim the lock afterwards (this might avoid any need for new ATK locking API)." The problem for the above suggestion is that signals could be nested, so we could not just wrap around all g_signal_emit* by GDK_THREADS_LEAVE and GDK_THREADS_ENTER in gail. So it looks we have to do sth. in bridge. We have several options here: 1. in bridge, "do a dlsym on yourself to see if you have gdk_threads_enter / leave, hook those functions out and call them yourself." 2. add atk_util_threads_enter/leave pair in atkutil.c, implement them in gail using gdk_threads_enter/leave, and call them in bridge. We might be able to fix the python binding problems at the same time. This will break ABI, right? 3. add atk_misc_threads_enter/leave pair in new atkmisc.c as suggested by Michael, implement them in gail, and call them in bridge. So which one should we choose?
personally, I like '3' - ultimately the Qt guys will have the same problem in future, and it'd be nice to have API for them to use. Also - if we find a better way in future, it's easy to deprecate these methods :-) Finally - when you create AtkMisc - please do add gpointer dummy[32]; or something - the cost of that is only 128/256 bytes per process if an AtkMisc is instantiated, and it can save a lot of hassle in future :-) HTH.
Created attachment 79758 [details] [review] ATK patch as applied applied to cvs Jan 8, 2006, and in ATK_1_13_0 tarball. ATK_1_13_1 tarball (to be released later today) will have some minor fixups.
Created attachment 79760 [details] [review] proposed gail patch, which includes GDK_THREADS_ENTER implementation. Li, does this seem correct to you?
gail patch has been committed to gail 1.9.5.
I will make bridge patch this week.
Created attachment 80585 [details] [review] atk part of patch
Created attachment 80586 [details] [review] gail part of patch
Created attachment 80587 [details] [review] spi part of patch
Comment on attachment 80585 [details] [review] atk part of patch Why do we need to do ATK_MISC_GET_CLASS instead of g_type_class_ref(ATK_TYPE_MISC) here?
Comment on attachment 80586 [details] [review] gail part of patch yes, I agree we probably need to unref/ref GAIL_TYPE_MISC as well as GAIL_TYPE_UTIL here.
because we use AtkMiscClass *miscclass = ATK_MISC_CLASS (klass); in gail_misc_class_init. This only overload "gail_misc_class", not "atk_misc_class". They are different instances. If we use g_type_class_peek (ATK_TYPE_MISC); in gail_misc_class_init, then we can use g_type_class_ref (ATK_TYPE_MISC); here.
Is it FIXED like in "proposed solution committed and tested in ekiga : it doesn't freeze anymore", or just FIXED like in "the proposed solution was committed" ? There was no comment in the last mail, so I prefer asking...
I tested on my system, it doesn't freeze anymore.
Wonderful! It was terrifying to have a phone not working for people it could help much... That rocks!
Very good job, thanks so much !!!!!!
Thank you.
Li Yuan, Bill, can you please take a look at bug 410636? thanks.
Andre I believe 410636 is fixed in cvs already.
While the committed fix may have fixed the ekiga problem, it breaks a large number of other threaded gtk applications. E.g gftp freezes now if a11y is enabled. This is because it is common to do the following: g_threads_init (NULL); gdk_threads_init (); gtk_init (&argc, &argv); /* setup code */ GDK_THREADS_ENTER(); gtk_main (); GDK_THREADS_LEAVE(); The problem is that the a11y stack is not doing thread leave/enter pairs before the application code calls GDK_THREADS_ENTER. Compiling with errorcheck mutexes clearly exposes that the a11y code tries to unlock an unlocked mutex. Maybe this is the correct way to handle things, and we just have to fix all applications to take the lock immediately after initializing threads. Anyway, reopening this for further consideration, since it does break applications which have worked with all previous GTK+ releases.
From the samples in http://developer.gnome.org/doc/API/2.0/gdk/gdk-Threads.html, the correct call order is: g_thread_init (NULL); gdk_threads_init (); gdk_threads_enter (); gtk_init (&argc, &argv); gtk_main (); gdk_threads_leave (); So I think we should fix all the broken apps.
I know that that's what the example shows. The point is that g_thread_init (NULL); gdk_threads_init (); gtk_init (&argc, &argv); gdk_threads_enter (); gtk_main (); gdk_threads_leave (); did work just fine until recently...
of course we could fix all broken apps, but i wonder if two weeks before a major release is the best time for it... right now i'm just interested in a smooth workaround (or better: solution) for a stable GNOME 2.18 (and also the GNOME apps that are not directly part of GNOME, but that are shipped by the distros).
Another application that breaks due to this is the mail-notification applet for evolution.
So - surely one simple fix is to avoid doing any locking during 'init' time - right ? :-) While we are initializing there is almost certainly no likelihood that some horribly complicated threading scenario is underway: so, we can simply have some guard against this to inhibit the locking there. Would that fix it ? :-)
So - code wise this should be trivial; if I could compile / test it I'd do it myself: static int during_init_shutdown = 0; #define bridge_threads_leave() \ if (!during_init_shutdown) atk_misc_threads_leave(misc) #define bridge_threads_enter() \ if (!during_init_shutdown) atk_misc_threads_enter(misc) Then do a search/replace of the existing locking calls, At top/tail of atk_bridge_init inc/dec 'during_init_shutdown' and similarly at top/tail of 'spi_ati_bridge_exit_func' and 'gnome_accessibility_module_shutdown' HTH.
The application doesn't hang when atk-bridge init. You know, gftp also does some setup work before it calls GDK_THREADS_ENTER, like g_threads_init (NULL); gdk_threads_init (); gtk_init (&argc, &argv); /*...*/ gtk_widget_show(window); GDK_THREADS_ENTER(); gtk_main (); GDK_THREADS_LEAVE(); The gtk_widget_show will emit signals to atk-bridge, this will hang the application. And we can not tell which signal is before the first GDK_THREADS_ENTER and which is after it. So I think the best way is changing the application's init code.
Michael, I think it is pretty much impossible for a library/module deep in the stack to know if the app has already taken the lock or not. One could try to check if we are in a mainloop already, but that sounds somewhat error-prone, too.
Ah - yes, this is horribly difficult then :-) and yes - basically it's a grotesquely evil problem. FWIW, we suffer almost the same problem with CORBA calls being called without a mainloop running & not know if there is such a loop, and so on. What you might be able to do is use the new: gboolean g_main_context_is_owner (GMainContext *context); which is the method I wanted to try and work around / detect / emulate some of the old behaviors in the threaded ORB. It should tell you if you're a child of the main glib mainloop, which -might- be what you want to find out :-) It seems unreasonable to expect people to take the toolkit lock as the 1st step - at least, so much code doesn't do that that ...
I have tried this. Unfortunately g_main_context_is_owner will return FALSE if application has not called gtk_main. So it can not resolve our problems.
Well, the imporant thing is that it returns TRUE if you are inside the main loop, which is the only place where we can be reasonably sure that the app will hold the GDK lock, so it is safe to do the leave/enter thing.
Li - sure, but when you havn't called gtk_main is surely precisely what you want to know ? :-) [ such that you don't drop / take the lock ? ;-] Sigh - perhaps that's not what you want: perhaps a -better- approach is to guard the init (as we discussed before), and do: g_idle_add (post_init, NULL); and the post_init turns on the locking: that would perhaps work better (including for all the strange threaded scenarios where g_main_context_is_owner will not be happy at all ;-). How does that sound ?
It sounds great! :-) I will attach a patch.
Created attachment 83613 [details] [review] new patch
Does it look good?
Looks fine to me - though not an ideal solution of course, but compatible. Worth checking your g_idle_add function triggers only once [ I don't recall the polarity of the return value ]; otherwise great.
Committed, thanks Michael and Matthias!
I just tried the patch and Ekiga no longer freezes when accessibility is on. But if Orca is running, Ekiga freezes at step 4. Not sure if that's an at-spi issue or an Orca issue....
Could you remind us which step the fourth is ?
Connection Type
That means a place where we show a popup window with a progression bar from another thread :-) Since this is not a problem when no accessibility is used... probably this bug should be reopened.
Are you sure the freeze only happens with the patch? I should not be the patch's problem...
Nice: of course, it'd be rather useful to get a new stack-trace for step 4: it's most likely a different problem - hopefully something somewhat different :-)
Yes I got the freeze today. And I found the problem is if Orca does some corba calls to ekiga, the handle function (impl_*) in ekiga will not be in gdk_threads_enter/leave guards. So if the handle function emit some signals, our bridge will call gdk_threads_leave firstly. Here is the trace: libgdk-x11-2.0.so.0.1000.9`gdk_threads_impl_unlock+0x3b libatk-1.0.so.0.1709.1`atk_misc_threads_leave+0x4c libatk-bridge.so`spi_atk_emit_eventv+0x206 libatk-bridge.so`spi_atk_bridge_property_event_listener+0x5d7 libgobject-2.0.so.0.1200.9`signal_emit_unlocked_R+0x5a0 libgobject-2.0.so.0.1200.9`g_signal_emit_valist+0x8c4 libgobject-2.0.so.0.1200.9`g_signal_emit+0x25 libatk-1.0.so.0.1709.1`atk_object_notify+0xa3 libgobject-2.0.so.0.1200.9`g_cclosure_marshal_VOID__PARAM+0x5b libgobject-2.0.so.0.1200.9`g_type_class_meta_marshal+0x40 libgobject-2.0.so.0.1200.9`g_closure_invoke+0x107 libgobject-2.0.so.0.1200.9`signal_emit_unlocked_R+0x414 libgobject-2.0.so.0.1200.9`g_signal_emit_valist+0x8c4 libgobject-2.0.so.0.1200.9`g_signal_emit+0x25 libgobject-2.0.so.0.1200.9`g_object_dispatch_properties_changed+0x4e libgobject-2.0.so.0.1200.9`g_object_notify_dispatcher+0x17 libgobject-2.0.so.0.1200.9`g_object_notify_queue_thaw+0xc9 libgobject-2.0.so.0.1200.9`g_object_notify+0x17f libatk-1.0.so.0.1709.1`atk_object_set_name+0xeb libbonoboui-2.so.0.0.0`grip_item_a11y_initialize+0x2b libatk-1.0.so.0.1709.1`atk_object_initialize+0x92 libbonoboui-2.so.0.0.0`bonobo_a11y_set_atk_object_ret+0x22 libbonoboui-2.so.0.0.0`bonobo_dock_item_grip_get_accessible+0xdb libgtk-x11-2.0.so.0.1000.9`gtk_widget_get_accessible+0x39 libgail.so`gail_container_ref_child+0x153 libatk-1.0.so.0.1709.1`atk_object_ref_accessible_child+0x94 libspi.so.0.10.11`impl_accessibility_accessible_get_child_at_index+0x7d libspi.so.0.10.11`_ORBIT_skel_small_Accessibility_Accessible_getChildAtIndex+0x38 libORBit-2.so.0.1.0`ORBit_POAObject_invoke+0x22 libORBit-2.so.0.1.0`ORBit_OAObject_invoke+0x21 since all the corba call will be handled in main_loop, so I guess we can use g_main_context_is_owner now :) I will attach a patch, if it is OK, I will ask for a hard code freeze break.
Created attachment 84148 [details] [review] patch to handle the "corba call" freeze.
Sigh...
Well - Li - the problem (really) is that all the incoming CORBA calls need to take the GDK_THREADS lock as they enter, and release it as they leave I think. That is quite a biggish patch to at-spi/libspi/ but that's conceptually correct. Simply not taking the GDK_THREADS lock when we own the GMainContext doesn't get us there really. OTOH - if anyone is playing games where in gtk+ code they do: Accessible_doFoo () on an in-process implementation (as they can) things will wedge: since the GDK_THREADS mutex is non-recursive [ why, oh why ? for VCL it is recursive and life is so much better ]. Anyhow - I think that situation should be extremely unlikely [ the API is ~only used remotely ] and of course it's only a problem in a threaded environment I guess. Of course, the good news is that for other threaded apps (like OO.o) this will make everything rather more robust too. Thanks Li & sorry for the pain. PS. I would wrap the threads enter/leave in a similar fashion for use in at-spi/libspi FWIW.
Thanks Mechael. Do you have an example which would break applications if we don't have a lock in CORBA calls?
Hi Li - sure; that's easy: Thread 1: g_lib_main() -> Accessible_getChildCount() -> g_list_length (a11y_list) Thread 2: GDK_THREADS_ENTER() ... a11y_list = g_list_remove (a11y_list, baa) ... How about that ? ;-) if you add a GDK_THREADS_ENTER to the impl_accessible_get_child_count method the problem is gone :-)
OK, thanks :) So I guess it is impossible to add threads guards in ORBit, right?
you want me to add a toolkit specific thread guard inside ORBit ? :-) unfortunately, we would have the same problems there ... But I guess it might be possible to do something with a custom POA (in the abstract) that would take & release the lock as calls came in & executed - however: ORBit is not under active (feature) development really :-)
Hi Micheal, I have made the patch. But found ekiga still cannot startup. Because in ekiga's code: gdk_threads_enter (); ... GnomeMeeting::Process ()->BuildGUI (); ... gtk_main (); In BuildGUI, ekiga will call bonobo_activation_activate. And at this time, Orca calls in. In impl_accessibility_accessible_get_state, we call gdk_threads_enter again. And hang... So application should make sure gdk_threads_leave first when they call some bonobo functions?
Created attachment 84560 [details] freeze trace after "threads_enter" in impl*
Urgh - un-controlled re-enterancy is a serious hazard. I think we bump against the limits of what it's possible to achieve with a non-recursive gdk-threads mutex. ie. here are the problems: * it's simply not possible (or correct) to guard every call that may do a CORBA call with a gdk_threads_unlock. * it's not possible to take / drop a lock in the ORB for each call (what lock) * it's not possible to detect if the thread guard is taken in the 1st place: people do immediate calls before running a mainloop / taking the lock anyway [ this is worked around by the idle handler of course, but it's not a pleasantly generalized solution ]. * it's not possible to use the 'AT_IDLE' POA policy since we need to process incoming a11y calls during an outgoing event emission. ... so ... since we already have a pile of not so pleasant implementations here - what we could do, is to stack some more up: By using the gdk_threads_set_lock_functions - we can substitute a recursive mutex for a non-recursive one - or (perhaps simpler) add a: thread_has_gdk_lock () type method - by hooking & proxying gdk_thread requests - we could (presumably) track what thread has the lock at any given time, and simply avoid taking it again if it's us (to catch these rather unpleasant & hard to detect scenarios). This may run into interesting problems with OO.o (the main user of that hook) - that is worth checking: I'll poke obr he'll want to follow this I think.
(In reply to comment #110) > - that is worth checking: I'll poke obr he'll want to follow this I think. Yes, he does ;-).
gdk_threads_set_lock_functions must be called before gdk_threads_init, but in ekiga, they call gdk_threads_init before gnome_program_init. So we cannot set the lock functions...
Now is the right time to discuss things ekiga does at startup, and in which order it should do them : we're adding a plugin system, so we will have to load them before some steps and after others.
I don't think we should call gdk_threads_set_lock_functions in bridge. An application could call it for its own purpose. Should we file a bug for gdk team to let them add a gdk_thread_has_lock() function?
libraries or modules _cannot_ use gdk_threads_set_lock_functions(). that does not work.
IIRC the OOo object model (UNO) avoids the original problem by dispatching callbacks from any (remote) call to the originating thread instead of the main thread. Don't know whether this would be feasable to do in Orbit2/CORBA though.
Hi Oliver - I guess, we could try to add some thread affinity here; we do have a 'thread-per-poa' policy that could be adapted to group together all the a11y calls - but unfortunately, it starts a new thread for the 1st call, which would be rather sub-optimal. So - we'd require a new ORBit2 custom policy here (not -that- tough to write cf. ORBit2/src/orb/poa/poa.c & orbit-adaptor.c etc.)
For we do not have another way to fix this problem, I'd like to commit #84148 patch into trunk first. The bug is blocking ekiga's testing now. But we are still looking for the "perfect" patch.
committed #84148 into HEAD for now.
Hi all, this turns out to be a really helpful fix. Not only does it fix ekiga's hang but also resolves hangs in the following applications. wifi-radar gftp update-manager and gnome-bit torrent. Thanks a lot! this really helps. I'll keep testing for negative side effects but things are looking good so far.
Glad to know we're not the only ones bitten by this :-)
*** Bug 418880 has been marked as a duplicate of this bug. ***
When I debug another bug reported in Ekiga, I found this patch has potential threat to thread safety. "atk_misc_threads_leave(misc)" will unlock the thread mutex which is needed to protect the critical section. Once another thread has the chance to run between the leave and enter, thread safety can't be sured.