Bug 329454 – Ekiga freezes in config dialog when accessibility is on

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 329454 - Ekiga freezes in config dialog when accessibility is on


Summary:	Ekiga freezes in config dialog when accessibility is on


Status:	RESOLVED FIXED

Product:	at-spi
Classification:	Platform
Component:	atkbridge
Version:	unspecified
Hardware:	Other Linux

Importance:	High critical
Target Milestone:	---
Assigned To:	Li Yuan
QA Contact:	Li Yuan

URL:
Whiteboard:

Duplicates:	327889 330647 331296 333466 335838 369321 380010 (view as bug list)
Depends on:
Blocks:	404155

Reported:	2006-02-01 12:50 UTC by hanke
Modified:	2011-08-09 19:38 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Stack trace up to the freeze (10.00 KB, application/x-compressed-tar) 2006-02-01 20:31 UTC, hanke		Details
ATK patch as applied (10.23 KB, patch) 2007-01-08 15:58 UTC, bill.haneman	committed	Details \| Review
proposed gail patch, which includes GDK_THREADS_ENTER implementation. (3.89 KB, patch) 2007-01-08 16:15 UTC, bill.haneman	committed	Details \| Review
atk part of patch (787 bytes, patch) 2007-01-18 12:14 UTC, Li Yuan	committed	Details \| Review
gail part of patch (342 bytes, patch) 2007-01-18 12:15 UTC, Li Yuan	committed	Details \| Review
spi part of patch (2.37 KB, patch) 2007-01-18 12:15 UTC, Li Yuan	committed	Details \| Review
new patch (3.14 KB, patch) 2007-03-01 04:00 UTC, Li Yuan	committed	Details \| Review
patch to handle the "corba call" freeze. (649 bytes, patch) 2007-03-07 10:34 UTC, Li Yuan	committed	Details \| Review
freeze trace after "threads_enter" in impl* (4.50 KB, text/plain) 2007-03-14 08:33 UTC, Li Yuan		Details

Description hanke 2006-02-01 12:50:28 UTC

When I turn accessibility on in Gnome configuration and run ekiga-snapshot
(version 1.99.0-20060125-02) for the first time (such that the configuration dialog appears), then on the 5th step of the configuration dialog (after skipping the free registration by saying ``no'') which deals with NAT configuration, the application freezes and I have to kill it.

This only happens with accessibility turned on, when the GTK accessibility module gets loaded.

Additionally, when I do this while running Gnopernicus and Gnome Speech, the speech halts and festival-synthesis-driver takes 100% CPU. I don't know how closely is this related, but it happens in the same moment.

I'm including a backtrace bellow (with some symbols unresolved, unfortunatelly).
However, the bug is very well reproducible. It was independently found on two different machines by two different users, one of them being me. It happens every time in my configuration. So I think it could be a good opportunity to fix a bug.

Steps to reproduce:
1) Install Ekiga (formerly gnome-meeting)
2) Create a new user, login to Gnome under this user
(Important since when running the Ekiga configuration druid
from Ekiga menu, this freeze does not happen -- I think because Ekiga
already stores some configuration. I stress you must have an unconfigured instance of Ekiga to be able to reproduce this.)
3) Turn on accessibility in preferences, relogin
4) Run Ekiga, it should display the configuration guide
5) Proceed with the configuration, skip the offered registration
6) At step 5 (NAT discovery), the application freezes and you are no longer able to push any button

GDB backtrace of Ekiga:
0xb675a1a9 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt

+ Trace 65814

#0 __lll_mutex_lock_wait
from /lib/tls/libpthread.so.0
#1 _L_mutex_lock_29
from /lib/tls/libpthread.so.0
#2 ??
#3 ??
from /usr/lib/libavahi-glib.so.1
#4 ??
#5 ??
#6 ??
#7 ??
from /usr/lib/libgdk-x11-2.0.so.0
#8 ??
#9 ??
#10 ??
#11 gdk_threads_leave
from /usr/lib/libgdk-x11-2.0.so.0
#12 gdk_threads_leave
from /usr/lib/libgdk-x11-2.0.so.0
#13 _gdk_events_queue
from /usr/lib/libgdk-x11-2.0.so.0
#14 g_main_context_check
from /usr/lib/libglib-2.0.so.0
#15 g_main_context_check
from /usr/lib/libglib-2.0.so.0
#16 g_main_loop_run
from /usr/lib/libglib-2.0.so.0
#17 gtk_main
from /usr/lib/libgtk-x11-2.0.so.0
#18 main


If you need more information, please indicate how can I get them and I'll quickly send them back to you.

Comment 1 bill.haneman 2006-02-01 14:35:29 UTC

Hi Hanke:

<standard stack trace improvement request>
Thanks for the bug report. Unfortunately, that stack trace is not very useful in determining the cause of the crash. Can you get us one with debugging symbols? Please see http://live.gnome.org/GettingTraces for more information on how to do so.
</standard stack trace improvement request>

Actually stack traces don't help much for freeze bugs such as this.  Are you using the latest glib and ORBit2?

I think this bug needs to be reassigned to Ekiga for more evaluation, there's no indication of where things are hanging.  Maybe a truss output would help?

Comment 2 Snark 2006-02-01 17:35:56 UTC

Hmmmm... on that page, ekiga launches a firewall detection in a popup with a progress bar.

Could it explain the freeze ?

Comment 3 bill.haneman 2006-02-01 19:38:11 UTC

progress bar.... could Ekiga be calling progress-bar-update APIs unnecessarily (i.e. too often) during this process?  This could cause the progress bar to emit an enormous stream of RPC notifications...

Comment 4 hanke 2006-02-01 20:25:33 UTC

> Actually stack traces don't help much for freeze bugs such as this.  Are you
> using the latest glib and ORBit2?

Not CVS. I tried it with Orbit2 version 2.12.4 and glib 2.8.6 from Debian testing. Is this ok?
 
> I think this bug needs to be reassigned to Ekiga for more evaluation, there's
> no indication of where things are hanging.  Maybe a truss output would help?

I'm attaching strace output.

> progress bar.... could Ekiga be calling progress-bar-update APIs unnecessarily
> (i.e. too often) during this process?  This could cause the progress bar to
> emit an enormous stream of RPC notifications...

This might be possible. If I disable accessibility and look at what would happen normally, I see it is not an ordinary progress bar but rather there
is a bar going right and left to indicate activity. However, with accessibility on, I don't even see the progress bar window popup.

I still think it should be reproducible. As I said, it happened on two different computers (whose configuration and library versions are likely different).

Comment 5 hanke 2006-02-01 20:31:36 UTC

Created attachment 58546 [details]
Stack trace up to the freeze

I don't know if it is useful. It shows the same as the GDB trace. It stops in a mutex.

Comment 6 hanke 2006-02-01 21:03:25 UTC

Here is a complete backtrace from GDB including the other threads too. Thread 2 looks somehow suspicous. Unfortunatelly, I didn't get more information about thread one. I might try to install libavahi-glib from sources if necessary.

[Switching to thread 1 (Thread -1242863936 (LWP 11994))]#0  0xb683a1a9 in __lll_mutex_lock_wait ()
   from /lib/tls/libpthread.so.0

+ Trace 65834

#0 __lll_mutex_lock_wait
from /lib/tls/libpthread.so.0
#1 _L_mutex_lock_29
from /lib/tls/libpthread.so.0
#2 ??
#3 ??
from /usr/lib/libavahi-glib.so.1
#4 ??
#5 ??
#0 pthread_cond_wait
from /lib/tls/libpthread.so.0
#1 giop_recv_buffer_get
from /usr/lib/libORBit-2.so.0
#2 ORBit_small_invoke_stub
from /usr/lib/libORBit-2.so.0
#3 ORBit_small_invoke_stub_n
from /usr/lib/libORBit-2.so.0
#4 ORBit_c_stub_invoke
from /usr/lib/libORBit-2.so.0
#5 Bonobo_Unknown_unref
from /usr/lib/libbonobo-activation.so.4
#6 gnome_accessibility_module_shutdown
from /usr/lib/gtk-2.0/modules/libatk-bridge.so
#7 gnome_accessibility_module_shutdown
from /usr/lib/gtk-2.0/modules/libatk-bridge.so
#8 g_signal_stop_emission
from /usr/lib/libgobject-2.0.so.0
#9 g_signal_emit_valist
from /usr/lib/libgobject-2.0.so.0
#10 g_signal_emit
from /usr/lib/libgobject-2.0.so.0
#11 atk_object_initialize
from /usr/lib/libatk-1.0.so.0
#12 g_cclosure_marshal_VOID__PARAM
from /usr/lib/libgobject-2.0.so.0
#13 g_cclosure_new_swap
from /usr/lib/libgobject-2.0.so.0
#14 g_closure_invoke
from /usr/lib/libgobject-2.0.so.0
#15 g_signal_stop_emission
from /usr/lib/libgobject-2.0.so.0
#16 g_signal_emit_valist
from /usr/lib/libgobject-2.0.so.0
#17 g_signal_emit
from /usr/lib/libgobject-2.0.so.0
#18 g_object_interface_list_properties
from /usr/lib/libgobject-2.0.so.0
#19 g_value_get_flags
from /usr/lib/libgobject-2.0.so.0
#20 g_object_notify
from /usr/lib/libgobject-2.0.so.0
#21 atk_object_set_parent
from /usr/lib/libatk-1.0.so.0
#22 gail_toplevel_new
from /usr/lib/gtk-2.0/modules/libgail.so
#23 g_signal_stop_emission
from /usr/lib/libgobject-2.0.so.0
#24 g_signal_emit_valist
from /usr/lib/libgobject-2.0.so.0
#25 g_signal_emit
from /usr/lib/libgobject-2.0.so.0
#26 gtk_widget_show
from /usr/lib/libgtk-x11-2.0.so.0
#27 gtk_window_present_with_time
from /usr/lib/libgtk-x11-2.0.so.0
#28 gtk_window_present
from /usr/lib/libgtk-x11-2.0.so.0
#29 gnomemeeting_threads_dialog_show_all
#30 GMStunClient::Main
#31 PThread::PX_ThreadStart
from /usr/lib/libpt.so.1.9.3
#32 start_thread
from /lib/tls/libpthread.so.0
#33 clone
from /lib/tls/libc.so.6
#0 select
from /lib/tls/libc.so.6
#1 PSocket::Select
from /usr/lib/libpt.so.1.9.3
#2 PSocket::Select
from /usr/lib/libpt.so.1.9.3
#3 OpalListenerUDP::Accept
from /usr/lib/libopal_linux_x86_r.so.2
#4 OpalListener::ListenForConnections
from /usr/lib/libopal_linux_x86_r.so.2
#5 OpalListener::ListenForConnections_PNotifier::Call
from /usr/lib/libopal_linux_x86_r.so.2
#6 PSimpleThread::Main
from /usr/lib/libpt.so.1.9.3
#7 PThread::PX_ThreadStart
from /usr/lib/libpt.so.1.9.3
#8 start_thread
from /lib/tls/libpthread.so.0
#9 clone
from /lib/tls/libc.so.6
#0 select
from /lib/tls/libc.so.6
#1 PThread::PXBlockOnIO
from /usr/lib/libpt.so.1.9.3
#2 PChannel::PXSetIOBlock
from /usr/lib/libpt.so.1.9.3
#3 PSocket::os_accept
from /usr/lib/libpt.so.1.9.3
#4 PTCPSocket::Accept
from /usr/lib/libpt.so.1.9.3
#5 OpalListenerTCP::Accept
from /usr/lib/libopal_linux_x86_r.so.2
#6 OpalListener::ListenForConnections
from /usr/lib/libopal_linux_x86_r.so.2
#7 OpalListener::ListenForConnections_PNotifier::Call
from /usr/lib/libopal_linux_x86_r.so.2
#8 PSimpleThread::Main
from /usr/lib/libpt.so.1.9.3
#9 PThread::PX_ThreadStart
from /usr/lib/libpt.so.1.9.3
#10 start_thread
from /lib/tls/libpthread.so.0
#11 clone
from /lib/tls/libc.so.6
#0 poll
from /lib/tls/libc.so.6
#1 g_main_context_check
from /usr/lib/libglib-2.0.so.0
#2 g_main_loop_run
from /usr/lib/libglib-2.0.so.0
#3 link_thread_io_context
from /usr/lib/libORBit-2.so.0
#4 g_static_private_free
from /usr/lib/libglib-2.0.so.0
#5 start_thread
from /lib/tls/libpthread.so.0
#6 clone
from /lib/tls/libc.so.6
#0 poll
from /lib/tls/libc.so.6
#1 g_main_context_check
from /usr/lib/libglib-2.0.so.0
#2 g_main_loop_run
from /usr/lib/libglib-2.0.so.0
#3 e_book_set_default_source
from /usr/lib/libebook-1.2.so.5
#4 g_static_private_free
from /usr/lib/libglib-2.0.so.0
#5 start_thread
from /lib/tls/libpthread.so.0
#6 clone
from /lib/tls/libc.so.6
#0 select
from /lib/tls/libc.so.6
#1 PHouseKeepingThread::Main
from /usr/lib/libpt.so.1.9.3
#2 PThread::PX_ThreadStart
from /usr/lib/libpt.so.1.9.3
#3 start_thread
from /lib/tls/libpthread.so.0
#4 clone
from /lib/tls/libc.so.6

Comment 7 Damien Sandras 2006-02-10 13:36:32 UTC

*** Bug 330647 has been marked as a duplicate of this bug. ***

Comment 8 Damien Sandras 2006-02-10 13:39:04 UTC

Reassigning to correct module. It is not an Ekiga bug as Ekiga works without accessibility turned on. The only specificity is that the dialog is displayed from a threads. So I guess it is a threading problem in ATK.

Comment 9 Damien Sandras 2006-02-10 13:40:27 UTC

*** Bug 327889 has been marked as a duplicate of this bug. ***

Comment 10 bill.haneman 2006-02-10 14:36:08 UTC

"Reassigning to correct module. It is not an Ekiga bug as Ekiga works without
accessibility turned on."

That's actually not necessarily a true statement - accessibility exposes application bugs fairly frequently.

I suspect however that the problem does have to do with an Ekiga thread exiting, which is why gnome_accessibility_module_shutdown is appearing in the stack trace.  

If an app has more than one thread that posts gtk+ widgets in a gui, and one of those threads exits, then bad things are indeed likely to happen.

Comment 11 Damien Sandras 2006-02-10 14:43:21 UTC

"That's actually not necessarily a true statement - accessibility exposes
application bugs fairly frequently."

Indeed, but I do not think it is the case here.

"I suspect however that the problem does have to do with an Ekiga thread
exiting, which is why gnome_accessibility_module_shutdown is appearing in the
stack trace."

That is absolutely correct. A thread is run, is doing the detection, and then displays the result from the detection before exiting.

"If an app has more than one thread that posts gtk+ widgets in a gui, and one of
those threads exits, then bad things are indeed likely to happen."

So is it a limitation of ATK?

Comment 12 bill.haneman 2006-02-10 14:57:29 UTC

Not so much ATK, probably, as at-spi.  While I don't quite see a "smoking gun" here yet, I think at-spi is probably the component in need of modification, if indeed the problem is support for exiting threads in apps which use more than one thread for gtk+ access.

Damien, are you seeing this with ORBit2 HEAD, by the way?  I recall Michael having put some important ORBit2 fixes in regarding multi-mainloop/thread support this year, and want to rule that area of investigation out for the time being, if possible.

Frankly I don't understand what it waiting for who, in the traces above, so I'll need some help figuring out why the ORB is stuck waiting.

(Please remember to add the "accessibility" keyword to all a11y bugs, otherwise they have limited visibility - thanks!)

Comment 13 Damien Sandras 2006-02-10 15:07:50 UTC

I'm not able to test with ORBIT head (I would have to compile too many components by myself and I'm already too late on many aspects to be able to sync ekiga 2.00 and GNOME 2.13.), but perhaps the original reporter can try?

Comment 14 Andreas Leitner 2006-02-10 15:18:26 UTC

I am sorry, but I don't have any of the infrastructure setup here, nor the time at the moment. I am only beta-testing ekiga here, because Damien made it soooo incredibly easy with the daily deb packages (;

Comment 15 Damien Sandras 2006-02-15 18:29:22 UTC

*** Bug 331296 has been marked as a duplicate of this bug. ***

Comment 16 Frederic Crozat 2006-02-17 11:17:49 UTC

I've done some test with 2.13 (HEAD) branch of GNOME :

it is a little better : ekiga doesn't freeze when the "testing "NAT dialog appears. But freeze occurs when the next dialog "do you want to enable STUN support" popups. 

Stacktrace : 
(gdb) thread apply all bt

+ Trace 66263

Thread 2 (Thread -1269670992 (LWP 30694))

#0 ??
#1 ??
#2 ??
#3 ORBit_small_invoke_stub
at orbit-small.c line 657
#4 ORBit_small_invoke_stub_n
#5 ORBit_c_stub_invoke
at poa.c line 2644
#6 Accessibility_EventListener_notifyEvent
at Accessibility-stubs.c line 321
#7 spi_atk_emit_eventv
at bridge.c line 677
#8 spi_atk_bridge_window_event_listener
at bridge.c line 1121
#9 signal_emit_unlocked_R
at gsignal.c line 2404
#10 IA__g_signal_emit_valist
at gsignal.c line 2197
#11 IA__g_signal_emit
at gsignal.c line 2241
#12 window_removed
at gailutil.c line 518
#13 IA__g_cclosure_marshal_VOID__UINT_POINTER
at gmarshal.c line 672
#14 IA__g_closure_invoke
at gclosure.c line 490
#15 signal_emit_unlocked_R
at gsignal.c line 2438
#16 IA__g_signal_emit_valist
at gsignal.c line 2197
#17 IA__g_signal_emit_by_name
at gsignal.c line 2265
#18 _gail_toplevel_remove_child
at gailtoplevel.c line 334
#19 gail_toplevel_hide_event_watcher
at gailtoplevel.c line 303
#20 signal_emit_unlocked_R
at gsignal.c line 2404
#21 IA__g_signal_emit_valist
#22 IA__g_signal_emit
at gsignal.c line 2241
#23 IA__gtk_widget_hide
at gtkwidget.c line 2133
#24 gtk_widget_dispose
at gtkwidget.c line 6650
#25 gtk_window_dispose
at gtkwindow.c line 1762
#26 IA__g_object_run_dispose
at gobject.c line 571
#27 IA__gtk_object_destroy
at gtkobject.c line 388
#28 IA__gtk_widget_destroy
at gtkwidget.c line 1995
#29 gnomemeeting_threads_widget_destroy
at gmdialog.c line 296
#30 GMStunClient::Main
at clients/stun.cpp line 392
#31 PThread::PX_ThreadStart
at tlibthrd.cxx line 1330
#32 start_thread
from /lib/tls/libpthread.so.0
#33 clone
from /lib/tls/libc.so.6

Comment 17 hanke 2006-02-25 11:20:39 UTC

It actually freezes much more often than in the config dialog, when accessibility is on. The result is it is unusable with accessibility switched on in Gnome.

However, these other crashes are not so specificly reproducible (probably because there are many possible crash situations) as the one in the configuration dialog which always occurs at the same place.

Comment 18 bill.haneman 2006-03-03 14:41:45 UTC

None of these stack traces have enough info to tell anything about at-spi here; the one that looks the oddest, which includes at-spi code, doesn't have line numbers (thread 2 of comment #6) and gnome_accessibility_module_shutdown is appears twice... basically the stack traces don't make sense!

Comment 19 Frederic Crozat 2006-03-03 15:40:11 UTC

better stacktrace, obtained on 2.14 branch, by pressing "test NAT" in the wizard, and after NAT test is complete and "enable STUN dialog" is displayed, program is freezed :

+ Trace 66657

Thread 11 (Thread -1276626000 (LWP 3839))

#0 __kernel_vsyscall
#1 pthread_cond_wait
from /lib/tls/libpthread.so.0
#2 giop_recv_buffer_get
at giop-recv-buffer.c line 708
#3 ORBit_small_invoke_stub
at orbit-small.c line 657
#4 ORBit_small_invoke_stub_n
at orbit-small.c line 575
#5 ORBit_c_stub_invoke
at poa.c line 2644
#6 Accessibility_EventListener_notifyEvent
at Accessibility-stubs.c line 321
#7 spi_atk_emit_eventv
at bridge.c line 677
#8 spi_atk_bridge_window_event_listener
at bridge.c line 1123
#9 signal_emit_unlocked_R
at gsignal.c line 2404
#10 IA__g_signal_emit_valist
at gsignal.c line 2197
#11 IA__g_signal_emit
at gsignal.c line 2241
#12 window_removed
at gailutil.c line 519
#13 IA__g_cclosure_marshal_VOID__UINT_POINTER
at gmarshal.c line 672
#14 IA__g_closure_invoke
at gclosure.c line 490
#15 signal_emit_unlocked_R
at gsignal.c line 2438
#16 IA__g_signal_emit_valist
at gsignal.c line 2197
#17 IA__g_signal_emit_by_name
at gsignal.c line 2265
#18 _gail_toplevel_remove_child
at gailtoplevel.c line 334
#19 gail_toplevel_hide_event_watcher
at gailtoplevel.c line 303
#20 signal_emit_unlocked_R
at gsignal.c line 2404
#21 IA__g_signal_emit_valist
at gsignal.c line 2197
#22 IA__g_signal_emit
at gsignal.c line 2241
#23 IA__gtk_widget_hide
at gtkwidget.c line 2133
#24 gtk_widget_dispose
at gtkwidget.c line 6647
#25 gtk_window_dispose
at gtkwindow.c line 1762
#26 IA__g_object_run_dispose
at gobject.c line 571
#27 IA__gtk_object_destroy
at gtkobject.c line 388
#28 IA__gtk_widget_destroy
at gtkwidget.c line 1995
#29 gnomemeeting_threads_widget_destroy
at gmdialog.c line 296
#30 GMStunClient::Main
at clients/stun.cpp line 392
#31 PThread::PX_ThreadStart
at tlibthrd.cxx line 1330
#32 start_thread
from /lib/tls/libpthread.so.0
#33 clone
from /lib/tls/libc.so.6

Comment 20 Snark 2006-03-05 11:05:25 UTC

*** Bug 333466 has been marked as a duplicate of this bug. ***

Comment 21 Kjartan Maraas 2006-03-05 12:42:54 UTC

I think this should be reopened now with the new info in comment #19 and #20

Comment 22 Snark 2006-03-24 22:02:35 UTC

*** Bug 335838 has been marked as a duplicate of this bug. ***

Comment 23 Calum Benson 2006-04-26 17:05:19 UTC

Apologies for spam... ensuring Sun a11y folks are cc'ed on all current accessibility bugs.

Comment 24 bill.haneman 2006-06-14 14:59:47 UTC

IS this still happenning with at-spi-1.7.6 and later?

Micheal's change to at-spi/atk-bridge/bridge.c (version 1.86 of that file) may have fixed this...

Comment 25 Frederic Crozat 2006-06-14 15:19:21 UTC

still freezing with at-spi 1.7.7 :(

Comment 26 Snark 2006-07-05 10:48:26 UTC

I would like to know why the reporter changed from NEEDINFO to UNCONFIRMED without comment... did the problem magically disappear ?

I must say that it annoys me greatly that ekiga doesn't work correctly with a11y : after all, if someone has sight issues for example, what better way to keep in touch with people than with voice !?

Comment 27 hanke 2006-07-05 11:36:15 UTC

(In reply to comment #26)
> I would like to know why the reporter changed from NEEDINFO to UNCONFIRMED
> without comment... did the problem magically disappear ?
> 
> I must say that it annoys me greatly that ekiga doesn't work correctly with
> a11y : after all, if someone has sight issues for example, what better way to
> keep in touch with people than with voice !?
 
Hi,

I just observed Frederic already provided the information which was needed in comment #25, so I reopened the bug. I thought it was obvious, but probably should have filled in a separate comment anyway, sorry.

Comment 28 bill.haneman 2006-08-23 11:38:41 UTC

Snark said:
>I must say that it annoys me greatly that ekiga doesn't work correctly with
>a11y : after all, if someone has sight issues for example, what better way to
>keep in touch with people than with voice !?

NO need to be snarky, snark :-)  Unless you want to help us fix this.

Ekiga does a bunch of stuff that most other gtk+ programs don't do, for instance it uses threading heavily, etc.  This means it exposes new and difficult issues that we haven't had to deal with before.  While configuration is important, it's not as bad as freezing during normal operation, since at least there are ugly workarounds available (for instance turning off a11y and having a sighted user configure ekiga).  These workarounds are of course not acceptable long-term, but our resources are very limited where debugging these issues are concerned.  

If you wanted to get more involved in debugging this, that would be much appreciated and would move us forward to a solution faster.

Comment 29 Snark 2006-08-23 13:13:39 UTC

Well, I'm not "snarky" (whatever that means) ; I was just worried to see the state of the bug just change.

I don't know exactly what you would need to debug this... what "bunch of stuff" does ekiga do which other programs don't, apart from the threading (which annoys me too) ?

Comment 30 padraig.obriain 2006-11-15 09:43:31 UTC

I had a look at this using GNOME 2.16 on Solarix x86.

When I start ekiga the first configuration assistant runs. When I reach page 5 NAT Type, a dialog box with Detection in Progress appears. Eventually this stops.

At this time the system is hung. Killing at-spi-regiustryd in an ssh session makes the system usable again.

I have pasted stack trace for ekiga and at-spi-registryd below.


25110:  ekiga
-----------------  lwp# 1 / thread# 1  --------------------
 fef476cb lwp_park (0, 0, 0)
 fef40814 slow_lock (fadf2000, 819b0b0, 0) + 40
 fef4096b mutex_lock_impl (819b0b0, 0) + 14b
 fef40a52 mutex_lock (819b0b0) + 1d
 fee0bd0a gdk_threads_impl_lock (81bb390, fe981b24, 8046a1c, fadf2000, 8046a1c, fe914493) + 36
 fee34ed6 gdk_event_check (81bb390) + 22
 fe914493 g_main_context_check (81bb3d8, 7fffffff, 8546228, 5) + 1a3
 fe914b11 g_main_context_iterate (81bb3d8, 1, 1, 819eb08) + 3bd
 fe915124 g_main_loop_run (8570758) + 1b8
 feb5a2ba gtk_main (8046cf0, 8046b88, feffa7c0, 80bf17d, fe8262e0, 81188db) + b2
 080b2391 main     (1, 8046bcc, 8046bd4) + 485
 080907fa _start   (1, 8046d58, 0, 8046d5e, 8046d75, 8046d9c) + 7a
-----------------  lwp# 2 / thread# 2  --------------------
 fef476cb lwp_park (0, f9c3be38, 0)
 fef41902 cond_wait_queue (823cbf8, 823cbe0, f9c3be38, 0) + 3e
 fef41c97 cond_wait_common (823cbf8, 823cbe0, f9c3be38) + 1e1
 fef41ebc _cond_timedwait (823cbf8, 823cbe0, f9c3beb8) + 4a
 fef41f4b cond_timedwait (823cbf8, 823cbe0, f9c3beb8) + 27
 fef41f88 pthread_cond_timedwait (823cbf8, 823cbe0, f9c3beb8) + 21
 fe768fbc __1cKPSyncPointEWait6MrknNPTimeInterval__i_ (823cbd0, f9c3bf04) + 9c
 fdb379b7 __1cLOpalManagerLGarbageMain6MrnHPThread_i_v_ (823c6c0, 823ddb0, 0) + 5b
 fdb38c59 __1cLOpalManagerVGarbageMain_PNotifierECall6kMrnHPObject_i_v_ (823d600, 823ddb0, 0) + 1d
 fe77f030 __1cNPSimpleThreadEMain6M_v_ (823ddb0, 0, fadf2400, fef8f000, 0, 0) + 40
 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (823ddb0) + 85
 fef47434 _thr_setup (fadf2400) + 52
 fef47690 _lwp_start (fadf2400, 0, 0, 0, 0, 0)
-----------------  lwp# 3 / thread# 3  --------------------
 fef476cb lwp_park (0, 0, 0)
 fef40814 slow_lock (f9a20000, 819b0b0, 0) + 40
 fef4096b mutex_lock_impl (819b0b0, 0) + 14b
 fef40a52 mutex_lock (819b0b0) + 1d
 fee0bd0a gdk_threads_impl_lock (f9b37cbc, 8615208, f9b37cec, f9b37cec, f9b37cec, 80b2740) + 36
 fee0bcad gdk_threads_enter (823cd04, 1, f9b37d04, 0, fed90ba0, 3) + 21
 080b2740 __1cbAgnomemeeting_threads_enter6F_v_ (823cd04, 1, 823ccf4, f9b37d98, 8167e30, 8615178) + 58
 080c8d67 __1cJGMManagerLOnIPChanged6MrnGPTimer_i_v_ (823c6c0, 823ccf4, 1) + 4b
 080cdec5 __1cJGMManagerVOnIPChanged_PNotifierECall6kMrnHPObject_i_v_ (823e080, 823ccf4, 1) + 1d
 fe77c0d7 __1cGPTimerJOnTimeout6M_v_ (823ccf4) + 4f
 fe77c1ad __1cGPTimerHProcess6MrknNPTimeInterval_r1_v_ (823ccf4, f9b37ec0, f9b37ea8) + cd
 fe77c43f __1cKPTimerListHProcess6M_nNPTimeInterval__ (f9b37f38, 8163a68) + 17f
 fe7662b9 __1cTPHouseKeepingThreadEMain6M_v_ (823dfd8, 0, f9a20000, fef8f000, 0, 0) + 39
 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (823dfd8) + 85
 fef47434 _thr_setup (f9a20000) + 52
 fef47690 _lwp_start (f9a20000, 0, 0, 0, 0, 0)
-----------------  lwp# 11 / thread# 11  --------------------
 fef480b7 pollsys  (f59fdae0, 2, 0, 0)
 fef03fde pselect  (2b, 84f02c0, 84f02e8, 84f0310, 0, 0) + 19e
 fef042ee select   (2b, 84f02c0, 84f02e8, 84f0310, 0) + 7e
 fe755673 __1cHPSocketGSelect6Frn0AKSelectList_22rknNPTimeInterval__nIPChannelGErrors__ (f59fddcc, f59fdd78, f59fdd84, f59fdef0) + fb
 fe772529 __1cHPSocketGSelect6Frn0AKSelectList_rknNPTimeInterval__nIPChannelGErrors__ (f59fddcc, f59fdef0) + 65
 fdb6568d __1cPOpalListenerUDPGAccept6MrknNPTimeInterval__pnNOpalTransport__ (84f1480, f59fdef0) + 125
 fdb632ac __1cMOpalListenerUListenForConnections6MrnHPThread_i_v_ (84f1480, 84f1a08, 0) + 194
 fdb6c442 __1cMOpalListenerbEListenForConnections_PNotifierECall6kMrnHPObject_i_v_ (84f1f70, 84f1a08, 0) + 1a
 fe77f030 __1cNPSimpleThreadEMain6M_v_ (84f1a08, 0, f9a21000, fef8f000, fef478ae, 0) + 40
 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (84f1a08) + 85
 fef47434 _thr_setup (f9a21000) + 52
 fef47690 _lwp_start (f9a21000, 0, 0, 0, 0, 0)
-----------------  lwp# 16 / thread# 16  --------------------
 fef476cb lwp_park (0, 0, 0)
 fef41902 cond_wait_queue (85db548, 86026c8, 0, 0) + 3e
 fef41ded _cond_wait (85db548, 86026c8) + 69
 fef41e2b cond_wait (85db548, 86026c8) + 24
 fef41e65 pthread_cond_wait (85db548, 86026c8) + 1e
 fe9af67d giop_recv_buffer_get (f56fcaa8) + c5
 fe9b2b95 ORBit_small_invoke_stub (8232b60, f9df0480, 0, f56fcb9c, 0, f9cd4fe0) + 139
 fe9b2a3d ORBit_small_invoke_stub_n (8232b60, f9df0460, 0, 0, f56fcb9c, 0) + 3d
 fe9c41ea ORBit_c_stub_invoke (8232b60, f9df0460, 0, 0, f56fcb9c, 0) + 12e
 f9dba0b8 Accessibility_EventListener_notifyEvent (8232b60, f56fcbc4, f9cd4fe0) + 4c
 f9cc3519 spi_atk_emit_eventv (8609b60, 0, 0, f56fcc44, f9cc443c, fed0ea50) + 1b1
 f9cc41c2 spi_atk_bridge_window_event_listener (f56fccac, 1, f56fcd4c, 8233378) + 8e
 fedc1b28 signal_emit_unlocked_R (8236300, 0, 8609b60, 0, f56fcd4c) + 5a0
 fedc0fe0 g_signal_emit_valist (8609b60, 91, 0, f56fcfb4) + 8c4
 fedc1175 g_signal_emit (8609b60, 91, 0) + 25
 f9d34591 window_removed (81b5968, 1, 8609b60, 0) + 99
 fedc2df2 g_cclosure_marshal_VOID__UINT_POINTER (8233398, 0, 3, f56fd15c, f56fd0bc, 0) + 5e
 fedae073 g_closure_invoke (8233398, 0, 3, f56fd15c, f56fd0bc) + 107
 fedc1cce signal_emit_unlocked_R (822d4c0, 22f, 81b5968, 0, f56fd15c) + 746
 fedc0fe0 g_signal_emit_valist (81b5968, 72, 22f, f56fd440) + 8c4
 fedc14d7 g_signal_emit_by_name (81b5968, f9d3b234, 1, 8609b60, 0) + 35b
 f9d2db10 _gail_toplevel_remove_child (81b5968, 84a7b98) + bc
 f9d2da46 gail_toplevel_hide_event_watcher (f56fd4cc, 1, f56fd56c, 81b5968) + 5e
 fedc1b28 signal_emit_unlocked_R (82232b8, 0, 84a7b98, 0, f56fd56c) + 5a0
 fedc0fe0 g_signal_emit_valist (84a7b98, 1e, 0, f56fd7d4) + 8c4
 fedc1175 g_signal_emit (84a7b98, 1e, 0) + 25
 fec5caa8 gtk_widget_hide (84a7b98) + 70
 fec62adb gtk_widget_dispose (84a7b98) + 3f
 fec66f0d gtk_window_dispose (84a7b98, 84a7b98, fed45e5c, fec5c804, f56fd868, f56fd86c) + 39
 fedb00e5 g_object_run_dispose (84a7b98) + 3d
 feb7b588 gtk_object_destroy (84a7b98) + 3c
 fec5c834 gtk_widget_destroy (84a7b98) + 30
 080e3aac gnomemeeting_threads_widget_destroy (84a7b98) + 20
 080d7438 __1cMGMStunClientEMain6M_v_ (8602cf8, 0, f9a21c00, fef8f000, fef478ae, 0) + 528
 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (8602cf8) + 85
 fef47434 _thr_setup (f9a21c00) + 52
 fef47690 _lwp_start (f9a21c00, 0, 0, 0, 0, 0)
-----------------  lwp# 9 / thread# 9  --------------------
 fef480b7 pollsys  (8493200, c, 0, 0)
 feeff7c2 poll     (8493200, c, ffffffff) + 52
 fe914aeb g_main_context_iterate (8490e00, 1, 1, 83c7770) + 397
 fe915124 g_main_loop_run (8497dc0) + 1b8
 fe9c9b1e link_io_thread_fn (0) + 1e
 fe92e676 g_thread_create_proxy (83c7770) + 11a
 fef47434 _thr_setup (f9a20800) + 52
 fef47690 _lwp_start (f9a20800, 0, 0, 0, 0, 0)
-----------------  lwp# 10 / thread# 10  --------------------
 fef480b7 pollsys  (f97ada20, 2, 0, 0)
 fef03fde pselect  (28, 84f0248, 84f0270, 84f0298, 0, 0) + 19e
 fef042ee select   (28, 84f0248, 84f0270, 84f0298, 0) + 7e
 fe7681d7 __1cHPThreadLPXBlockOnIO6MiirknNPTimeInterval__i_ (84f12d8, 26, 2, f97adcc4) + 1af
 fe75e05f __1cIPChannelMPXSetIOBlock6Mn0ALPXBlockType_rknNPTimeInterval__i_ (84f1060, 2, f97adcc4) + 153
 fe7554f5 __1cHPSocketJos_accept6Mr0pnIsockaddr_pi_i_ (84f0e00, 84f1060, f97add00, f97ade10) + 29
 fe775432 __1cKPTCPSocketGAccept6MrnHPSocket__i_ (84f0e00, 84f1060) + 72
 fdb64b1c __1cPOpalListenerTCPGAccept6MrknNPTimeInterval__pnNOpalTransport__ (84f1028, f97adef0) + 11c
 fdb632ac __1cMOpalListenerUListenForConnections6MrnHPThread_i_v_ (84f1028, 84f12d8, 0) + 194
 fdb6c442 __1cMOpalListenerbEListenForConnections_PNotifierECall6kMrnHPObject_i_v_ (84f1248, 84f12d8, 0) + 1a
 fe77f030 __1cNPSimpleThreadEMain6M_v_ (84f12d8, 0, f9a20400, fef8f000, fef478ae, 0) + 40
 fe767f05 __1cHPThreadOPX_ThreadStart6Fpv_1_ (84f12d8) + 85
 fef47434 _thr_setup (f9a20400) + 52
 fef47690 _lwp_start (f9a20400, 0, 0, 0, 0, 0)
-----------------  lwp# 8 / thread# 8  --------------------
 fef480b7 pollsys  (8496580, 1, 0, 0)
 feeff7c2 poll     (8496580, 1, ffffffff) + 52
 fe914aeb g_main_context_iterate (8496700, 1, 1, 8496610) + 397
 fe915124 g_main_loop_run (8497f10) + 1b8
 fb3599fb startup_mainloop (0) + 2b
 fe92e676 g_thread_create_proxy (8496610) + 11a
 fef47434 _thr_setup (f9a21400) + 52
 fef47690 _lwp_start (f9a21400, 0, 0, 0, 0, 0)


bash-3.00$ pstack `pgrep regis`
25112:  /usr/lib/at-spi-registryd --oaf-activate-iid=OAFIID:Accessibility_Regi
 fef480b7 pollsys  (80eb038, 7, 0, 0)
 feeff7c2 poll     (80eb038, 7, ffffffff) + 52
 feca4aeb g_main_context_iterate (8072450, 1, 1, 806e8d0) + 397
 feca4d7b g_main_context_iteration (8072450, 1) + 87
 fee099a5 link_main_iteration (1) + 21
 fedef617 giop_recv_buffer_get (8046618) + 5f
 fedf2b95 ORBit_small_invoke_stub (80eaeb8, fec51b80, 0, 0, 0, 804672c) + 139
 fedf2a3d ORBit_small_invoke_stub_n (80eaeb8, fec51b5c, 0, 0, 0, 0) + 3d
 fee041ea ORBit_c_stub_invoke (80eaeb8, fec51b5c, 0, 0, 0, 0) + 12e
 fec396f5 Bonobo_Unknown_ref (80eaeb8, 804672c) + 41
 fc9b66da bonobo_object_dup_ref (80eaeb8, 0) + 5a
 080589bb registry_filter_event (8076818, 8046788, 80469bc) + 187
 08058b01 impl_registry_notify_event (807682c, 80eddac, 80469bc) + 85
 fee6b62a _ORBIT_skel_small_Accessibility_EventListener_notifyEvent (807682c, 0, 8046850, 8046870, 80469bc, 8058a7c) + 16
 fee0112a ORBit_POAObject_invoke (8099c88, 0, 8046850, 8046870, 80468f8, 80469bc) + 22
 fee05519 ORBit_OAObject_invoke (8099c88, 0, 8046850, 8046870, 80468f8, 80469bc) + 21
 fedf2f1b ORBit_small_invoke_adaptor (8099c88, 80ee5d0, fee90480, 80468f8, 80469bc) + 2b3
 fee01562 ORBit_POAObject_handle_request (8099c88, 80f3104, 0, 0, 0, 80ee5d0) + 32a
 fee01960 ORBit_POAObject_invoke_incoming_request (8099c88, 80ee5d0, 80469bc) + 54
 fee01de6 ORBit_POA_handle_request (8073070, 80ee5d0, 80ee5e8) + 2ea
 fee05317 ORBit_handle_request (8072ff8, 80ee5d0) + 4b
 fedf014a giop_connection_handle_input (8089108) + 2e2
 fee0b4e9 link_connection_io_handler (0, 1, 8089108) + 55
 fee0d10d link_source_dispatch (80ea050, fee0b494, 8089108) + 41
 feca3615 g_main_dispatch (80724d8) + 1d9
 feca4705 g_main_context_dispatch (80724d8) + 85
 feca4b22 g_main_context_iterate (80724d8, 1, 1, 806e8d0) + 3ce
 feca5124 g_main_loop_run (808efb8) + 1b8
 fc9b201d bonobo_main (8046d8c, 8046c38, feffa7c0, feffa7c0, 8046c38, 8046d8c) + 5d
 08057ddf main     (3, 8046c7c, 8046c8c) + cb
 08054b4a _start   (3, 8046df4, 8046e0e, 8046e43, 0, 8046e53) + 7a

Comment 31 padraig.obriain 2006-11-15 10:27:35 UTC

I compiled at-spi-registryd without optimization and got the stack trace below.

I assume that that at-spi-registryd is attempting to call back into the application. 

I have just noticed that thread 16 is waiting on a condition variable instead of being in a poll call. I wonder why...

27065:  /usr/lib/at-spi-registryd
 fef480b7 pollsys  (80a29a8, 7, 0, 0)
 feeff7c2 poll     (80a29a8, 7, ffffffff) + 52
 fed14aeb g_main_context_iterate (807a750, 1, 1, 8076f18) + 397
 fed14d7b g_main_context_iteration (807a750, 1) + 87
 fec699a5 link_main_iteration (1) + 21
 fec4f617 giop_recv_buffer_get (8046548) + 5f
 fec52b95 ORBit_small_invoke_stub (80f36a8, fecc1b80, 0, 0, 0, 804665c) + 139
 fec52a3d ORBit_small_invoke_stub_n (80f36a8, fecc1b5c, 0, 0, 0, 0) + 3d
 fec641ea ORBit_c_stub_invoke (80f36a8, fecc1b5c, 0, 0, 0, 0) + 12e
 feca96f5 Bonobo_Unknown_ref (80f36a8, 804665c) + 41
 fc9b66da bonobo_object_dup_ref (80f36a8, 0) + 5a
 0805f1fb registry_clone_notify_context (80466d8) + 2b
 0805f722 registry_queue_event (807e818, 80466d8) + 42
 0805f827 registry_filter_event (807e818, 80466d8, 80468ec) + c7
 0805f9a9 impl_registry_notify_event (807e82c, 80f6a3c, 80468ec) + 89
 fee6b62a _ORBIT_skel_small_Accessibility_EventListener_notifyEvent (807e82c, 0, 8046780, 80467a0, 80468ec, 805f920) + 16
 fec6112a ORBit_POAObject_invoke (80a1f50, 0, 8046780, 80467a0, 8046828, 80468ec) + 22
 fec65519 ORBit_OAObject_invoke (80a1f50, 0, 8046780, 80467a0, 8046828, 80468ec) + 21
 fec52f1b ORBit_small_invoke_adaptor (80a1f50, 80f30e0, fee90480, 8046828, 80468ec) + 2b3
 fec61562 ORBit_POAObject_handle_request (80a1f50, 80f5e2c, 0, 0, 0, 80f30e0) + 32a
 fec61960 ORBit_POAObject_invoke_incoming_request (80a1f50, 80f30e0, 80468ec) + 54
 fec61de6 ORBit_POA_handle_request (807b370, 80f30e0, 80f30f8) + 2ea
 fec65317 ORBit_handle_request (807b2f8, 80f30e0) + 4b
 fec5014a giop_connection_handle_input (8091108) + 2e2
 fec6b4e9 link_connection_io_handler (0, 1, 8091108) + 55
 fec6d10d link_source_dispatch (80f3088, fec6b494, 8091108) + 41
 fed13615 g_main_dispatch (807a7d8) + 1d9
 fed14705 g_main_context_dispatch (807a7d8) + 85
 fed14b22 g_main_context_iterate (807a7d8, 1, 1, 8076f18) + 3ce
 fed15124 g_main_loop_run (80f2b38) + 1b8
 fc9b201d bonobo_main (8075e20, 807e818, 807bf08, 807bf0a, 807bf08, 8061484) + 5d
 0805e109 main     (1, 8046bb0, 8046bb8) + 119
 08058d1a _start   (1, 8046d3c, 0, 8046d56, 8046d6d, 8046d94) + 7a

Comment 32 Michael Meeks 2006-11-16 19:50:25 UTC

OK - this is a quite typical threading related evilness. Ekiga is highly threaded and uses this in the GUI parts AFAIR.

The problem is this - when we designed ORBit2 it was not possible to tell whether there was a 'glib main loop' running - so this is a little messy.

AFAIR the punch-line is, you need to have a gtk+ mainloop running somewhere so we can do the moral equivalent of g_idle_add (process_message); in the I/O thread as/when a request comes in for a thread-unsafe [ ie. all ATK interfaces ] object. Now from trace #19 [ from Fedrich Crozay ] which is most helpful we see:

+ Trace 87038

Thread 1 (Thread -1247610016 (LWP 3812))

#0 __kernel_vsyscall
#1 __lll_mutex_lock_wait
from /lib/tls/libpthread.so.0
#2 _L_mutex_lock_36
from /lib/tls/libpthread.so.0
#3 ??
#4 ??
from /home/gnome/prefix214/lib/libgdk-x11-2.0.so.0
#5 ??
#6 ??
#7 ??
#8 gdk_threads_impl_lock
at gdk.c line 377
#9 gdk_threads_impl_lock
at gdk.c line 377
#10 gdk_event_check
at gdkevents-x11.c line 2262
#11 IA__g_main_context_check
at gmain.c line 2422
#12 g_main_context_iterate
at gmain.c line 2544
#13 IA__g_main_loop_run
at gmain.c line 2751
#14 IA__gtk_main
at gtkmain.c line 991


ie. the guy we are waiting for to become idle and process the incoming request on the CORBA wrapper for the underlying AtkObject is blocking on the GDK_THREADS_ENTER/LEAVE lock we hold in thread 11 instead.

That is bad. ;-(

Unfortunately (of course) the at-spi-registryd needs to take a reference on that object before it can allow the 'notifyEvent' to return, which necessitates calling back into the client; which needs to be handled on the main thread, which needs the same threads lock.

Comment 33 Michael Meeks 2006-11-16 20:01:02 UTC

Probably, the simplest (and perhaps) least risky fix is to add:

+ Trace 87041

#7 spi_atk_emit_eventv
at bridge.c line 677


GDK_THREADS_LEAVE();
Accessibility_Registry_notifyEvent(...);
GDK_THREADS_ENTER();

immediately around the long-running CORBA calls here. Of course at the same time it would pay to do an audit of the code in the bridge to ensure we hold references / dup state where we need to, but a quick glance suggests most things are fine there.

HTH :-)

Comment 34 Snark 2006-11-16 20:19:33 UTC

Wouldn't my idea : http://bugzilla.gnome.org/show_bug.cgi?id=353533
solve the issue too?

Comment 35 Damien Sandras 2006-11-16 20:33:40 UTC

Snark: Your idea is a workaround to the fact that there is a bug in atk. So, yes, after 2 or 3 months of reworking inside Ekiga, we would have hidden the bug.

Comment 36 bill.haneman 2006-11-16 21:52:59 UTC

Michael: That makes sense, and thanks also Padraig.  I'll start preparing a patch to add GDK_THREADS guards around the notify emissions in atk-bridge.

BTW Damien, despite the "atk" namespace in the code, it's at-spi (the "atk-bridge" component") that seems to have the bug.

Comment 37 Michael Meeks 2006-11-17 07:59:49 UTC

oh, and while I was concerned about re-enterancy problems - they have (broadly) always existed at these points, so this should be no issue for us in fact [ but worth checking of course anyway ;-].

Comment 38 padraig.obriain 2006-11-28 09:03:14 UTC

*** Bug 380010 has been marked as a duplicate of this bug. ***

Comment 39 Li Yuan 2006-12-11 06:26:59 UTC

*** Bug 369321 has been marked as a duplicate of this bug. ***

Comment 40 Harry Lu 2006-12-13 09:33:27 UTC

Li Yuan will try to make the patch to add GDK_THREADS guards around the notify emissions in atk-bridge and do some testing to see if it works.

Comment 41 padraig.obriain 2006-12-13 09:54:47 UTC

One problem is that the bridge currently does not depend on GDK and it is not clear that we should be adding such a dependency to the bridge.

Comment 42 Michael Meeks 2006-12-14 12:52:51 UTC

Interesting :-) good question wrt. what to do.
You could -try- to push-down the gtk+ locking code into atk: of course you need to be careful, OO.o is tightly bound to gtk+'s custom locking functions.

At worst you can do some fugly OO.o style hack where you do a dlsym on yourself to see if you have gdk_threads_enter / leave, hook those functions out and call them yourself.

How does that sound ? :->

Comment 43 Snark 2006-12-14 13:12:10 UTC

Outsider view : insane.

Comment 44 bill.haneman 2006-12-14 13:26:54 UTC

Padraig is correct - atk-bridge should not depend on GDK.

Why isn't the GDK_THREAD stuff in glib instead, I wonder?

Comment 45 bill.haneman 2006-12-14 13:46:54 UTC

Another possibility - which fits better with the ATK/GAIL/ATK-bridge design I think, but which would require quick action since Gnome 2.17 ABI freeze is very soon:

1) toolkit dependencies should be restricted to the ATK implementation code; thus
   gdk/gtk+ dependencies should be restricted to gail.  

2) specifically we've used AtkUtil API in the past to deal with this sort of issue, when an atk client like atk-bridge needs to access implementation-dependent code.

Thus we could add ATK_THREADS_ENTER/ATK_THREADS_LEAVE, which would wrap
some AtkUtil function pointers which were assigned in the atk initialization code, for instance something like atk_util_threads_enter (util)

gailutil.c could then be modified to provide the appropriate implementation for atk_util_threads_ methods.  Apps/toolkits wishing to use ATK without GDK/GTK+ could then provide their own appropriate thread locking code.

Because AtkUtil doesn't seem to have any unused vfunc pointers in its AtkUtilClass struct, we might need to do something creative in order to add these virtual methods.  But since currently we don't instantiate any AtkUtil objects, I would hope this would be feasible.  At worst we could use some kind of versioning/dlsym approach to detect whether the new version of ATK was in use (preserving back-compat).  We need to change/replace (via deprecation) the signatures on the AtkUtil prototypes anyway, because the current prototypes (which don't take object instances as their first parameters) are difficult to bind with python. (Ask 'ddrake at brontes3d dot com' for more info about  that)

Sound OK?

Bill

Comment 46 Snark 2006-12-14 13:52:52 UTC

Glib has nothing to do with gui.

Comment 47 Michael Meeks 2006-12-14 14:10:46 UTC

Yes - that would be do-able I guess.
Of course, there should be no need to break AtkUtil - surely you can add another AtkMisc [ or whatever ] object to contain these new methods, with a great big gpointer expansion[32] or whatever] in the Class ? :-)

Comment 48 bill.haneman 2006-12-14 17:04:44 UTC

Snark: Glib would be a sensible place for a general purpose threads utility.  As this bug demonstrates, code that is called from inside a GDK/GTK+ program may need this without wanting to incur a gdk dependency.  ATK/atk-bridge is just finding it first, but something like a threaded media lib could easily encounter the same sort of issue.

Comment 49 Snark 2006-12-14 17:53:52 UTC

Bill: glib already has threading utilities. Ekiga is heavily threaded media application, which is why the problem showed up there. But it protects dangerous paths with _threads_enter/_leave calls. Obviously there's something somewhere which doesn't.

Comment 50 Harry Lu 2006-12-26 01:42:17 UTC

At a second thought, there might be a problem in Michael's suggestions in comment #33.

If we do sth. like:

GDK_THREADS_LEAVE();
Accessibility_Registry_notifyEvent(...);
GDK_THREADS_ENTER();

It might work with ekiga's code:
GDK_THREADS_ENTER();
gtk_widget_destroy (widget);
GDK_THREADS_LEAVE();

However, if the outside code doesn't call GDK_THREADS_ENTER() as in most cases, calling GDK_THREADS_ENTER() in bridge will mark the beginning of a critical section without its end pair. I think this will cause bigger problems.

Or did I missing sth?

Comment 51 bill.haneman 2007-01-02 11:05:22 UTC

Harry - I am not sure I understand you, but I think you are pointing out that the in-process code which handles the event from Accessibility_Registry_notifyEvent() needs to call GDK_THREADS_ENTER/LEAVE in order for the above code to work correctly.  Is this what you meant?

Anyway, we CANNOT use GDK_THREADS_ENTER/LEAVE here, since the atk-bridge must not depend on GDK.  If we need to use thread guards here, they must be added to ATK API instead.  We have only a few days to do this for gnome-2.17.  However I don't think we have a clear solution to the problem yet, as Harry points out.

There are a couple of things I am not clear about yet; 
1) why doesn't the notifyEvent call complete here - i.e. what code is causing thread 1 to be called into?   i.e. what ATK call is being fulfilled in thread 1?  Can it be fulfilled safely without holding the GDK lock?

In response to Harry's question in comment #50 - should we be putting thread guards in all atk-bridge wrappers, or in all the GAIL implementations?  That would ensure that the appropriate GDK lock was held.  If we put the thread guards (GDK_THREADS_ENTER/LEAVE) in all the gail implementations, then we could also release the lock in gail just before emitting ATK events and re-claim the lock afterwards (this might avoid any need for new ATK locking API).

What do you think of this last suggestion, Harry and Li?

Comment 52 Michael Meeks 2007-01-02 12:27:04 UTC

Harry - there is little to ~no risk to dropping the GDK_THREADS lock while we do a remote CORBA call, if other code is not taking the lock correctly - that is broken anyway.

Bill - wrt. the things you're not clear about: did you read my analysis in conjunction with trace #19 ? - the reason the notifyEvent call doesn't complete is that it is synchronous - somewhere during the processing of that synchronous event an AT (or something) wants to call -into- this process again. Unfortunately at some stage before that call could enter an X event came along to be processed: thread 11 is holding the GDK mutex that thread 1 needs to acquire in order to execute that incoming method ergo deadlock. Ergo we never get to processing the incoming CORBA call anyway. However, this is not really X specific at all - we -should- be taking the GDK_THREADS mutex on every (idle) incoming CORBA call anyhow [ I guess ] - most likely we forget to do that ATM ;-) but we should be doing it.

Ergo - this needs careful thought, analysis of the traces, a little understanding, and a way to move the locking primitives to somewhere where they can be used consistently in the bridge. Having that in 'gail' sounds sensible, ultimately I think we just need 2 new methods 'atk_enter' 'atk_leave' (or somesuch) that are implemented in gail, and call them at suitable places.

Comment 53 Harry Lu 2007-01-02 14:28:13 UTC

Bill, as Michael said in above, in comment #19, thread 1 is ekiga's main thread and is calling gtk_main(). Thread 11 is calling some code like:

GDK_THREADS_ENTER();
gtk_widget_destroy (widget);
GDK_THREADS_LEAVE();

And from Michael's analysis you can see the deadlock reason.



Michael, from your comment, "there is little to ~no risk to dropping the GDK_THREADS lock while we do a remote CORBA call, if other code is not taking the lock correctly - that is broken anyway." Do you mean we can call gdk_thread_leave() in a thread even if it doesn't call gdk_thread_enter() before? From gdk.c, gdk_threads_impl_unlock() is calling g_mutex_unlock (). And in glib/gthread/gthread-impl.c, g_mutex_unlock_errorcheck_impl () will lead to g_error or g_warning if current thread doesn't lock the mutex now.


Michael and Bill, are you suggesting we add gdk_thread_enter() and gdk_thread_leave() in gail's functions which are calling CORBA functions? Also from the code above, it seems a thread can not call gdk_thread_enter() again if it has called gdk_thread_enter() before.

So if ekiga's code has called GDK_THREADS_ENTER(), we cannot call it again in gail or bridge in the same thread. 

Is there a way to check whether current thread has called gdk_thread_enter() already (holding the mutex)?

Comment 54 Michael Meeks 2007-01-02 14:48:59 UTC

Harry; yes we need to call gdk_thread_leave() in code that didn't -take- that mutex; and of course we need to call gdk_thread_enter() again when we've done whatever call we want to release the guard during. This is normal (though possibly counter-intuitive) cf.
eg.
gtk/gtkmain.c (gtk_main):
    {
      GDK_THREADS_LEAVE ();
      g_main_loop_run (loop);
      GDK_THREADS_ENTER ();
      gdk_flush ();
    }

Comment 55 Harry Lu 2007-01-03 15:01:24 UTC

Michael, I guess the code above works because the thread is holding the gdk mutex already.  Anyway, any threads that dealing with gtk should hold the gdk mutex already so I think this should not be a problem.

So do we still need to add thread guards (GDK_THREADS_ENTER/LEAVE) in gail as it should be added by the thread calling gtk functions?

Calling GDK_THREADS_LEAVE() to release the lock in gail just before emitting ATK events and calling GDK_THREADS_ENTER() to re-claim the lock afterwards might be a possible solution, as Bill suggested.

For this bug, from thread 11 in comment 19, around stack 12, I think adding LEAVE and ENTER pair around g_signal_emit() in window_removed() of gailutil.c should work.

But how could we avoide signal nesting problem? I still think we cannot call GDK_THREAD_LEAVE twice in one thread without calling GDK_THREAD_ENTER. So if one ATK signal triggers another ATK signal, there might be a problem.

Comment 56 Michael Meeks 2007-01-03 17:49:06 UTC

Hi Harry.
I added some GDK_THREADS guards in gail some time ago when we started using gail/atk in OO.o - I -think- I caught almost all places.
In summary, the only place you need to add them is where the lock is not already held - eg. g_idle / g_timeout callbacks.

And yes - taking it when you don't need to is a problem; it's a non-recursive mutex. I'm confused wrt. your suggestion wrt. window_removed, AFAICS this is gail code called by gtk+ code - and should => be covered by the same mutex. There is no need to do any LEAVE/ENTER stuff there (that I can see (?)). At least - signals themselves should not be a problem.

The only place we need to drop / re-take the lock is over long CORBA calls (eg. event emission) that in order to complete may require processing incoming calls on other threads [ ie. mostly isolated to at-spi/bridge surely ? ]

Comment 57 Harry Lu 2007-01-04 10:53:38 UTC

Michael, we will do a check to ensure all g_idle / g_timeout callbacks have the GDK_THREADS guards in gail.

As for my suggestion in comment #55, actually it was suggested by Bill in last paragraph of comment #51, "If we put the thread guards (GDK_THREADS_ENTER/LEAVE) in all the gail implementations, then we could also release the lock in gail just before emitting ATK events and re-claim the lock afterwards (this might avoid any need for new ATK locking API)."

The problem for the above suggestion is that signals could be nested, so we could not just wrap around all g_signal_emit* by GDK_THREADS_LEAVE and GDK_THREADS_ENTER in gail.

So it looks we have to do sth. in bridge. We have several options here:

1. in bridge, "do a dlsym on yourself to see if you have gdk_threads_enter / leave, hook those functions out and call them yourself."

2. add atk_util_threads_enter/leave pair in atkutil.c, implement them in gail using gdk_threads_enter/leave, and call them in bridge. We might be able to fix the python binding problems at the same time. This will break ABI, right?

3. add atk_misc_threads_enter/leave pair in new atkmisc.c as suggested by Michael, implement them in gail, and call them in bridge.

So which one should we choose?

Comment 58 Michael Meeks 2007-01-04 11:29:54 UTC

personally, I like '3' - ultimately the Qt guys will have the same problem in future, and it'd be nice to have API for them to use. Also - if we find a better way in future, it's easy to deprecate these methods :-)

Finally - when you create AtkMisc - please do add gpointer dummy[32]; or something - the cost of that is only 128/256 bytes per process if an AtkMisc is instantiated, and it can save a lot of hassle in future :-)

HTH.

Comment 59 bill.haneman 2007-01-08 15:58:38 UTC

Created attachment 79758 [details] [review]
ATK patch as applied

applied to cvs Jan 8, 2006, and in ATK_1_13_0 tarball.  ATK_1_13_1 tarball (to be released later today) will have some minor fixups.

Comment 60 bill.haneman 2007-01-08 16:15:44 UTC

Created attachment 79760 [details] [review]
proposed gail patch, which includes GDK_THREADS_ENTER implementation.

Li, does this seem correct to you?

Comment 61 bill.haneman 2007-01-08 16:38:20 UTC

gail patch has been committed to gail 1.9.5.

Comment 62 Li Yuan 2007-01-15 09:54:46 UTC

I will make bridge patch this week.

Comment 63 Li Yuan 2007-01-18 12:14:35 UTC

Created attachment 80585 [details] [review]
atk part of patch

Comment 64 Li Yuan 2007-01-18 12:15:05 UTC

Created attachment 80586 [details] [review]
gail part of patch

Comment 65 Li Yuan 2007-01-18 12:15:40 UTC

Created attachment 80587 [details] [review]
spi part of patch

Comment 66 bill.haneman 2007-01-18 12:20:45 UTC

Comment on attachment 80585 [details] [review]
atk part of patch

Why do we need to do ATK_MISC_GET_CLASS instead of g_type_class_ref(ATK_TYPE_MISC) here?

Comment 67 bill.haneman 2007-01-18 12:22:23 UTC

Comment on attachment 80586 [details] [review]
gail part of patch

yes, I agree we probably need to unref/ref GAIL_TYPE_MISC as well as GAIL_TYPE_UTIL here.

Comment 68 Li Yuan 2007-01-18 12:38:03 UTC

because we use 
AtkMiscClass *miscclass = ATK_MISC_CLASS (klass);
in gail_misc_class_init.
This only overload "gail_misc_class", not "atk_misc_class".
They are different instances.

If we use g_type_class_peek (ATK_TYPE_MISC);
in gail_misc_class_init, then we can use
g_type_class_ref (ATK_TYPE_MISC);
here.

Comment 69 Snark 2007-01-19 04:43:08 UTC

Is it FIXED like in "proposed solution committed and tested in ekiga : it doesn't freeze anymore", or just FIXED like in "the proposed solution was committed" ? 

There was no comment in the last mail, so I prefer asking...

Comment 70 Li Yuan 2007-01-19 04:48:00 UTC

I tested on my system, it doesn't freeze anymore.

Comment 71 Snark 2007-01-19 05:15:54 UTC

Wonderful!

It was terrifying to have a phone not working for people it could help much...

That rocks!

Comment 72 Damien Sandras 2007-01-19 11:27:30 UTC

Very good job, thanks so much !!!!!!

Comment 73 hanke 2007-01-19 13:50:28 UTC

Thank you.

Comment 74 André Klapper 2007-02-23 09:47:27 UTC

Li Yuan, Bill, can you please take a look at bug 410636? thanks.

Comment 75 bill.haneman 2007-02-23 12:41:37 UTC

Andre I believe 410636 is fixed in cvs already.

Comment 76 Matthias Clasen 2007-02-26 06:22:28 UTC

While the committed fix may have fixed the ekiga problem, it breaks a large number of other threaded gtk applications. E.g gftp freezes now if a11y is enabled.
This is because it is common to do the following:

g_threads_init (NULL);
gdk_threads_init ();

gtk_init (&argc, &argv);

/* setup code */

GDK_THREADS_ENTER();
gtk_main ();
GDK_THREADS_LEAVE();


The problem is that the a11y stack is not doing thread leave/enter pairs
before the application code calls GDK_THREADS_ENTER. Compiling with errorcheck
mutexes clearly exposes that the a11y code tries to unlock an unlocked mutex.

Maybe this is the correct way to handle things, and we just have to fix all
applications to take the lock immediately after initializing threads. 

Anyway, reopening this for further consideration, since it does break applications 
which have worked with all previous GTK+ releases.

Comment 77 Harry Lu 2007-02-26 07:19:17 UTC

From the samples in http://developer.gnome.org/doc/API/2.0/gdk/gdk-Threads.html,
the correct call order is:

g_thread_init (NULL);
gdk_threads_init ();
gdk_threads_enter ();
gtk_init (&argc, &argv);
gtk_main ();
gdk_threads_leave ();

So I think we should fix all the broken apps.

Comment 78 Matthias Clasen 2007-02-26 13:48:18 UTC

I know that that's what the example shows. The point is that 

g_thread_init (NULL);
gdk_threads_init ();
gtk_init (&argc, &argv);
gdk_threads_enter ();
gtk_main ();
gdk_threads_leave ();

did work just fine until recently...

Comment 79 André Klapper 2007-02-26 14:50:18 UTC

of course we could fix all broken apps, but i wonder if two weeks before a major release is the best time for it...

right now i'm just interested in a smooth workaround (or better: solution) for a stable GNOME 2.18 (and also the GNOME apps that are not directly part of GNOME, but that are shipped by the distros).

Comment 80 Matthias Clasen 2007-02-27 06:39:21 UTC

Another application that breaks due to this is the mail-notification applet for evolution.

Comment 81 Michael Meeks 2007-02-27 10:09:30 UTC

So - surely one simple fix is to avoid doing any locking during 'init' time - right ? :-)

While we are initializing there is almost certainly no likelihood that some horribly complicated threading scenario is underway: so, we can simply have some guard against this to inhibit the locking there.

Would that fix it ? :-)

Comment 82 Michael Meeks 2007-02-27 10:15:46 UTC

So - code wise this should be trivial; if I could compile / test it I'd do it myself:

static int during_init_shutdown = 0;
#define bridge_threads_leave() \
    if (!during_init_shutdown) atk_misc_threads_leave(misc)
#define bridge_threads_enter() \
    if (!during_init_shutdown) atk_misc_threads_enter(misc)

Then do a search/replace of the existing locking calls,
At top/tail of atk_bridge_init inc/dec 'during_init_shutdown'
and similarly at top/tail of 'spi_ati_bridge_exit_func' and 'gnome_accessibility_module_shutdown'

HTH.

Comment 83 Li Yuan 2007-02-27 10:44:53 UTC

The application doesn't hang when atk-bridge init. You know, gftp also does some setup work before it calls GDK_THREADS_ENTER, like

g_threads_init (NULL);
gdk_threads_init ();

gtk_init (&argc, &argv);

/*...*/
gtk_widget_show(window);

GDK_THREADS_ENTER();
gtk_main ();
GDK_THREADS_LEAVE();

The gtk_widget_show will emit signals to atk-bridge, this will hang the application. And we can not tell which signal is before the first GDK_THREADS_ENTER and which is after it.

So I think the best way is changing the application's init code.

Comment 84 Matthias Clasen 2007-02-27 15:19:42 UTC

Michael, I think it is pretty much impossible for a library/module deep in the stack to know if the app has already taken the lock or not.

One could try to check if we are in a mainloop already, but that sounds somewhat error-prone, too.

Comment 85 Michael Meeks 2007-02-27 15:52:31 UTC

Ah - yes, this is horribly difficult then :-) and yes - basically it's a grotesquely evil problem. FWIW, we suffer almost the same problem with CORBA calls being called without a mainloop running & not know if there is such a loop, and so on.

What you might be able to do is use the new:

gboolean g_main_context_is_owner (GMainContext *context);

which is the method I wanted to try and work around / detect / emulate some of the old behaviors in the threaded ORB. It should tell you if you're a child of the main glib mainloop, which -might- be what you want to find out :-)

It seems unreasonable to expect people to take the toolkit lock as the 1st step - at least, so much code doesn't do that that ...

Comment 86 Li Yuan 2007-02-28 10:46:55 UTC

I have tried this. Unfortunately g_main_context_is_owner will return FALSE if application has not called gtk_main. So it can not resolve our problems.

Comment 87 Matthias Clasen 2007-02-28 12:45:11 UTC

Well, the imporant thing is that it returns TRUE if you are inside the main loop,
which is the only place where we can be reasonably sure that the app will hold the 
GDK lock, so it is safe to do the leave/enter thing.

Comment 88 Michael Meeks 2007-02-28 12:47:11 UTC

Li - sure, but when you havn't called gtk_main is surely precisely what you want to know ? :-) [ such that you don't drop / take the lock ? ;-]

Sigh - perhaps that's not what you want: perhaps a -better- approach is to guard the init (as we discussed before), and do:

g_idle_add (post_init, NULL);

and the post_init turns on the locking: that would perhaps work better (including for all the strange threaded scenarios where g_main_context_is_owner will not be happy at all ;-).

How does that sound ?

Comment 89 Li Yuan 2007-03-01 03:59:33 UTC

It sounds great! :-)
I will attach a patch.

Comment 90 Li Yuan 2007-03-01 04:00:30 UTC

Created attachment 83613 [details] [review]
new patch

Comment 91 Li Yuan 2007-03-01 04:01:00 UTC

Does it look good?

Comment 92 Michael Meeks 2007-03-01 14:37:49 UTC

Looks fine to me - though not an ideal solution of course, but compatible. Worth checking your g_idle_add function triggers only once [ I don't recall the polarity of the return value ]; otherwise great.

Comment 93 Li Yuan 2007-03-02 09:28:37 UTC

Committed, thanks Michael and Matthias!

Comment 94 Joanmarie Diggs (IRC: joanie) 2007-03-02 16:02:39 UTC

I just tried the patch and Ekiga no longer freezes when accessibility is on.  But if Orca is running, Ekiga freezes at step 4.  Not sure if that's an at-spi issue or an Orca issue....

Comment 95 Snark 2007-03-02 16:49:46 UTC

Could you remind us which step the fourth is ?

Comment 96 Joanmarie Diggs (IRC: joanie) 2007-03-02 16:58:21 UTC

Connection Type

Comment 97 Snark 2007-03-02 17:09:46 UTC

That means a place where we show a popup window with a progression bar from another thread :-)

Since this is not a problem when no accessibility is used... probably this bug should be reopened.

Comment 98 Li Yuan 2007-03-03 07:07:09 UTC

Are you sure the freeze only happens with the patch? I should not be the patch's problem...

Comment 99 Michael Meeks 2007-03-03 16:35:35 UTC

Nice: of course, it'd be rather useful to get a new stack-trace for step 4: it's most likely a different problem - hopefully something somewhat different :-)

Comment 100 Li Yuan 2007-03-07 10:32:48 UTC

Yes I got the freeze today. And I found the problem is if Orca does some corba calls to ekiga, the handle function (impl_*) in ekiga will not be in gdk_threads_enter/leave guards. So if the handle function emit some signals, our bridge will call gdk_threads_leave firstly.

Here is the trace:

              libgdk-x11-2.0.so.0.1000.9`gdk_threads_impl_unlock+0x3b
              libatk-1.0.so.0.1709.1`atk_misc_threads_leave+0x4c
              libatk-bridge.so`spi_atk_emit_eventv+0x206
              libatk-bridge.so`spi_atk_bridge_property_event_listener+0x5d7
              libgobject-2.0.so.0.1200.9`signal_emit_unlocked_R+0x5a0
              libgobject-2.0.so.0.1200.9`g_signal_emit_valist+0x8c4
              libgobject-2.0.so.0.1200.9`g_signal_emit+0x25
              libatk-1.0.so.0.1709.1`atk_object_notify+0xa3
              libgobject-2.0.so.0.1200.9`g_cclosure_marshal_VOID__PARAM+0x5b
              libgobject-2.0.so.0.1200.9`g_type_class_meta_marshal+0x40
              libgobject-2.0.so.0.1200.9`g_closure_invoke+0x107
              libgobject-2.0.so.0.1200.9`signal_emit_unlocked_R+0x414
              libgobject-2.0.so.0.1200.9`g_signal_emit_valist+0x8c4
              libgobject-2.0.so.0.1200.9`g_signal_emit+0x25
              libgobject-2.0.so.0.1200.9`g_object_dispatch_properties_changed+0x4e
              libgobject-2.0.so.0.1200.9`g_object_notify_dispatcher+0x17
              libgobject-2.0.so.0.1200.9`g_object_notify_queue_thaw+0xc9
              libgobject-2.0.so.0.1200.9`g_object_notify+0x17f
              libatk-1.0.so.0.1709.1`atk_object_set_name+0xeb
              libbonoboui-2.so.0.0.0`grip_item_a11y_initialize+0x2b
              libatk-1.0.so.0.1709.1`atk_object_initialize+0x92
              libbonoboui-2.so.0.0.0`bonobo_a11y_set_atk_object_ret+0x22
              libbonoboui-2.so.0.0.0`bonobo_dock_item_grip_get_accessible+0xdb
              libgtk-x11-2.0.so.0.1000.9`gtk_widget_get_accessible+0x39
              libgail.so`gail_container_ref_child+0x153
              libatk-1.0.so.0.1709.1`atk_object_ref_accessible_child+0x94
              libspi.so.0.10.11`impl_accessibility_accessible_get_child_at_index+0x7d
              libspi.so.0.10.11`_ORBIT_skel_small_Accessibility_Accessible_getChildAtIndex+0x38
              libORBit-2.so.0.1.0`ORBit_POAObject_invoke+0x22
              libORBit-2.so.0.1.0`ORBit_OAObject_invoke+0x21


since all the corba call will be handled in main_loop, so I guess we can use g_main_context_is_owner now :)

I will attach a patch, if it is OK, I will ask for a hard code freeze break.

Comment 101 Li Yuan 2007-03-07 10:34:49 UTC

Created attachment 84148 [details] [review]
patch to handle the "corba call" freeze.

Comment 102 Snark 2007-03-07 10:42:27 UTC

Sigh...

Comment 103 Michael Meeks 2007-03-07 11:00:22 UTC

Well - Li - the problem (really) is that all the incoming CORBA calls need to take the GDK_THREADS lock as they enter, and release it as they leave I think. That is quite a biggish patch to at-spi/libspi/ but that's conceptually correct.

Simply not taking the GDK_THREADS lock when we own the GMainContext doesn't get us there really.

OTOH - if anyone is playing games where in gtk+ code they do:

Accessible_doFoo () on an in-process implementation (as they can) things will wedge: since the GDK_THREADS mutex is non-recursive [ why, oh why ? for VCL it is recursive and life is so much better ]. Anyhow - I think that situation should be extremely unlikely [ the API is ~only used remotely ] and of course it's only a problem in a threaded environment I guess.

Of course, the good news is that for other threaded apps (like OO.o) this will make everything rather more robust too. Thanks Li & sorry for the pain.

PS. I would wrap the threads enter/leave in a similar fashion for use in at-spi/libspi FWIW.

Comment 104 Li Yuan 2007-03-07 12:57:39 UTC

Thanks Mechael. 
Do you have an example which would break applications if we don't have a lock in CORBA calls?

Comment 105 Michael Meeks 2007-03-07 14:55:39 UTC

Hi Li - sure; that's easy:

Thread 1: g_lib_main() -> Accessible_getChildCount() -> g_list_length (a11y_list)
Thread 2: GDK_THREADS_ENTER() ... a11y_list = g_list_remove (a11y_list, baa) ...

How about that ? ;-) if you add a GDK_THREADS_ENTER to the impl_accessible_get_child_count method the problem is gone :-)

Comment 106 Li Yuan 2007-03-07 15:04:12 UTC

OK, thanks :)
So I guess it is impossible to add threads guards in ORBit, right?

Comment 107 Michael Meeks 2007-03-07 15:22:00 UTC

you want me to add a toolkit specific thread guard inside ORBit ? :-) unfortunately, we would have the same problems there ... But I guess it might be possible to do something with a custom POA (in the abstract) that would take & release the lock as calls came in & executed - however: ORBit is not under active (feature) development really :-)

Comment 108 Li Yuan 2007-03-14 08:32:11 UTC

Hi Micheal, I have made the patch. But found ekiga still cannot startup. Because in ekiga's code:

gdk_threads_enter ();
...
GnomeMeeting::Process ()->BuildGUI ();
...
gtk_main ();

In BuildGUI, ekiga will call bonobo_activation_activate. And at this time, Orca calls in. In impl_accessibility_accessible_get_state, we call gdk_threads_enter again. And hang...

So application should make sure gdk_threads_leave first when they call some bonobo functions?

Comment 109 Li Yuan 2007-03-14 08:33:24 UTC

Created attachment 84560 [details]
freeze trace after "threads_enter" in impl*

Comment 110 Michael Meeks 2007-03-14 10:01:52 UTC

Urgh - un-controlled re-enterancy is a serious hazard. I think we bump against the limits of what it's possible to achieve with a non-recursive gdk-threads mutex. ie. here are the problems:

  * it's simply not possible (or correct) to guard every call that may do a CORBA call with a gdk_threads_unlock.
  * it's not possible to take / drop a lock in the ORB for each call (what lock)
  * it's not possible to detect if the thread guard is taken in the 1st place: people do immediate calls before running a mainloop / taking the lock anyway [ this is worked around by the idle handler of course, but it's not a pleasantly generalized solution ].
  * it's not possible to use the 'AT_IDLE' POA policy since we need to process incoming a11y calls during an outgoing event emission.

  ... so ... since we already have a pile of not so pleasant implementations here  - what we could do, is to stack some more up:

  By using the gdk_threads_set_lock_functions - we can substitute a recursive mutex for a non-recursive one - or (perhaps simpler) add a:

  thread_has_gdk_lock () type method - by hooking & proxying gdk_thread requests - we could (presumably) track what thread has the lock at any given time, and simply avoid taking it again if it's us (to catch these rather unpleasant & hard to detect scenarios).

  This may run into interesting problems with OO.o (the main user of that hook) - that is worth checking: I'll poke obr he'll want to follow this I think.

Comment 111 Oliver Braun 2007-03-14 11:24:34 UTC

(In reply to comment #110)
> - that is worth checking: I'll poke obr he'll want to follow this I think.

Yes, he does ;-).

Comment 112 Li Yuan 2007-03-15 05:49:08 UTC

gdk_threads_set_lock_functions must be called before gdk_threads_init, but in ekiga, they call gdk_threads_init before gnome_program_init. So we cannot set the lock functions...

Comment 113 Snark 2007-03-15 06:12:51 UTC

Now is the right time to discuss things ekiga does at startup, and in which order it should do them : we're adding a plugin system, so we will have to load them before some steps and after others.

Comment 114 Harry Lu 2007-03-15 07:41:19 UTC

I don't think we should call gdk_threads_set_lock_functions in bridge. An application could call it for its own purpose.

Should we file a bug for gdk team to let them add a gdk_thread_has_lock() function?

Comment 115 Matthias Clasen 2007-03-15 13:53:55 UTC

libraries or modules _cannot_ use gdk_threads_set_lock_functions().
that does not work.

Comment 116 Oliver Braun 2007-03-20 07:45:50 UTC

IIRC the OOo object model (UNO) avoids the original problem by dispatching callbacks from any (remote) call to the originating thread instead of the main thread. Don't know whether this would be feasable to do in Orbit2/CORBA though.

Comment 117 Michael Meeks 2007-03-20 09:24:09 UTC

Hi Oliver - I guess, we could try to add some thread affinity here; we do have a 'thread-per-poa' policy that could be adapted to group together all the a11y calls - but unfortunately, it starts a new thread for the 1st call, which would be rather sub-optimal. So - we'd require a new ORBit2 custom policy here (not -that- tough to write cf. ORBit2/src/orb/poa/poa.c & orbit-adaptor.c etc.)

Comment 118 Li Yuan 2007-03-22 03:43:15 UTC

For we do not have another way to fix this problem, I'd like to commit #84148 patch into trunk first. The bug is blocking ekiga's testing now. But we are still looking for the "perfect" patch.

Comment 119 Harry Lu 2007-03-22 06:55:04 UTC

committed #84148 into HEAD for now.

Comment 120 Mike Pedersen 2007-03-26 07:21:28 UTC

Hi all, this turns out to be a really helpful fix.  Not only does it fix ekiga's hang but also resolves hangs in the following applications.  wifi-radar gftp update-manager and gnome-bit torrent.  Thanks a lot! this really helps.  I'll keep testing for negative side effects but things are looking good so far.

Comment 121 Snark 2007-03-26 07:26:44 UTC

Glad to know we're not the only ones bitten by this :-)

Comment 122 Li Yuan 2007-05-09 06:17:55 UTC

*** Bug 418880 has been marked as a duplicate of this bug. ***

Comment 123 elaine 2007-10-12 12:03:44 UTC

When I debug another bug reported in Ekiga, I found this patch has potential threat to thread safety. "atk_misc_threads_leave(misc)" will unlock the thread mutex which is needed to protect the critical section. Once another thread has the chance to run between the leave and enter, thread safety can't be sured.