GNOME Bugzilla – Bug 552329
Crash on at-spi while calling to desktop_get_child_at_index
Last modified: 2011-06-02 18:25:27 UTC
Please describe the problem:
In certain environments, if you close an application and immediately try to list all the active applications (which is common in an automated testing script), desktop_get_child_at_index can cause a crash.
To make it easy to reproduce, I will attach a dogtail test script.
I should note that I was only able to reproduce this bug on an N810 device, while preparing automated tests. Setting it up requires several packages that are not included by default. A general installation description:
(Note: in any case, this is not an easy configuration.)
Steps to reproduce:
1. Configure a testing environment for an N810 device.
2. Run the script crash-atspi.py: "$ python crash-atspi.py"
3. Start a GTK/GNOME application
4. Close the application
Actual results:
at-spi crashes
Expected results:
at-spi keeps running properly
Does this happen every time?
It crashes with most GTK applications, but I found some exceptions, such as the Mah-Jong game.
I will attach a testing script
I will attach a provisional patch
Created attachment 118730 [details]
A testing dogtail script
This script basically asks at-spi for all the applications on desktop 0; that is, it continuously calls desktop.getChildAtIndex.
Created attachment 118731 [details] [review]
Fixes the bug
With this patch the problem no longer occurs in my specific environment.
IMHO this is a reentrancy problem, although I could be wrong, as I don't have much experience with this kind of problem.
Basically, when you close an application, the atexit callback is called. Among other things, it calls remove_application in desktop.c.
Meanwhile, the script is continuously calling get_child_at_index.
Looking at the behaviour, it seems that both run at the same time, so at-spi tries to get the child while the child is being removed.
I solved this with an approach I found in other at-spi code, such as the registry: a queue. I added a safeguard in get_child_at_index; while it is set, remove_application postpones the removal and adds the application to a list. After getting the child, get_child_at_index checks whether any application needs to be removed, and if the list has any elements, it flushes it.
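The deferred-removal approach described above can be sketched as a small Python model. This is not the at-spi C patch: the class, the `during_lookup` hook and the `pending_removals` list are hypothetical names used only for illustration, and `is_queueing_remove` is borrowed from the safeguard name mentioned later in this report.

```python
class Desktop:
    """Toy model of the desktop's application list (not the at-spi C code)."""

    def __init__(self, applications):
        self.applications = list(applications)
        self.is_queueing_remove = False   # safeguard set while a lookup runs
        self.pending_removals = []        # removals postponed by the safeguard

    def remove_application(self, app):
        if self.is_queueing_remove:
            # A lookup is in progress: postpone instead of changing the list.
            self.pending_removals.append(app)
            return
        if app in self.applications:
            self.applications.remove(app)

    def get_child_at_index(self, index, during_lookup=None):
        self.is_queueing_remove = True
        try:
            in_range = 0 <= index < len(self.applications)
            child = self.applications[index] if in_range else None
            if during_lookup is not None:
                # Hypothetical hook standing in for a re-entrant call that
                # arrives while the lookup is still being served.
                during_lookup()
        finally:
            self.is_queueing_remove = False
        self._flush_pending_removals()
        return child

    def _flush_pending_removals(self):
        # Swap the list out first so the loop always terminates.
        pending, self.pending_removals = self.pending_removals, []
        for app in pending:
            self.remove_application(app)


desktop = Desktop(["gedit", "gcalctool"])
child = desktop.get_child_at_index(
    0, during_lookup=lambda: desktop.remove_application("gedit"))
print(child, desktop.applications)
```

In this model the removal requested during the lookup only takes effect after the child has been returned, so the lookup never sees a half-removed entry.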
Last note: I could upload a backtrace too, but it is very uninformative, as it only shows that the crash is triggered by calling getChildAtIndex. I can upload it if anyone is interested.
I am curious why the crash is in get_child_at_index. Could you build at-spi with debug info (CFLAGS="-g -O0") and paste the trace? Maybe just adding a check there and returning NULL would be enough.
Gdb backtrace of the crash:
#0 0x41050e74 in raise () from /lib/libc.so.6
#1 0x41052450 in abort () from /lib/libc.so.6
#2 0x40156a20 in IA__g_logv (log_domain=0x0, log_level=1091791928, format=0x400aadf4 "Attempted to marshal a bogus / dead object %p type", args1=0xbedbe17c) at gmessages.c:502
#3 0x40156a60 in IA__g_log (log_domain=0x0, log_level=1767, format=0x400aadf4 "Attempted to marshal a bogus / dead object %p type") at gmessages.c:522
#4 0x4008a6ec in ORBit_marshal_object (buf=0x3bfe0, obj=0x40b08) at corba-object.c:564
#5 0x4008fc48 in ORBit_marshal_value (buf=0x3bfe0, val=0xbedbe1ec, tc=0x40052604) at corba-any.c:152
#6 0x4008fda8 in ORBit_marshal_arg (buf=0x0, val=0xbedbe230, tc=0x400aadf4) at corba-any.c:364
#7 0x4008843c in ORBit_small_invoke_adaptor (adaptor_obj=0x38c30, recv_buffer=0x3f910, m_data=0x40058860, data=0xbedbe2e0, ev=0xbedbe378) at orbit-small.c:907
#8 0x40096e48 in ORBit_POAObject_handle_request (pobj=0x38c30, opname=0x42b3c "getChildAtIndex", ret=0x0, args=0x0, ctx=0x0, recv_buffer=0x3f910, v=0xbedbe378) at poa.c:1354
#9 0x40097394 in ORBit_POAObject_invoke_incoming_request (pobj=0x38c30, recv_buffer=0x3f910, opt_ev=0xbedbe378) at poa.c:1422
#10 0x40097624 in ORBit_POA_handle_request (poa=0x20e70, recv_buffer=0x3f910, objkey=0x38c30) at poa.c:1644
#11 0x4009bd44 in ORBit_handle_request (orb=0x20dd0, recv_buffer=0x3f910) at orbit-adaptor.c:296
#12 0x40084eb0 in giop_connection_handle_input (lcnx=0x0) at giop-recv-buffer.c:1282
#13 0x400a2c9c in link_connection_io_handler (gioc=0x0, condition=G_IO_IN, data=0x400aadf4) at linc-connection.c:1367
#14 0x400a4aa4 in link_source_dispatch (source=0x410a0, callback=0x400a2bb4 <link_connection_io_handler>, user_data=0x32d98) at linc-source.c:159
#15 0x4014defc in IA__g_main_context_dispatch (context=0x1fdb8) at gmain.c:2045
#16 0x4014fda8 in g_main_context_iterate (context=0x1fdb8, block=1, dispatch=1, self=0x40b08) at gmain.c:2677
#17 0x4015016c in IA__g_main_loop_run (loop=0x3bf30) at gmain.c:2881
#18 0x429ee77c in bonobo_main () from /usr/lib/libbonobo-2.so.0
#19 0x0000efb8 in main (argc=1, argv=0x6e7) at registry-main.c:83
As I said in comment 2, I have no problem pasting a backtrace, but as far as I can see it is really uninformative: it only shows where the crash appears, with no further relevant information.
I also think it is really hard to fix this with a check in get_child_at_index alone. My first thought was to validate the index that impl_desktop_get_child_at_index receives, but the list still contains the item at that moment, so "in theory" the index is correct.
Looking at how the three programs interact (at-spi, the GTK program, and the AT program, i.e. the test), I think the sequence is as explained in comment 2, but in more detail:
1. call to _get_child_at_index (from the at-program)
2. call to _remove_application (from gtk-application, when it closes)
2.1 The item is freed on _remove_application
3. _get_child_at_index tries to interact with the deleted item
Anyway, as I said previously, I have little experience with reentrancy problems, so I could be wrong.
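The three-step sequence above can be modelled in a few lines of Python. This is a sketch of the hypothesis only, not a trace of the real at-spi code path: `App`, `UnguardedDesktop`, `marshal` and the `reentrant_call` hook are all hypothetical names, and the `alive` flag merely stands in for an object that has been freed.

```python
class DeadObjectError(RuntimeError):
    """Stands in for the 'bogus / dead object' abort seen in the backtrace."""


class App:
    def __init__(self, name):
        self.name = name
        self.alive = True


def marshal(app):
    # Toy stand-in for marshalling the reply; a freed item cannot be sent.
    if not app.alive:
        raise DeadObjectError(f"attempted to marshal a dead object: {app.name}")
    return app.name


class UnguardedDesktop:
    def __init__(self, apps):
        self.apps = list(apps)

    def remove_application(self, app):
        # Steps 2 and 2.1: the item is dropped and freed immediately.
        self.apps.remove(app)
        app.alive = False

    def get_child_at_index(self, index, reentrant_call=None):
        child = self.apps[index]      # Step 1: the index is still valid here.
        if reentrant_call is not None:
            reentrant_call()          # Step 2 runs before step 1 has finished.
        return marshal(child)         # Step 3: touches the freed item.


app = App("gedit")
desktop = UnguardedDesktop([app])
try:
    desktop.get_child_at_index(
        0, reentrant_call=lambda: desktop.remove_application(app))
except DeadObjectError as err:
    print(err)
```

The index check passes because the lookup happens before the removal, which matches the observation that validating the index alone would not catch the problem.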
It may be useful to set a breakpoint at spi_desktop_remove_application to see what get_child_at_index is doing at that moment. We can probably add a check in impl_desktop_get_child_at_index to avoid returning a dead object.
Sorry to ask for the trace again, but I cannot reproduce the bug.
(In reply to comment #6)
> Set breakpoint at spi_desktop_remove_application, to see what is
> get_child_at_index doing may be useful. Probably we can add a check in
> impl_desktop_get_child_at_index to avoid return a dead object.
Here you are: new backtraces from reproducing the bug again today:
break point at spi_desktop_remove_application
> Sorry to ask the trace again because I cannot reproduce the bug.
No problem, the environment is really specific. I tried to reproduce it myself on the desktop but was not able to. I filed the bug because this could be a problem everywhere, although right now it only happens in the specific environment I pointed out in the description.
Created attachment 126247 [details]
log from valgrind
When launching and exiting the app more quickly, at-spi crashed with:
do_unref: assertion failed: (robj->refs < ORBIT_REFCOUNT_MAX && robj->refs > 0)
The patch above does not fix this.
I have put a workaround there: a flag to avoid re-entering impl_registry_notify_event.
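A guard flag of the kind described above can be sketched as follows. This is a toy Python model, not the actual workaround: the `Registry` class, the `in_notify` flag name and the `listener` hook are hypothetical, and the choice to skip (rather than queue) a re-entrant call is an assumption, since the comment does not say which one the workaround does.

```python
class Registry:
    """Toy model; only the idea of a guard flag comes from the comment above."""

    def __init__(self):
        self.in_notify = False   # hypothetical name for the guard flag
        self.delivered = []
        self.skipped = []

    def notify_event(self, event, listener=None):
        if self.in_notify:
            # Re-entrant call: refuse to run the body a second time.
            self.skipped.append(event)
            return False
        self.in_notify = True
        try:
            self.delivered.append(event)
            if listener is not None:
                listener(self)   # a listener may call back into notify_event
        finally:
            # Always clear the flag, even if the listener raises.
            self.in_notify = False
        return True


registry = Registry()
registry.notify_event("outer", listener=lambda reg: reg.notify_event("inner"))
print(registry.delivered, registry.skipped)
```

Skipping the nested call keeps the handler from running twice on the same state, at the cost of losing the nested event; queueing it for later would be the alternative.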
Created attachment 149590 [details] [review]
Update previous patch
The previous patch had a silly but important error: if the queue held two or more elements, at-spi entered an infinite loop, so a11y support became inoperative.
The update also adds a g_list_free call, avoiding a memory leak.
Just a quick comment to complement the original description. The original description said that this bug was only detected on an N810 device.
Update: It has also been detected on a Nokia N900.
Thanks for the patch. There could be a problem if an AT only calls get_child_at_index when it receives signals like APPLICATION_REMOVED. The AT might then never find out that the application has been removed.
(In reply to comment #13)
> Thanks for the patch. There could be a problem if an AT only calls
> get_child_at_index when there are signals like APPLICATION_REMOVED. The AT
> could never know the application has been removed.
Yes, you are right, I didn't take that into account. This patch only works if get_child_at_index is called continuously, so that the queue gets flushed. It works in the environment where I detected the bug because that is the case there.
So get_child_at_index is in fact a bad trigger for flushing the applications.
Another option would be to add an idle handler when the queue is created, check the is_queueing_remove safeguard there, and call the flush.
In that case, removing the application would just be postponed until idle, instead of until the next get_child_at_index call.
If you think this would be a good option, I could work on a new patch.
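The idle-based alternative described above could look roughly like this. It is a sketch under assumptions: `ToyLoop` is a hypothetical stand-in for a main loop's idle queue (it follows the common convention that a callback returning True is scheduled again) and does not use GLib, and the method and attribute names other than `is_queueing_remove` are made up for illustration.

```python
from collections import deque


class ToyLoop:
    """Minimal idle queue; a callback returning True is scheduled again."""

    def __init__(self):
        self.idle = deque()

    def idle_add(self, callback):
        self.idle.append(callback)

    def run_idle_once(self):
        # Run each currently queued idle callback exactly once.
        for _ in range(len(self.idle)):
            callback = self.idle.popleft()
            if callback():
                self.idle.append(callback)


class Desktop:
    def __init__(self, loop, applications):
        self.loop = loop
        self.applications = list(applications)
        self.is_queueing_remove = False
        self.pending_removals = []

    def remove_application(self, app):
        if self.is_queueing_remove:
            if not self.pending_removals:
                # The queue is being created: schedule one idle flush.
                self.loop.idle_add(self._flush_when_idle)
            self.pending_removals.append(app)
            return
        if app in self.applications:
            self.applications.remove(app)

    def _flush_when_idle(self):
        if self.is_queueing_remove:
            return True    # a lookup is still running; try again later
        pending, self.pending_removals = self.pending_removals, []
        for app in pending:
            self.remove_application(app)
        return False       # done; do not reschedule
```

With this arrangement the flush no longer depends on another get_child_at_index call arriving. The model is single-threaded and does not show whether an idle callback could itself be interrupted by incoming requests.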
We could hit the same crash if we do it in an idle function. The idle function could also be interrupted by the continuous get_child_at_index calls. But I am not sure.
These days most people are not using at-spi on the N800 or N900 at all, and two years have passed since then.
I don't think it is worth keeping this bug open, so I will close it as WONTFIX. We can reopen it in the future if it is detected on the desktop, although honestly the desktop should be using at-spi2 these days.